

Research on Large-Scale Web-Based Extraction of Chinese Person Information

Published: 2018-07-20 15:17
[Abstract]: People increasingly rely on the Internet to retrieve information, and person-related information is one of the areas users search for most. This thesis aims to extract as much important person information as possible and to build a person-information knowledge base, usable either as the knowledge base of a people search engine or as the person-related portion of a semantic search engine's knowledge base. The web holds a vast amount of person information, but it comes in many formats, the content is disordered, and it is mixed with large amounts of junk; extracting accurate information from the Internet automatically and efficiently is therefore a complex task with many open problems. The thesis studies the complete pipeline from web-page collection, through body-text extraction and Chinese word segmentation, to the structuring of person information; each stage corresponds to one chapter.

First, web-page collection. The thesis details how sources of person-information pages are selected and how the pages are downloaded. Downloading is increasingly difficult: sites impose ever stricter limits on crawlers and deploy anti-crawling measures such as rate limits on requests from a single IP. The author wrote the download programs and, depending on the target site, used three download modes: general download, proxy download, and download of dynamically rendered pages.

Next, body-text extraction. After surveying related work on web-page text extraction, the thesis adopts a method based on statistics and the DOM. The statistics used are body-text length, hyperlink count, and the number of sentence-ending punctuation marks. For each container tag, the three values are counted and their ratios decide whether the tag holds body text, which is then extracted.

Then, word segmentation of the extracted text. Common segmentation systems fall short at entity recognition and are poorly suited to knowledge extraction and natural language processing. This thesis uses the segmentation system developed by the Institute of Thinking and Wisdom at Southwest Jiaotong University, which clearly outperforms other systems at entity recognition. The organization-name recognition algorithm was implemented by the author and is based on word-frequency statistics; the training data were compiled mainly from Baidu Baike entries. During training, the author counts how often each Baidu Baike entry title appears in the entry text and derives frequency statistics for the words that make up organization names. On this basis a mathematical model is built and the organization-name recognition algorithm is implemented.

Finally, structuring the person information. Person information on web pages is generally semi-structured or unstructured, and the last stage extracts both kinds and stores them as structured records. For semi-structured information, the body text is matched against a dictionary of person attributes and, with a few simple rules, the attribute values are extracted directly; the method is simple and effective. For unstructured information, a rule-based method is used: a trigger lexicon and a rule base are built, where the trigger lexicon maps basic person attributes to their trigger words and the rule base holds manually defined rules for extracting attribute values.
【Degree-granting institution】: Southwest Jiaotong University (西南交通大學)
【Degree level】: Master's
【Year conferred】: 2013
【CLC classification】: TP393.092; TP391.1



