Research on Large-Scale Web-Based Extraction of Chinese Person Information
[Abstract]: People rely increasingly on the Internet to retrieve information. This thesis is devoted to extracting person information from the Web at scale and constructing a knowledge base of person information, which can serve both as the back end of a people search engine and as the person-related portion of a semantic search engine's knowledge base. Person information is abundant on the Web, but its formats vary widely, its content is disorganized, and it is mixed with large amounts of junk; extracting accurate information automatically and efficiently is therefore a complex problem with many open issues.

This thesis studies the complete pipeline from data collection, through body-text extraction and Chinese word segmentation, to the structuring of person information. Each stage corresponds to a chapter of the thesis.

The first stage is web data collection. The thesis describes in detail how source pages containing person information are selected and how they are downloaded. Downloading web pages has become increasingly difficult: sites impose ever-stricter limits on crawler programs and deploy various anti-crawling measures, such as rate limits on requests from the same IP address. The author implemented a downloader that chooses among three strategies depending on the target site: plain download, download through a proxy, and download of dynamically rendered pages.

The second stage is extracting the body text of each page. The thesis surveys prior work on web page text extraction and adopts a combination of statistical and DOM-based methods. Three statistics are computed: the length of plain (non-anchor) text, the number of hyperlinks, and the number of sentence-ending punctuation marks.
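The statistical scoring described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the set of container tags, the scoring weights, and the punctuation set are all assumptions. Per container element, the parser accumulates the three statistics (text length, hyperlink count, ending-punctuation count) and then picks the element that looks most like body text.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as body-text containers.
CONTAINER_TAGS = {"div", "td", "p", "article", "section"}
# Sentence-ending punctuation, Chinese and Western.
END_PUNCT = set("。!?.!?")


class BodyTextScorer(HTMLParser):
    """Accumulate, per container element, the three statistics used to
    locate the body text: plain-text length, hyperlink count, and
    sentence-ending punctuation count."""

    def __init__(self):
        super().__init__()
        self.stack = []        # frames for currently open containers
        self.candidates = []   # (text, length, links, punct) per closed container

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.stack.append({"text": [], "links": 0, "punct": 0})
        if tag == "a":
            for frame in self.stack:       # an <a> counts for every enclosing container
                frame["links"] += 1

    def handle_data(self, data):
        for frame in self.stack:
            frame["text"].append(data)
            frame["punct"] += sum(1 for ch in data if ch in END_PUNCT)

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.stack:
            frame = self.stack.pop()
            text = "".join(frame["text"]).strip()
            self.candidates.append((text, len(text), frame["links"], frame["punct"]))


def extract_body(html: str) -> str:
    """Pick the container whose statistics look most like body text:
    long text, many sentence endings, few links (weights are assumed)."""
    parser = BodyTextScorer()
    parser.feed(html)
    best, best_score = "", float("-inf")
    for text, length, links, punct in parser.candidates:
        score = length + 20 * punct - 50 * links
        if score > best_score:
            best, best_score = text, score
    return best
```

In this sketch a navigation block (short text, many links, no sentence endings) scores far below a content block, which is the intuition behind comparing the three counts per tag.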
For each container tag, these three values are counted, and the ratios among tags are used to decide which tag holds the body text, which is then extracted.

The third stage is word segmentation of the page text. Off-the-shelf segmentation systems are poorly suited to knowledge extraction and natural language processing tasks; the segmentation system developed by the Institute of Thinking and Wisdom at Southwest Jiaotong University outperforms other systems in entity recognition. The author implemented its organization-name recognition algorithm, which is based on word frequency statistics. Training data were collected mainly from Baidu Baike (Baidu encyclopedia) entries: the author computed frequency statistics over organization names appearing in entry texts and over the component words of organization names, built a mathematical model on this basis, and implemented the recognition algorithm.

The final stage is structuring the person information found on each page. Person information on the Web is generally semi-structured or unstructured, and the final extraction task is to convert both kinds into structured person records. For semi-structured information, the text is matched against a dictionary of person attributes, and attribute values are then extracted directly with simple rules; this method is simple and effective. For unstructured information, a rule-based extraction method is adopted: a trigger lexicon and a rule base are built, where the trigger lexicon maps basic person attributes to their corresponding trigger words, and the rule base consists of manually defined rules for extracting attribute values.
【Degree-granting institution】: Southwest Jiaotong University (西南交通大學(xué))
【Degree level】: Master's
【Year conferred】: 2013
【CLC classification】: TP393.092; TP391.1