Research on Large-Scale Web-Based Extraction of Chinese Person Information
[Abstract]: People rely increasingly on the Internet to retrieve information. This thesis is devoted to extracting person information from the Web at scale and constructing a knowledge base of person information, which can serve both as the back end of a people search engine and as the person-related portion of a semantic search engine's knowledge base. Person information is abundant on the Web, but its formats vary widely, its content is disorganized, and it is mixed with large amounts of junk; extracting accurate information automatically and efficiently is therefore a complex problem with many open issues.

This thesis studies the complete pipeline from data collection, through body-text extraction and Chinese word segmentation, to the structuring of person information. Each stage corresponds to a chapter of the thesis.

The first stage is web data collection. The thesis describes in detail how source pages containing person information are selected and how they are downloaded. Downloading web pages has become increasingly difficult: sites impose ever-stricter limits on crawler programs and deploy various anti-crawling measures, such as rate limits on requests from the same IP address. The author implemented a downloader that chooses among three strategies depending on the target site: plain download, download through a proxy, and download of dynamically rendered pages.

The second stage is extracting the body text of each page. The thesis surveys prior work on web page text extraction and adopts a combination of statistical and DOM-based methods. Three statistics are computed: the length of plain (non-anchor) text, the number of hyperlinks, and the number of sentence-ending punctuation marks.
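The statistical scoring described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the set of container tags, the scoring weights, and the punctuation set are all assumptions. Per container element, the parser accumulates the three statistics (text length, hyperlink count, ending-punctuation count) and then picks the element that looks most like body text.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as body-text containers.
CONTAINER_TAGS = {"div", "td", "p", "article", "section"}
# Sentence-ending punctuation, Chinese and Western.
END_PUNCT = set("。!?.!?")


class BodyTextScorer(HTMLParser):
    """Accumulate, per container element, the three statistics used to
    locate the body text: plain-text length, hyperlink count, and
    sentence-ending punctuation count."""

    def __init__(self):
        super().__init__()
        self.stack = []        # frames for currently open containers
        self.candidates = []   # (text, length, links, punct) per closed container

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.stack.append({"text": [], "links": 0, "punct": 0})
        if tag == "a":
            for frame in self.stack:       # an <a> counts for every enclosing container
                frame["links"] += 1

    def handle_data(self, data):
        for frame in self.stack:
            frame["text"].append(data)
            frame["punct"] += sum(1 for ch in data if ch in END_PUNCT)

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.stack:
            frame = self.stack.pop()
            text = "".join(frame["text"]).strip()
            self.candidates.append((text, len(text), frame["links"], frame["punct"]))


def extract_body(html: str) -> str:
    """Pick the container whose statistics look most like body text:
    long text, many sentence endings, few links (weights are assumed)."""
    parser = BodyTextScorer()
    parser.feed(html)
    best, best_score = "", float("-inf")
    for text, length, links, punct in parser.candidates:
        score = length + 20 * punct - 50 * links
        if score > best_score:
            best, best_score = text, score
    return best
```

In this sketch a navigation block (short text, many links, no sentence endings) scores far below a content block, which is the intuition behind comparing the three counts per tag.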
For each container tag, these three values are counted, and the ratios among tags are used to decide which tag holds the body text, which is then extracted.

The third stage is word segmentation of the page text. Off-the-shelf segmentation systems are poorly suited to knowledge extraction and natural language processing tasks; the segmentation system developed by the Institute of Thinking and Wisdom at Southwest Jiaotong University outperforms other systems in entity recognition. The author implemented its organization-name recognition algorithm, which is based on word frequency statistics. Training data were collected mainly from Baidu Baike (Baidu encyclopedia) entries: the author computed frequency statistics over organization names appearing in entry texts and over the component words of organization names, built a mathematical model on this basis, and implemented the recognition algorithm.

The final stage is structuring the person information found on each page. Person information on the Web is generally semi-structured or unstructured, and the final extraction task is to convert both kinds into structured person records. For semi-structured information, the text is matched against a dictionary of person attributes, and attribute values are then extracted directly with simple rules; this method is simple and effective. For unstructured information, a rule-based extraction method is adopted: a trigger lexicon and a rule base are built, where the trigger lexicon maps basic person attributes to their corresponding trigger words, and the rule base consists of manually defined rules for extracting attribute values.
【Degree-granting institution】: Southwest Jiaotong University (西南交通大學(xué))
【Degree level】: Master's
【Year conferred】: 2013
【CLC classification】: TP393.092; TP391.1