Research and Implementation of Chinese Person Name Recognition Based on Statistics and Rules
[Abstract]: Chinese word segmentation is a basic problem in Chinese information processing, with wide use in search engines, machine translation, information extraction, text clustering and similar applications. At present, the main factors that limit segmentation quality are ambiguous segmentation and the recognition of out-of-vocabulary (unregistered) words; of the two, out-of-vocabulary words are the more numerous and the harder to recognize. Word segmentation systems therefore include a dedicated module for recognizing person names. Improving the quality of person name recognition not only raises segmentation accuracy but also helps information extraction and lexical analysis.
This thesis focuses on the automatic recognition of person names in modern Chinese text. Using statistics from a large-scale name sample database and a corpus, it analyzes the characters used in person names and the boundary words that surround them, and summarizes the patterns in which name characters and boundary words occur. Names are then recognized with a statistical model based on relative credibility together with a set of rules designed around the characteristics of the system.
Specifically, the work has three parts. First, it analyzes the resources used for person name recognition and compiles statistics on a large-scale name bank (4.8 million names) and a corpus (cumulative word frequency of 3 billion). It summarizes the characteristics and patterns of the characters used in personal names, analyzes the boundary information of names in detail, classifies boundary words by part of speech and meaning, and uses them as external features of names to aid recognition.
Second, it compares the encyclopedia corpus with a traditional corpus and points out the advantages of the former, then computes statistics over the large-scale corpus with the statistical model based on relative credibility. It also builds models and statistics for two special forms of names and compiles statistical information tables for each class of character. A set of rules is designed to extract candidate names and to correct the recognition results.
Third, the thresholds and parameters of the system are obtained statistically, the methods used in the research are compared experimentally, and the validity of the statistical model and rules is verified. The system was tested on the People's Daily corpus of January 1998. The experimental results show that it achieves high precision and recall, recognizes person names well, and improves the accuracy of the overall word segmentation system.
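The abstract does not define "relative credibility", so the following is only a minimal sketch of the general idea of combining per-character statistics with a boundary-word rule and a threshold. It assumes the score of a character is the share of its corpus occurrences that fall in a name role; all counts, word lists, the threshold and the bonus are invented for illustration and are not taken from the thesis.

```python
import math

# Toy counts, invented for illustration only (not from the thesis).
# SURNAME_COUNT / GIVEN_COUNT: occurrences of a character in that role in a name bank.
# CORPUS_COUNT: occurrences of the character in a general corpus.
SURNAME_COUNT = {"王": 900, "李": 850, "张": 800}
GIVEN_COUNT = {"伟": 400, "芳": 350, "明": 300, "的": 1}
CORPUS_COUNT = {"王": 1000, "李": 1000, "张": 1200,
                "伟": 500, "芳": 400, "明": 900, "的": 50000}

# Hypothetical boundary words that often precede or follow a person name.
LEFT_BOUNDARY = {"记者", "主席", "教授"}
RIGHT_BOUNDARY = {"说", "表示", "指出"}


def ratio(char, role_count):
    """Share of a character's corpus occurrences that fall in the given name role."""
    total = CORPUS_COUNT.get(char, 0)
    if total == 0:
        return 0.0
    return min(1.0, role_count.get(char, 0) / total)


def name_score(candidate):
    """Geometric mean of per-character ratios (assumed stand-in for credibility).

    The first character is scored as a surname, the rest as given-name characters.
    Only two- and three-character candidates are considered in this sketch.
    """
    if not 2 <= len(candidate) <= 3:
        return 0.0
    parts = [ratio(candidate[0], SURNAME_COUNT)]
    parts += [ratio(c, GIVEN_COUNT) for c in candidate[1:]]
    if min(parts) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in parts) / len(parts))


def is_name(candidate, left="", right="", threshold=0.5, boundary_bonus=0.1):
    """Accept a candidate if its score, plus a bonus for a boundary word, passes the threshold."""
    score = name_score(candidate)
    if left in LEFT_BOUNDARY or right in RIGHT_BOUNDARY:
        score += boundary_bonus
    return score >= threshold


print(is_name("王伟"))              # scored on character statistics alone
print(is_name("张明"))              # borderline candidate without context
print(is_name("张明", right="说"))  # same candidate followed by a boundary word
```

In this sketch the boundary word acts as external evidence that can lift a borderline candidate over the threshold, while a candidate containing a very common function character stays far below it. A real system would estimate the threshold and the bonus from data, as the abstract says the thesis does for its own parameters.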
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1
Article ID: 2187344
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2187344.html