Research and Implementation of Chinese Personal Name Recognition Based on Statistics and Rules

Published: 2018-08-17 10:38

[Abstract]: Chinese word segmentation is a foundational problem in Chinese information processing, with wide application in search engines, machine translation, information extraction, and text clustering. At present, the main factors limiting segmentation quality are ambiguous segmentation and the recognition of out-of-vocabulary words. Among out-of-vocabulary words, personal names are the most numerous and the hardest to recognize, so segmentation systems usually include a dedicated module for them. Improving personal name recognition not only raises segmentation accuracy but also greatly benefits information extraction and lexical analysis. This thesis studies the automatic recognition of personal names in modern Chinese text. Based on statistics gathered from a large-scale name sample database and a corpus, it analyzes the characters used in names and the words that bound names, summarizes the patterns in which they occur, and performs name recognition with a statistical model based on relative credibility together with a set of rules designed around the characteristics of the system itself.

Specifically, the work has three parts. First, it analyzes the resources used for name recognition: statistics are collected over a large-scale name database (4.8 million names) and a corpus (cumulative word frequency of 3 billion); the characteristics and patterns of name characters are summarized; the boundary information of names is analyzed in detail, and name boundary words are graded by their part of speech and meaning so that they serve as external attributes that aid recognition; the encyclopedia corpus used in this thesis is then compared with traditional corpora and its advantages are pointed out. Second, on the statistical side, a statistical model based on relative credibility is applied to the large-scale corpus, models are built and statistics collected for two special forms of names, and statistical information tables are built for each category of name character. Third, on the rule side, a set of rules is designed to extract candidate names and to correct recognition results. Finally, the thresholds and parameters used by the system are obtained statistically, the methods used during the research are compared experimentally, and the effectiveness of the statistical model and rules is verified. Tests on the January 1998 People's Daily corpus show that the system achieves high precision and recall, recognizes personal names well, and improves the accuracy of the overall segmentation system.
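The abstract does not define the relative-credibility model or list the rules, so the following is only a minimal sketch of how per-character statistics, graded boundary words, and a threshold might be combined. Everything in it is an assumption made for illustration: the toy counts, the boundary-word weights, the 0.6 threshold, the geometric-mean scoring, and the names `credibility`, `score_candidate`, and `find_names` are hypothetical and are not taken from the thesis.

```python
# Hypothetical toy statistics: character -> (count inside names, total count).
# Real values would come from a name database and a corpus; these are made up.
SURNAME_COUNTS = {"王": (900, 1000), "李": (850, 1000), "张": (800, 1000)}
GIVEN_COUNTS = {"小": (200, 1000), "明": (500, 1000), "伟": (600, 1000), "的": (1, 1000)}

# Hypothetical graded boundary words: a higher weight is a stronger name cue.
LEFT_BOUNDARY = {"记者": 0.3, "教授": 0.2}   # words that often precede a name
RIGHT_BOUNDARY = {"说": 0.3, "表示": 0.2}    # words that often follow a name


def credibility(counts, ch):
    """Assumed definition: share of a character's occurrences that fall inside names."""
    in_name, total = counts.get(ch, (0, 1))
    return in_name / total


def score_candidate(text, start, length):
    """Score text[start:start+length] as surname plus given-name characters."""
    surname = text[start]
    if surname not in SURNAME_COUNTS:
        return 0.0
    score = credibility(SURNAME_COUNTS, surname)
    for ch in text[start + 1:start + length]:
        score *= credibility(GIVEN_COUNTS, ch)
    # Geometric mean keeps two- and three-character candidates comparable.
    score **= 1.0 / length
    # Rule-like adjustment: add the weight of any adjacent boundary word.
    bonus = 0.0
    for word, weight in LEFT_BOUNDARY.items():
        if text[:start].endswith(word):
            bonus += weight
    for word, weight in RIGHT_BOUNDARY.items():
        if text.startswith(word, start + length):
            bonus += weight
    return min(1.0, score + bonus)


def find_names(text, threshold=0.6):
    """Scan left to right and keep the best candidate at or above the threshold."""
    names = []
    i = 0
    while i < len(text):
        best = None
        for length in (3, 2):  # longer candidate first, so it wins ties
            if i + length <= len(text):
                s = score_candidate(text, i, length)
                if s >= threshold and (best is None or s > best[1]):
                    best = (text[i:i + length], s)
        if best:
            names.append(best[0])
            i += len(best[0])
        else:
            i += 1
    return names


print(find_names("记者王小明说今天下雨"))  # ['王小明']
print(find_names("国王的新衣"))            # []
```

In the first sentence the boundary words on both sides lift the three-character candidate above the shorter one. In the second, the surname character 王 is followed by a character that almost never occurs in names, so no candidate reaches the threshold.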
[Degree-granting institution]: Southwest Jiaotong University
[Degree]: Master's
[Year conferred]: 2013
[Classification number]: TP391.1


Article ID: 2187344

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2187344.html
