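Automatic Recognition of Personal Names Based on a Mongolian Corpus
Topics: Mongolian information processing + Modern Mongolian Annotation Specification for Corpora; Source: doctoral dissertation, Minzu University of China, 2013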
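[Abstract (translated from Chinese)]: Automatic recognition of Mongolian personal names is a subtask of named entity recognition.
Chinese and English information processing has developed for half a century and has made great progress in basic resource construction, part-of-speech tagging, information retrieval, text classification, machine translation, language recognition and synthesis, human-computer dialogue, and other areas. This progress has also profoundly promoted the theory and technology of minority-language information processing in China.
Mongolian information processing started somewhat later than its Chinese and English counterparts, but it has achieved outstanding results within minority-language information processing. It has largely completed the character- and word-processing stages and has entered the sentence-processing stage: shallow parsing tasks such as identifying phrase structure relations and delimiting phrase boundaries have been completed, work is moving toward deep syntactic analysis, and research on Mongolian information retrieval, automatic summarization, text classification, and machine translation is growing rapidly.
Mongolian lexical analysis and tagging are of great importance to the study of phrases, syntax, semantics, and discourse. Within this foundational stage, however, research on recognizing out-of-vocabulary words, especially named entities, has not flourished. This weakness has consistently limited the accuracy of lexical analysis and, in turn, the development of phrase analysis, syntactic analysis, information retrieval, machine translation, and other fields.
Proper nouns are an important part of a corpus, and a breakthrough in proper noun recognition is an important basis for improving the accuracy of Mongolian lexical analysis and of subsequent work. Ambiguity and out-of-vocabulary words are the two major obstacles to segmentation accuracy; out-of-vocabulary words include new words and named entities such as personal names and place names. This dissertation presents research on the automatic recognition of Mongolian personal names, covering both ordinary names and multi-category names (names that also function as ordinary words), and therefore has considerable academic and practical value.
Mongolian texts contain a large number of personal names, multi-category usage is common, studies of Mongolian personal names are scarce, and few ready-made theories or techniques are available for reference. Mongolian personal name recognition therefore faces many difficulties, mainly the following:
☆ Personal names form an open set and cannot be enumerated exhaustively. Multi-category usage is severe in Mongolian names: the more common a word is, the more often it is also used as a name. Nouns, verbs, adjectives, numerals, time words, adverbs, pronouns, and imitative words can all serve as names, which makes name recognition very difficult.
☆ Deeply processed Mongolian corpora are still small compared with Chinese and English ones, which inevitably affects the use of statistical methods. Inner Mongolia University currently holds a deeply processed corpus of 2 million words, whereas we use a corpus of 260,000 words; this size limits rule extraction and machine learning.
☆ Proper noun recognition has always been a difficult point in Mongolian lexical analysis and tagging, and personal names readily overlap with place names and other proper nouns, so ambiguity among proper nouns is a further difficulty.
This dissertation uses the maximum entropy statistical method to recognize Mongolian personal names. Building on earlier, mainly rule-based research, it applies the maximum entropy model to Mongolian named entity recognition and implements an automatic recognition system for Mongolian personal names. Its innovations and contributions are mainly the following:
◇ A Mongolian personal name recognition corpus was built for the first time. Mongolian corpora have reached a certain scale, which has helped Mongolian information processing develop. So far, however, no corpus dedicated to Mongolian personal name recognition has been built in China or abroad. We crawled 5,773 Mongolian sentences containing personal names from the web and used them together with the Inner Mongolia University corpus to train the recognition model and test the automatic recognition results, which effectively compensates for the lack of corpus data.
◇ The internal and external structure of Mongolian personal names was studied systematically. We analyzed in depth the ethnic, period, regional, and gender characteristics of Mongolian names, summarized their internal composition patterns, and interpreted their structural types and features as well as the surnames specific to the Mongolian people and the origins of those surnames.
◇ Annotation and transliteration specifications for Mongolian corpora were proposed. After analyzing the current state of Mongolian corpus annotation, we proposed the "Modern Mongolian Annotation Specification for Corpora". To address the many problems in annotating Chinese personal names, we also proposed a detailed "Latin transliteration scheme for Chinese personal names", based on the established conventions for annotating loanwords in Mongolian and with reference to the "Modern Mongolian Corpus Annotation Specification".
◇ A knowledge base for name recognition was built. For ordinary names, it comprises dictionaries and mapping tables including a Chinese surname dictionary, a Mongolian surname dictionary, a dictionary of ordinary Mongolian personal names, a Chinese surname Latin mapping table, a Chinese personal name Latin mapping table, a dictionary of Sanskrit, Tibetan, and Manchu names, a dictionary of famous people, a name-indicator lexicon, a place name dictionary, a place-name suffix dictionary, and an organization-name suffix dictionary. For multi-category names, it comprises a multi-category name dictionary, a multi-category word collocation dictionary, and a Mongolian name stem dictionary.
◇ An automatic recognition system for Mongolian personal names was designed and implemented. As one of the earlier studies in China or abroad to apply statistical methods to Mongolian named entity recognition, this work achieved a precision of 94.56%, a recall of 85.15%, and an F value of 89.61% in closed testing, a fairly satisfactory result.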
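The abstract does not give the feature templates, tag set, or training algorithm used in the dissertation, so the following is only a minimal sketch of how a conditional maximum entropy classifier can label a token as a personal name from its context. Everything in it is an assumption made for illustration: the two-label scheme (`NAME`/`O`), the feature set, the one-entry `NAME_CUE_WORDS` lexicon (a stand-in for the name-indicator lexicon mentioned in the abstract), the batch gradient-ascent trainer, and the toy romanized tokens, which are not taken from the dissertation's corpus or transliteration scheme.

```python
import math
from collections import defaultdict

LABELS = ("NAME", "O")
# Hypothetical stand-in for a name-indicator lexicon (words that tend to follow a name).
NAME_CUE_WORDS = {"bagsi"}

def features(tokens, i):
    """Binary context features for the token at position i (illustrative feature set)."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    feats = ["bias", "w=" + tokens[i], "prev=" + prev_tok, "next=" + next_tok]
    if next_tok in NAME_CUE_WORDS:
        feats.append("next_is_name_cue")
    return feats

def predict_proba(weights, feats):
    """Maxent form: p(y | x) = exp(sum of weights of active (feature, y) pairs) / Z(x)."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(samples, epochs=300, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-regularized conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, feats)
            for f in feats:
                for y in LABELS:
                    # observed feature count minus the count expected under the model
                    grad[(f, y)] += (1.0 if y == gold else 0.0) - probs[y]
        for key, g in grad.items():
            weights[key] += lr * (g / len(samples) - l2 * weights[key])
    return weights

# Toy training data: the same word appears both as a name and as an ordinary word.
TRAIN = [
    (["batu", "bagsi", "irebe"], ["NAME", "O", "O"]),
    (["checheg", "bagsi", "yabuba"], ["NAME", "O", "O"]),
    (["batu", "modu"], ["O", "O"]),
    (["checheg", "delgerebe"], ["O", "O"]),
    (["ene", "checheg"], ["O", "O"]),
]
samples = [(features(toks, i), tags[i])
           for toks, tags in TRAIN for i in range(len(toks))]
weights = train(samples)

def p_name(tokens, i):
    """Model probability that the token at position i is a personal name."""
    return predict_proba(weights, features(tokens, i))["NAME"]

print(round(p_name(["batu", "bagsi", "yabuba"], 0), 2))  # "batu" followed by a cue word
print(round(p_name(["ene", "batu", "modu"], 1), 2))      # "batu" in an ordinary-word context
```

The point of the sketch is the multi-category problem the abstract describes: the word itself is ambiguous, so the decision has to come from context features, and dictionary or lexicon lookups enter the model simply as additional binary features. Classical maxent toolkits often train with iterative scaling instead of gradient ascent; both optimize the same conditional likelihood.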
[Abstract]:Automatic recognition of Mongolian personal names is a subtask of named entity recognition.
Chinese and English information processing has developed for half a century and has made great progress in basic resource construction, part-of-speech tagging, information retrieval, text classification, machine translation, language recognition and synthesis, human-computer dialogue, and other areas. This progress has also profoundly promoted the theory and technology of minority-language information processing in China.
Mongolian information processing started somewhat later than its Chinese and English counterparts, but it has achieved outstanding results within minority-language information processing. It has largely completed the character- and word-processing stages and has entered the sentence-processing stage: shallow parsing tasks such as identifying phrase structure relations and delimiting phrase boundaries have been completed, work is moving toward deep syntactic analysis, and research on Mongolian information retrieval, automatic summarization, text classification, and machine translation is growing rapidly.
Mongolian lexical analysis and tagging are of great importance to the study of phrases, syntax, semantics, and discourse. Within this foundational stage, however, research on recognizing out-of-vocabulary words, especially named entities, has not flourished. This weakness has consistently limited the accuracy of lexical analysis and, in turn, the development of phrase analysis, syntactic analysis, information retrieval, machine translation, and other fields.
Proper nouns are an important part of a corpus, and a breakthrough in proper noun recognition is an important basis for improving the accuracy of Mongolian lexical analysis and of subsequent work. Ambiguity and out-of-vocabulary words are the two major obstacles to segmentation accuracy; out-of-vocabulary words include new words and named entities such as personal names and place names. This dissertation presents research on the automatic recognition of Mongolian personal names, covering both ordinary names and multi-category names (names that also function as ordinary words), and therefore has considerable academic and practical value.
Mongolian texts contain a large number of personal names, multi-category usage is common, studies of Mongolian personal names are scarce, and few ready-made theories or techniques are available for reference. Mongolian personal name recognition therefore faces many difficulties, mainly the following:
Personal names form an open set and cannot be enumerated exhaustively. Multi-category usage is severe in Mongolian names: the more common a word is, the more often it is also used as a name. Nouns, verbs, adjectives, numerals, time words, adverbs, pronouns, and imitative words can all serve as names, which makes name recognition very difficult.
Deeply processed Mongolian corpora are still small compared with Chinese and English ones, which inevitably affects the use of statistical methods. Inner Mongolia University currently holds a deeply processed corpus of 2 million words, whereas we use a corpus of 260,000 words; this size limits rule extraction and machine learning.
Proper noun recognition has always been a difficult point in Mongolian lexical analysis and tagging, and personal names readily overlap with place names and other proper nouns, so ambiguity among proper nouns is a further difficulty.
This dissertation uses the maximum entropy statistical method to recognize Mongolian personal names. Building on earlier, mainly rule-based research, it applies the maximum entropy model to Mongolian named entity recognition and implements an automatic recognition system for Mongolian personal names. Its innovations and contributions are mainly the following:
A Mongolian personal name recognition corpus was built for the first time.
Mongolian corpora have reached a certain scale, which has helped Mongolian information processing develop. So far, however, no corpus dedicated to Mongolian personal name recognition has been built in China or abroad. We crawled 5,773 Mongolian sentences containing personal names from the web and used them together with the Inner Mongolia University corpus to train the recognition model and test the automatic recognition results, which effectively compensates for the lack of corpus data.
The internal and external structure of Mongolian personal names was studied systematically.
We analyzed in depth the ethnic, period, regional, and gender characteristics of Mongolian names, summarized their internal composition patterns, and interpreted their structural types and features as well as the surnames specific to the Mongolian people and the origins of those surnames.
Annotation and transliteration specifications for Mongolian corpora were proposed.
After analyzing the current state of Mongolian corpus annotation, we proposed the "Modern Mongolian Annotation Specification for Corpora". To address the many problems in annotating Chinese personal names, we also proposed a detailed "Latin transliteration scheme for Chinese personal names", based on the established conventions for annotating loanwords in Mongolian and with reference to the "Modern Mongolian Corpus Annotation Specification".
A knowledge base for name recognition was built.
To recognize Mongolian personal names automatically, we built a knowledge base for ordinary names made up of dictionaries and mapping tables, including a Chinese surname dictionary, a Mongolian surname dictionary, a dictionary of ordinary Mongolian personal names, a Chinese surname Latin mapping table, a Chinese personal name Latin mapping table, a dictionary of Sanskrit, Tibetan, and Manchu names, a dictionary of famous people, a name-indicator lexicon, a place name dictionary, a place-name suffix dictionary, and an organization-name suffix dictionary. We also built a knowledge base for multi-category names, comprising a multi-category name dictionary, a multi-category word collocation dictionary, and a Mongolian name stem dictionary.
An automatic recognition system for Mongolian personal names was designed and implemented.
As one of the earlier studies in China or abroad to apply statistical methods to Mongolian named entity recognition, this work achieved a precision of 94.56%, a recall of 85.15%, and an F value of 89.61% in closed testing, a fairly satisfactory result.
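The abstract does not say which F-measure is reported. Assuming it is the balanced F1 (the harmonic mean of precision and recall), the reported figures can be checked with a few lines of arithmetic; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Both arguments must use the same scale (both percentages or both fractions).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Figures reported for the closed test, in percent.
reported_precision = 94.56
reported_recall = 85.15
reported_f = 89.61

computed_f = f1_score(reported_precision, reported_recall)
print(f"{computed_f:.2f}")  # 89.61
```

Under that assumption the three reported numbers are mutually consistent: 2 × 94.56 × 85.15 / (94.56 + 85.15) ≈ 89.61.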
[Degree-granting institution]: Minzu University of China
[Degree level]: Doctoral
[Year of degree]: 2013
[Classification number]: H212; H087