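Research on Chinese Named Entity Recognition Algorithms
Published: 2018-03-05 08:55
Topic: Chinese named entity recognition. Focus: hybrid model. Source: Zhejiang University, 2017 master's thesis. Document type: degree thesis.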
[Abstract]: Named entity recognition (NER) is the task of identifying entities with specific meaning in text, mainly person names, place names, and organization names. It is an important technique for converting unstructured data into structured data, a key step in enabling computers to understand text correctly, and a foundational task for natural language processing applications such as information extraction, sentiment analysis, and question answering, so research on NER is of considerable significance. Because of the characteristics of the Chinese language, however, Chinese NER still faces many difficulties. The main ones are: (1) Chinese NER usually relies on a single model, and each such model has its own strengths, weaknesses, and limitations. (2) Chinese NER usually operates on word sequences and therefore requires Chinese word segmentation, so its performance often depends on segmentation accuracy. The research content and main work of this thesis are as follows: (1) It surveys related work on NER in China and abroad, summarizes and implements the mainstream NER methods, and analyzes and compares their strengths and weaknesses, which provides direction for the rest of the thesis. (2) To overcome the limitations of a single model, it combines multiple models with multi-task learning for Chinese NER. The resulting method, BiLSTM-CRF-MTL, addresses the shortcomings of a single model well and needs little feature engineering, because the model learns features from several related tasks. (3) To address the problems of word-sequence-based recognition, it performs Chinese NER on character sequences, introduces word vectors built from an external corpus and new-word discovery, and uses a keyword-extraction-based word segmentation confidence as a feature to mitigate the noise introduced by word segmentation. (4) To help the model fit context better and to ease the shortage of labeled samples, it proposes a sample generation method based on entity-word replacement. The thesis evaluates Chinese NER on the 1998 People's Daily corpus and compares the proposed approach with several single-model recognition methods and with methods from the related literature. The experimental results show that the proposed method achieves an average F1 of 88.79%, a substantial improvement over the other methods.
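The abstract names a sample generation method based on entity-word replacement but does not describe how it works. The following is a minimal Python sketch of one plausible reading, not the thesis's actual procedure: each entity in a labeled sentence is swapped for a different entity of the same type drawn from the training data, and the tags are rebuilt. The character-level BIO tag scheme, the function names (`extract_spans`, `build_lexicon`, `replace_entities`), and the toy sentences are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (character, BIO tag), e.g. ("北", "B-LOC"); assumed scheme


def extract_spans(sentence: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, entity_type) for each BIO-tagged entity."""
    spans = []
    start, etype = None, None
    for i, (_, tag) in enumerate(sentence):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))  # current entity ends here
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(sentence), etype))
    return spans


def build_lexicon(corpus: List[List[Token]]) -> Dict[str, List[str]]:
    """Collect the entity strings seen in the corpus, grouped by entity type."""
    lexicon: Dict[str, set] = {}
    for sentence in corpus:
        for start, end, etype in extract_spans(sentence):
            text = "".join(ch for ch, _ in sentence[start:end])
            lexicon.setdefault(etype, set()).add(text)
    return {etype: sorted(names) for etype, names in lexicon.items()}


def replace_entities(sentence: List[Token],
                     lexicon: Dict[str, List[str]],
                     rng: random.Random) -> List[Token]:
    """Build a new sample by swapping each entity for another of the same type."""
    out: List[Token] = []
    cursor = 0
    for start, end, etype in extract_spans(sentence):
        out.extend(sentence[cursor:start])  # copy the context unchanged
        original = "".join(ch for ch, _ in sentence[start:end])
        candidates = [n for n in lexicon.get(etype, []) if n != original]
        new_text = rng.choice(candidates) if candidates else original
        # Re-tag the replacement character by character.
        out.extend((ch, ("B-" if i == 0 else "I-") + etype)
                   for i, ch in enumerate(new_text))
        cursor = end
    out.extend(sentence[cursor:])
    return out


# Toy corpus (hypothetical data, not from the People's Daily corpus).
corpus = [
    [("张", "B-PER"), ("三", "I-PER"), ("在", "O"), ("北", "B-LOC"), ("京", "I-LOC")],
    [("李", "B-PER"), ("小", "I-PER"), ("明", "I-PER"), ("去", "O"),
     ("上", "B-LOC"), ("海", "I-LOC")],
]
lexicon = build_lexicon(corpus)
augmented = replace_entities(corpus[0], lexicon, random.Random(0))
print("".join(ch for ch, _ in augmented))  # prints 李小明在上海
```

Because the replacement keeps the surrounding context and only varies the entity string, each generated sentence gives the tagger a new entity in a context pattern it has already seen. Whether the thesis samples replacements from the training set, an external gazetteer, or some other source is not stated in the abstract.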
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
Article ID: 1569565
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1569565.html