Research on Chinese Named Entity Recognition Algorithms
Published: 2018-03-05 08:55
Topic: Chinese named entity recognition  Focus: hybrid model  Source: Zhejiang University, 2017 master's thesis  Document type: degree thesis
[Abstract]: Named entity recognition (NER) identifies entities with specific meaning in text, mainly person names, place names, and organization names. It is an important technique for converting unstructured data into structured data, a key step in enabling computers to understand text correctly, and a foundational task for natural language processing applications such as information extraction, sentiment analysis, and question answering; research on it is therefore of considerable significance. Owing to the characteristics of the Chinese language, however, Chinese NER still faces many difficulties, chiefly: (1) Chinese NER usually relies on a single model, and each such model has its own strengths, weaknesses, and limitations. (2) Chinese NER usually operates on word sequences and therefore requires Chinese word segmentation, so its performance often depends on segmentation accuracy. The research content and main work of this thesis are as follows: (1) It surveys related work on NER in China and abroad, summarizes and implements the mainstream NER methods, and analyzes and compares their strengths and weaknesses, which informs the subsequent work. (2) To overcome the limitations of a single model, it combines multiple models with multi-task learning for Chinese NER. The resulting method, BiLSTM-CRF-MTL, addresses the shortcomings of single models well and needs little hand-crafted feature construction, because the model learns features through several related tasks. (3) To address the problems of word-sequence-based recognition, it performs Chinese NER on character sequences, introduces word embeddings based on an external corpus and new-word discovery, and uses a keyword-extraction-based confidence score for Chinese word segmentation as a feature to mitigate segmentation noise. (4) To help the model fit context better and to ease the shortage of labeled samples, it proposes a sample generation method based on entity-word substitution. Chinese NER is evaluated on the 1998 People's Daily corpus, comparing against several single-model recognition methods and methods from the related literature. The experimental results show that the proposed method achieves an average F1 of 88.79%, a substantial improvement over the other methods.
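The sample generation idea named in the abstract, substituting entity words to create additional labeled samples, can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual procedure, so everything here is an assumption made for illustration: the character-level BIO tagging scheme, the function names `extract_spans` and `substitute_entities`, and the toy lexicon.

```python
import random


def extract_spans(tags):
    """Return (start, end, type) entity spans from a BIO tag list; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues:
            # Any tag that does not continue the open entity closes it.
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def substitute_entities(chars, tags, lexicon, rng):
    """Build one new sample by swapping each entity for a same-type lexicon entry.

    `lexicon` is a hypothetical dict mapping an entity type to candidate strings.
    Entities whose type has no candidates are copied unchanged.
    """
    new_chars, new_tags = [], []
    pos = 0
    for start, end, etype in extract_spans(tags):
        # Copy the untouched context that precedes this entity.
        new_chars.extend(chars[pos:start])
        new_tags.extend(tags[pos:start])
        candidates = lexicon.get(etype)
        replacement = list(rng.choice(candidates)) if candidates else list(chars[start:end])
        new_chars.extend(replacement)
        # Re-tag the replacement so the labels match its new length.
        new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(replacement) - 1))
        pos = end
    new_chars.extend(chars[pos:])
    new_tags.extend(tags[pos:])
    return new_chars, new_tags


chars = list("张三在杭州工作")
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
lexicon = {"PER": ["李四光"], "LOC": ["北京"]}  # toy lexicon, assumed for illustration
new_chars, new_tags = substitute_entities(chars, tags, lexicon, random.Random(0))
print("".join(new_chars), new_tags)
```

Because the context characters are copied unchanged and only the entity span is swapped, each generated sentence keeps the original context while exposing the model to a different entity of the same type. A real pipeline would presumably draw candidates from the entities seen in the training corpus rather than from a hand-written dictionary.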
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 1569565
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1569565.html