Research on Chinese Named Entity Recognition Based on Deep Learning
Published: 2018-04-27 07:39
Topic: Chinese named entity recognition + deep learning. Source: Beijing University of Technology, master's thesis, 2015
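【Abstract】: Chinese named entity recognition is one of the basic tasks in natural language processing, and a foundational step in integrated applications such as automatic question answering and information extraction. Over the past decade or more, researchers in China and abroad have discussed and studied entity recognition in text extensively. However, with the rapid growth of the Internet, large volumes of irregular, multi-domain text keep accumulating, which places new demands on named entity recognition technology. The main work of this thesis is as follows: (1) It surveys approaches to named entity recognition in China and abroad and analyzes the current mainstream models and technical trends. While summarizing the shortcomings of current mainstream methods and the particular characteristics of Chinese named entity recognition, it points out a new line of approach: applying deep learning theory to the Chinese named entity recognition problem. (2) It proposes a deep neural network model based on a stacked autoencoder classifier and studies in depth how the model applies to named entity recognition. It solves the problem of converting Chinese text sequences into model input vectors and derives vectorized forward and backward propagation formulas that are convenient for engineering implementation. It also summarizes an effective set of parameter initialization and tuning methods, which improves both model training and entity tagging performance. (3) Building on this model, it carries out a large number of comparative experiments. The results show that the deep neural network tagging model recognizes Chinese entities well, and that its test results on the People's Daily corpus reached the best level reported at the time. Its advantage over the conditional random field (CRF) model is most pronounced for location names and organization names: recall for location and organization names improves over the CRF results by 9.60% and 8.84% respectively, and F-score by 3.76% and 2.35% respectively. (4) It implements a Chinese named entity recognition system based on the deep neural network model and proposes a semi-automatic incremental learning workflow: the system uses a semi-supervised post-processing method that combines boundary entropy with incremental training, replacing the traditional framework that combined rules with statistics. This addresses the practical problems of scarce annotated Chinese corpora and high training and maintenance costs, so that the system can process massive Chinese data quickly and effectively with only a small amount of manual intervention. Practice shows that neural network models based on deep learning theory can be applied well to Chinese named entity recognition. The system built around this model is robust and maintainable, and can meet the new demands placed on Chinese named entity recognition in the big data setting.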
[Abstract]: Chinese named entity recognition is one of the basic tasks in the field of natural language processing, and it is also a basic link in comprehensive applications of natural language processing such as automatic question answering and information extraction. Over the past decade or more, scholars at home and abroad have extensively discussed and studied entity recognition in text. However, with the rapid development of the Internet, the amount of irregular, multi-domain text data keeps growing, which puts new requirements on named entity recognition technology. The main work of this paper is as follows: (1) The methods for named entity recognition at home and abroad are investigated, and the current mainstream models and the trend of technical development are analyzed. While summarizing the defects of current mainstream methods and the particularity of Chinese named entity recognition, the paper points out a new way to solve the problem of Chinese named entity recognition by using the theory of deep learning. (2) A deep neural network model based on a stacked autoencoder classifier is proposed, and the application of this model to the named entity recognition task is studied in depth. The problem of transforming a Chinese text sequence into model input vectors is solved, and vectorized forward and backward propagation formulas that are convenient for engineering implementation are derived. At the same time, a set of effective parameter initialization and parameter tuning methods is summarized, and the model training process and the entity tagging results are optimized. (3) On the basis of the model, a large number of comparative experiments are carried out. The experimental results show that this deep neural network tagging model performs well on Chinese entity recognition, and its test results on the People's Daily corpus reach the best level at present. It is especially advantageous over the conditional random field model in recognizing place names and organization names: the recall for place names and organization names is 9.60% and 8.84% higher, respectively, than that of the conditional random field, and the F-score is 3.76% and 2.35% higher, respectively. (4) A Chinese named entity recognition system based on the deep neural network model is implemented. A semi-automatic processing flow based on incremental learning is proposed: the system uses a semi-supervised post-processing method combining boundary entropy and incremental training to replace the traditional framework of combining rules with statistics. It solves the practical problems of scarce annotated Chinese corpora and high training and maintenance costs, and enables the system to deal with massive Chinese data quickly and effectively with only a small amount of manual intervention. Practice shows that the neural network model based on deep learning theory can be applied well to the Chinese named entity recognition task. The Chinese named entity recognition system built around this model has good robustness and maintainability and can meet the new requirements of Chinese named entity recognition in the big data setting.
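The abstract names boundary entropy as one ingredient of the post-processing step but does not define it. As a rough illustration only, the sketch below uses the definition common in unsupervised Chinese word segmentation: the entropy of the characters that appear immediately to the left or right of a candidate string in a corpus. Whether the thesis uses exactly this definition is an assumption, and the function name `boundary_entropy` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def boundary_entropy(corpus, candidate, side="right"):
    """Entropy (in bits) of the characters adjacent to `candidate` in `corpus`.

    Assumed definition, not taken from the thesis: a high value means the
    candidate is followed (or preceded) by many different characters, which
    suggests that its edge is a real entity or word boundary.
    """
    neighbours = Counter()
    start = corpus.find(candidate)
    while start != -1:
        # Position of the neighbouring character on the requested side.
        idx = start + len(candidate) if side == "right" else start - 1
        if 0 <= idx < len(corpus):
            neighbours[corpus[idx]] += 1
        # Continue searching, allowing overlapping occurrences.
        start = corpus.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in neighbours.values())


# Toy corpus (hypothetical): "北京" occurs three times in different contexts.
corpus = "去北京了,在北京的,到北京吧"
full_right = boundary_entropy(corpus, "北京", side="right")
partial_right = boundary_entropy(corpus, "北", side="right")
```

In this toy corpus the complete name "北京" is followed by three different characters, so its right boundary entropy is higher than that of the fragment "北", which is always followed by "京". A post-processing step could use such a score to decide whether a predicted entity span should be extended or trimmed, although the abstract does not say how the thesis combines it with incremental training.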
【Degree-granting institution】: Beijing University of Technology
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: TP391.1
Article ID: 1809857
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1809857.html