Entity Relation Extraction Based on Deep Convolutional Neural Networks
Topic: relation extraction  Focus: deep convolutional neural networks  Source: Taiyuan University of Technology, 2017 master's thesis
[Abstract]: Entity relation extraction has long been an active research topic in natural language processing. Accurately identifying the semantic relation between two entities is crucial to information extraction, and it also matters for knowledge base construction and information retrieval. With the rapid progress of deep learning in image and vision tasks, deep learning has in recent years been introduced into natural language processing and has become a research focus there.

Traditional entity relation extraction methods require discrete features to be selected by hand before model training, and the quality of that selection directly affects the final extraction results. It is impossible to know in advance which features are most effective, and more features are not necessarily better; feature effectiveness is mostly judged from expert experience. Feature selection also relies largely on existing natural language processing (NLP) tools, which is time-consuming and laborious and tends to cause error propagation. By contrast, relation extraction algorithms based on deep learning can learn features automatically from the raw corpus, which both reduces the dependence on NLP tools and makes full use of the structural information in the text. Previous work has also shown that the convolutional neural network (CNN), with its distinctive network structure, learns features well. This thesis therefore uses a deep convolutional neural network for entity relation extraction.

First, a sentence-based algorithm for measuring word importance, TP-ISP (term proportion-inverse sentence proportion), is proposed. The algorithm yields a tpisp value for each word in each category, and sorting by this value gives a ranking of word importance. The top-ranked words are then selected as keyword features that characterize the category and are fed to the network as initial input together with the word-vector features and word-position features of the original sentence. This mitigates the shortcoming of existing deep learning methods that learn features from word vectors alone. Adding the category keyword features increases the discrimination between categories and also compensates for the limits of the network's automatic feature learning. Finally, in the network training stage, the thesis adopts a piecewise max pooling strategy: the highest-scoring feature in each segment is selected, and these features are combined as the input to the final classifier. This strategy reduces, to some extent, the information loss of traditional max pooling.

In addition, because Chinese corpora are scarce and comparatively little research exists on Chinese relation extraction, the thesis takes the dataset of the COAE (Chinese Opinion Analysis Evaluation) 2016 evaluation task as its object and adapts the model to the particularities of Chinese corpora to address Chinese entity relation extraction. Chinese word vectors are trained with the Skip-gram model of the word2vec tool on Chinese Wikipedia data, and the resulting word-vector table is better than one generated by word2vec with random initialization alone. Experiments show that the model substantially improves entity relation extraction results on both English and Chinese corpora.
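The abstract names TP-ISP but does not give its formula. The sketch below is therefore only a guess by analogy with TF-IDF: it assumes the term proportion is a word's share of all tokens in a category and the inverse sentence proportion is the log of total sentences over sentences containing the word. The function names `tpisp_scores` and `top_keywords` and the toy corpus are hypothetical, not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def tpisp_scores(labeled_sentences):
    """Score each word per category with an ASSUMED TF-IDF-style formula.

    labeled_sentences: list of (category, list_of_tokens) pairs.
    Returns {category: {token: score}}.
    """
    term_counts = defaultdict(Counter)  # category -> token -> occurrences
    sentence_counts = Counter()         # token -> number of sentences containing it
    for category, tokens in labeled_sentences:
        term_counts[category].update(tokens)
        sentence_counts.update(set(tokens))

    total_sentences = len(labeled_sentences)
    scores = {}
    for category, counts in term_counts.items():
        total_terms = sum(counts.values())
        scores[category] = {
            # assumed: term proportion * log(inverse sentence proportion)
            token: (count / total_terms)
            * math.log(total_sentences / sentence_counts[token])
            for token, count in counts.items()
        }
    return scores


def top_keywords(scores, k):
    """Rank words by score (ties broken alphabetically) and keep the top k."""
    return {
        category: [
            token
            for token, _ in sorted(
                word_scores.items(), key=lambda item: (-item[1], item[0])
            )[:k]
        ]
        for category, word_scores in scores.items()
    }


# Hypothetical toy corpus with three relation categories.
corpus = [
    ("Cause-Effect", ["the", "fire", "caused", "the", "damage"]),
    ("Cause-Effect", ["smoking", "caused", "the", "illness"]),
    ("Component-Whole", ["the", "wheel", "is", "part", "of", "the", "car"]),
    ("Component-Whole", ["the", "lid", "is", "part", "of", "the", "jar"]),
    ("Entity-Origin", ["the", "wine", "comes", "from", "the", "region"]),
    ("Entity-Origin", ["the", "letter", "comes", "from", "the", "ministry"]),
]
scores = tpisp_scores(corpus)
keywords = top_keywords(scores, 1)
```

Under this assumed formula a word that occurs in every sentence, such as "the" here, scores zero, while a word concentrated in one category ranks first for that category. The thesis's actual definition may weight these factors differently.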
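The piecewise max pooling step can be illustrated with a minimal sketch. The abstract does not say how the segments are defined; the code assumes the common choice of splitting each convolution feature map into three segments at the two entity positions. The names `piecewise_max_pool` and `pool_all_filters`, the zero fill for an empty segment, and the sample values are illustrative assumptions.

```python
def piecewise_max_pool(feature_map, e1_pos, e2_pos):
    """Return the maximum of each of three segments of one feature map.

    feature_map: list of floats, one value per token position.
    e1_pos, e2_pos: token indices of the two entities (ASSUMED split points).
    """
    first, second = sorted((e1_pos, e2_pos))
    segments = [
        feature_map[: first + 1],             # start up to the first entity
        feature_map[first + 1 : second + 1],  # after first entity up to the second
        feature_map[second + 1 :],            # after the second entity
    ]
    # Assumed convention: an empty segment contributes 0.0.
    return [max(segment) if segment else 0.0 for segment in segments]


def pool_all_filters(feature_maps, e1_pos, e2_pos):
    """Concatenate the pooled segments of every filter into one vector."""
    pooled = []
    for feature_map in feature_maps:
        pooled.extend(piecewise_max_pool(feature_map, e1_pos, e2_pos))
    return pooled


# Hypothetical outputs of two convolution filters over a 7-token sentence,
# with the entities at token positions 1 and 4.
feature_maps = [
    [0.1, 0.9, 0.3, 0.2, 0.7, 0.4, 0.8],
    [0.5, 0.2, 0.6, 0.1, 0.0, 0.3, 0.2],
]
classifier_input = pool_all_filters(feature_maps, 1, 4)
```

Ordinary max pooling would keep a single value per filter (0.9 and 0.6 here). The piecewise variant keeps one value per segment, so the classifier still sees which region of the sentence, relative to the two entities, produced a strong response.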
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1; TP183