Research on Chinese Person Social Relation Extraction Based on News Data

Published: 2018-12-18 21:25
[Abstract]: As the Internet continues to expand, the information and data it contains keep growing. Information extraction aims to mine structured data from the massive unstructured data on the Internet. Entity relation extraction, a subtask of information extraction, has become a research hotspot in data mining and information retrieval. Person relation extraction is one aspect of entity relation extraction; person relation triples are used to build person relation networks and question answering systems, and therefore have high application value. However, current research on relation extraction concentrates on English corpora, while relation extraction on Chinese data has progressed slowly and is more difficult. Machine-learning-based methods have become the focus of current research because of their good relation extraction results. Grouping methods by how their training data is obtained, this thesis studies three approaches, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows:

1. Supervised relation extraction depends heavily on manually annotated training data, and manual annotation is costly. To achieve good extraction performance with only a small amount of labeled data, this thesis studies semi-supervised relation extraction. A semi-supervised algorithm based on label propagation improves extraction when few labeled examples are available, but selecting the training samples at random hurts performance. To improve the results of label propagation, this thesis combines it with active learning for person relation extraction. The method actively selects for annotation the samples that help relation classification most, which reduces the number of uninformative labeled samples and improves system performance for the same amount of labeled data.

2. In current relation extraction research, distant supervision is commonly used to construct training data automatically, but its basic assumption does not always hold, so noise is introduced into the training data. This thesis proposes a method that filters this noise with a scoring function, reducing the noisy instances in training data obtained through distant supervision. In addition, to address the unsatisfactory accuracy of current relation extraction systems, this thesis uses word vectors to extract several word-vector-based features from single sentences and adds them to the commonly used relation extraction feature set, in order to improve the performance of the person relation extraction system.

3. All of the methods above require relation types to be defined in advance before the corresponding relation instances can be extracted. This limits the kinds of relations the extraction model can obtain and rules out triples for new relation types. This thesis therefore proposes an unsupervised relation extraction method that needs neither training data nor predefined relation types. The method first obtains strongly associated person pairs from news headline data. Next, it crawls the news articles in which each associated pair appears, preprocesses them, and uses TF-IDF to obtain keywords from the sentences in which the pair co-occurs. It then derives associations between words from word co-occurrence information and builds a keyword association network. Finally, it applies graph clustering to the association network to obtain the person relations.
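The unsupervised pipeline described in the abstract (TF-IDF keywords, a keyword co-occurrence network, then graph clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the sentences are toy English tokens standing in for segmented Chinese news text with the person names already removed, the TF-IDF weighting is one common variant, and connected components stand in for whichever graph clustering algorithm the thesis actually uses (the abstract does not say). All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_keywords(pair_sentences, background_sentences, top_k=6):
    """Rank words from a person pair's co-occurrence sentences by TF-IDF.

    Assumption: TF is counted over all of the pair's sentences together,
    and IDF treats every sentence (pair + background) as one document.
    """
    all_sentences = pair_sentences + background_sentences
    n_docs = len(all_sentences)
    doc_freq = Counter()
    for sent in all_sentences:
        doc_freq.update(set(sent))
    term_freq = Counter(w for sent in pair_sentences for w in sent)
    scores = {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in term_freq.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def cooccurrence_graph(sentences, keywords, min_count=1):
    """Link two keywords when they appear together in >= min_count sentences."""
    keyset = set(keywords)
    counts = Counter()
    for sent in sentences:
        present = sorted(keyset & set(sent))
        counts.update(combinations(present, 2))
    graph = {w: set() for w in keywords}
    for (a, b), c in counts.items():
        if c >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Cluster the graph into connected components (a stand-in clustering)."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters


# Toy sentences in which a hypothetical person pair co-occurs.
pair_sentences = [
    ["wedding", "wife", "said"],
    ["wife", "husband", "said"],
    ["film", "costar", "said"],
    ["costar", "film", "premiere"],
]
# Toy sentences from unrelated news, used only to estimate IDF.
background_sentences = [
    ["said", "today"],
    ["said", "market"],
    ["said", "match"],
    ["today", "market"],
]

keywords = tfidf_keywords(pair_sentences, background_sentences)
graph = cooccurrence_graph(pair_sentences, keywords)
clusters = connected_components(graph)
print(keywords)
print(clusters)
```

In this toy run each resulting cluster groups keywords that hint at one relation between the pair (a family relation and a co-worker relation). Mapping a cluster to a named relation type is left open here, since the abstract does not describe that step.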
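[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1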


Article ID: 2386521


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2386521.html

