Research on Chinese Personal Social Relation Extraction Based on News Data

Published: 2018-12-18 21:25

[Abstract]: As the Internet continues to grow, so does the volume of information and data it contains. Information extraction aims to mine structured data from the massive unstructured data on the Internet. Entity relation extraction, a subtask of information extraction, has become a research focus in data mining and information retrieval. Person relation extraction is one facet of entity relation extraction; person relation triples are used to build person relation networks and question answering systems, and therefore have high application value. However, current relation extraction research concentrates mainly on English corpora, while relation extraction on Chinese data has progressed slowly and is more difficult. Machine-learning-based relation extraction methods have become the focus of current research because of their comparatively good results. Distinguishing methods by how their training data is obtained, this thesis studies three approaches, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows:

1. Supervised relation extraction depends heavily on manually annotated training data, and manual annotation is costly. To achieve good extraction performance with only a small amount of labeled data, this thesis studies semi-supervised relation extraction. A semi-supervised algorithm based on label propagation improves extraction when little labeled data is available, but selecting the training samples at random hurts performance. To improve the results of label propagation, this thesis combines it with active learning for person relation extraction. The method actively selects for annotation the samples that help relation classification most, which reduces the number of uninformative labeled samples and improves system performance for the same amount of labeled data.

2. In current relation extraction research, distant supervision is commonly used to construct training data automatically, but its basic assumption is not always accurate, so noise is introduced into the training data. This thesis proposes a method that filters this noise with a scoring function, reducing the noise in training data obtained through distant supervision. In addition, because the accuracy of current relation extraction systems is not yet satisfactory, this thesis uses word embeddings to extract several embedding-based features from single-sentence text and adds them to the commonly used relation extraction feature set, in order to improve the performance of the person relation extraction system.

3. All of the above methods require relation types to be predefined before the corresponding relation instances can be extracted. This limits the relation types a model can obtain, so triples of new relation types cannot be found. This thesis therefore proposes an unsupervised relation extraction method that requires neither training data nor predefined relation types. The method first obtains highly associated person pairs from news headline data. It then crawls and preprocesses the news in which those pairs appear and uses TF-IDF to obtain keywords from the sentences where the two persons co-occur. Next, it derives associations between words from word co-occurrence information and builds a keyword association network. Finally, it applies graph clustering to the association network to obtain the person relations.
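The unsupervised pipeline described in the abstract (TF-IDF keywords, keyword co-occurrence network, graph clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the `top_k` and `min_count` parameters, the toy pre-tokenized sentences for a hypothetical person pair "A" and "B", and the choice to treat each sentence as one TF-IDF document. The abstract does not name the graph clustering algorithm, so connected components are used below as a simple stand-in, not as the thesis's method.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_keywords(sentences, top_k=3):
    """Return the top_k TF-IDF terms of each tokenized sentence.

    Each sentence is treated as one document (an assumption of this sketch).
    """
    n = len(sentences)
    df = Counter(term for sent in sentences for term in set(sent))
    keywords = []
    for sent in sentences:
        tf = Counter(sent)
        scores = {t: (c / len(sent)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


def cooccurrence_graph(keyword_lists, min_count=2):
    """Link two keywords if they co-occur in at least min_count sentences."""
    counts = Counter()
    for kws in keyword_lists:
        for a, b in combinations(sorted(set(kws)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), c in counts.items():
        if c >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Stand-in for graph clustering: each connected component is one cluster."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


# Toy co-occurrence sentences for a hypothetical person pair (A, B).
sentences = [
    ["A", "and", "B", "held", "wedding", "ceremony"],
    ["A", "and", "B", "wedding", "ceremony", "photos"],
    ["A", "and", "B", "signed", "contract", "company"],
    ["A", "and", "B", "contract", "company", "dispute"],
]
keywords = tfidf_keywords(sentences)
graph = cooccurrence_graph(keywords)
clusters = connected_components(graph)
print([sorted(c) for c in clusters])
# → [['ceremony', 'wedding'], ['company', 'contract']]
```

In this toy run the two keyword clusters hint at two candidate relation types for the pair (a marriage relation and a business relation) without any predefined label set. Real Chinese news text would first need word segmentation, which is outside this sketch.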
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1

