Research on Chinese Personal Social Relationship Extraction Based on News Data
[Abstract]: As the Internet continues to grow, so does the volume of information and data it contains. Information extraction aims to mine structured data from the massive body of unstructured data on the Internet. Entity relation extraction is a subtask of information extraction and has become a research hotspot in data mining and information retrieval. Personal relationship extraction is one branch of entity relation extraction. Personal relationship triples are used to build personal relationship networks and question-answering systems, so they have high application value. At present, however, research on relation extraction focuses mainly on English corpora, and progress on relation extraction from Chinese data has been slow and difficult. Machine-learning-based relation extraction has become a popular research topic because of its strong results. Distinguishing methods by how their training data is acquired, this thesis studies three approaches, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows: 1. Supervised relation extraction depends heavily on manually annotated training data, and manual annotation is too costly. To achieve high relation extraction performance with only a small amount of labeled data, this thesis studies semi-supervised relation extraction. A semi-supervised algorithm based on label propagation can improve relation extraction when little labeled data is available, but selecting the training samples at random affects extraction performance.
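As an illustration of the label propagation idea mentioned above, the following is a minimal sketch, not the thesis's implementation. It assumes each candidate relation instance has already been turned into a numeric feature vector, and the RBF similarity, the `sigma` parameter, and the function name `label_propagation` are all assumptions made for this example.

```python
import math


def label_propagation(features, labels, n_classes, sigma=1.0, n_iter=100):
    """Spread labels from labeled to unlabeled samples over a similarity graph.

    features: list of feature vectors (lists of floats); hypothetical input.
    labels:   list of class indices, with None marking unlabeled samples.
    Returns one predicted class index per sample.
    """
    n = len(features)
    # Pairwise RBF weights: nearby samples get weights close to 1.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                weights[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Row-normalise so each row is a transition distribution.
    trans = []
    for row in weights:
        total = sum(row) or 1.0  # guard against an all-zero row
        trans.append([w / total for w in row])

    def initial(i):
        # One-hot for labeled samples, uniform for unlabeled ones.
        if labels[i] is None:
            return [1.0 / n_classes] * n_classes
        return [1.0 if c == labels[i] else 0.0 for c in range(n_classes)]

    dist = [initial(i) for i in range(n)]
    for _ in range(n_iter):
        new_dist = [
            [sum(trans[i][j] * dist[j][c] for j in range(n)) for c in range(n_classes)]
            for i in range(n)
        ]
        # Clamp the labeled samples back to their known labels.
        for i in range(n):
            if labels[i] is not None:
                new_dist[i] = initial(i)
        dist = new_dist
    return [max(range(n_classes), key=lambda c: dist[i][c]) for i in range(n)]


# Toy data: two well-separated groups, one labeled sample in each.
toy_features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
toy_labels = [0, None, None, 1, None, None]
print(label_propagation(toy_features, toy_labels, n_classes=2))
```

In this sketch the labeled samples are fixed in advance, so the result depends on which samples happen to carry labels.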
To improve relation extraction with label propagation, this thesis combines the label propagation algorithm with active learning to extract personal relationships. The method actively selects the samples most helpful for relation classification, which reduces the number of uninformative labeled samples and improves system performance for the same amount of labeled data. 2. Current relation extraction research usually uses distant supervision to construct training data automatically, but the basic assumption of distant supervision does not always hold, so noise is introduced into the training data. This thesis proposes a scoring-function-based method for filtering noise from the training data, which reduces the noisy instances in training data obtained through distant supervision. In addition, because the accuracy of current relation extraction systems is not yet satisfactory, this thesis applies word vector techniques: it extracts word-vector-based features from single-sentence text and adds them to the commonly used relation extraction feature set to improve the performance of the personal relationship extraction system. 3. All of the above methods require relation types to be predefined before relation instances can be extracted. This limits the types of relations the extraction model can obtain, and relation triples of new relation types cannot be found. This thesis therefore proposes an unsupervised relation extraction method that requires neither training data nor predefined relation types.
In this method, highly correlated person pairs are first obtained from news headline data as candidates for relation extraction; the news data for each related pair is then preprocessed, and keywords in the sentences where the two people co-occur are extracted using TF-IDF. Next, associations between words are derived from co-occurrence information, and a keyword association network is built. Finally, the relationships between people are obtained by graph clustering analysis of the association network.
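The unsupervised pipeline described above (TF-IDF keywords, a keyword association network, then graph clustering) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions: the input is already word-segmented, each person pair's co-occurrence sentences are treated as one document, keywords are linked when they are selected for the same pair, and connected components stand in for the graph clustering step, since the abstract does not name the clustering algorithm. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_keywords(docs, top_k=3):
    """docs maps a person pair to the tokens of its co-occurrence sentences."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    keywords = {}
    for pair, tokens in docs.items():
        term_freq = Counter(tokens)
        scores = {
            w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, c in term_freq.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[pair] = ranked[:top_k]
    return keywords


def keyword_network(keywords):
    """Count how often two keywords are selected for the same person pair."""
    edges = Counter()
    for words in keywords.values():
        for a, b in combinations(sorted(words), 2):
            edges[(a, b)] += 1
    return edges


def cluster_keywords(edges, min_weight=1):
    """Connected components over sufficiently heavy edges.

    A simple stand-in for graph clustering, not the thesis's algorithm.
    """
    adjacency = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Toy, already-tokenized data with placeholder person names.
example_docs = {
    ("A", "B"): ["wife", "married", "wedding", "said"],
    ("C", "D"): ["married", "wife", "husband", "said"],
    ("E", "F"): ["coach", "team", "player", "said"],
    ("G", "H"): ["coach", "player", "training", "said"],
}
example_keywords = tfidf_keywords(example_docs)
example_clusters = cluster_keywords(keyword_network(example_keywords))
print(example_clusters)
```

Each resulting keyword cluster can be read as one candidate relation type, and a person pair can be assigned to the cluster that contains most of its keywords. Raising `min_weight` keeps only associations supported by several pairs.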
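[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1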