Research on Chinese Personal Social Relation Extraction Based on News Data
[Abstract]: As the Internet continues to grow, so does the volume of information and data it contains. Information extraction aims to mine structured data from the massive amount of unstructured data on the Internet. Entity relation extraction is a subtask of information extraction and has become a research hotspot in data mining and information retrieval. Personal relation extraction is one branch of entity relation extraction. The resulting personal-relation triples can be used to build personal relationship networks and question-answering systems, so the task has high application value. At present, however, research on relation extraction focuses mainly on English corpora, while relation extraction on Chinese data has progressed slowly and is more difficult. Machine-learning-based relation extraction has become a popular research topic because of its good extraction performance. Distinguishing methods by how their training data is obtained, this thesis studies three approaches, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows:
1. Supervised relation extraction depends heavily on manually annotated training data, and manual annotation is costly. To achieve good extraction performance with only a small amount of labeled data, this thesis studies semi-supervised relation extraction. A semi-supervised algorithm based on label propagation can improve extraction results when labeled data is scarce, but selecting the training samples at random can degrade extraction performance.
To improve the extraction results of the label propagation algorithm, this thesis combines label propagation with active learning to extract personal relations. The method actively selects the samples that are most helpful for relation classification, which reduces the number of uninformative labeled samples and improves system performance for the same amount of labeled data.
2. Current relation extraction research commonly uses distant supervision to construct training data automatically, but the basic assumption of distant supervision does not always hold, so noise is introduced into the training data. This thesis proposes a scoring-function-based method for filtering noise from training data, which reduces the noise in training data obtained through distant supervision. In addition, to address the unsatisfactory accuracy of current relation extraction systems, this thesis applies word vector techniques: it extracts word-vector-based features from single-sentence text and adds them to the commonly used relation extraction feature set to improve the performance of the personal relation extraction system.
3. All of the above methods require relation types to be defined in advance before instances of those relations can be extracted. This restricts the relation types that the extraction model can produce and prevents it from obtaining triples for new relation types. This thesis therefore proposes an unsupervised relation extraction method that requires neither training data nor predefined relation types.
In this method, person pairs with a high degree of association are first obtained from news headline data as candidates for relation extraction; the news data for these related persons is then preprocessed, and keywords in the sentences where both persons co-occur are extracted with TF-IDF. Next, associations between words are derived from co-occurrence information and used to build a keyword association network. Finally, the relations between persons are obtained through graph clustering analysis of the association network.
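The unsupervised pipeline described above (TF-IDF keywords, keyword association network, graph clustering) can be illustrated with a minimal Python sketch. The abstract does not specify the association measure or the clustering algorithm, so this sketch makes its own assumptions: two keywords are linked when they are selected together in at least `min_weight` sentences, and clusters are simply the connected components of the resulting graph. All function names, parameters, and the sample data are hypothetical; English tokens stand in for word-segmented Chinese co-occurrence sentences with the person names already removed.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_keywords(sentences, top_k=3):
    """Return the top_k TF-IDF-ranked words of each tokenized sentence."""
    n = len(sentences)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for sent in sentences for word in set(sent))
    keywords = []
    for sent in sentences:
        tf = Counter(sent)
        scores = {w: (c / len(sent)) * math.log(n / df[w]) for w, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_k])
    return keywords


def build_network(keyword_lists, min_weight=2):
    """Link two keywords selected together in at least min_weight sentences."""
    weights = Counter()
    for kws in keyword_lists:
        for a, b in combinations(sorted(set(kws)), 2):
            weights[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def cluster(graph):
    """Group keywords by connected component.

    Assumption: a simple stand-in for the graph clustering step,
    whose actual algorithm the abstract does not name.
    """
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


def relation_clusters(sentences, top_k=3, min_weight=2):
    """Run the three steps: keywords, association network, clustering."""
    return cluster(build_network(tfidf_keywords(sentences, top_k), min_weight))


# Hypothetical co-occurrence sentences for one person pair.
example_sentences = [
    ["said", "today", "married", "wedding", "ceremony"],
    ["said", "today", "married", "wedding", "couple"],
    ["said", "today", "film", "director", "starring"],
    ["said", "today", "film", "director", "premiere"],
]

print(relation_clusters(example_sentences))
```

On this toy input the sketch yields two keyword clusters, `["director", "film"]` and `["married", "wedding"]`, each of which could be read as a candidate relation description for the person pair. Words that occur in every sentence receive a TF-IDF score of zero and drop out of the network, which is the role TF-IDF plays in the keyword step.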
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP391.1