Chinese Open N-ary Entity Relation Extraction
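Published: 2017-12-31 20:23
Keywords: Chinese open n-ary entity relation extraction. Source: Taiyuan University of Technology, 2017 master's thesis. Document type: degree thesis
Related topics: open information extraction, entity relation extraction, machine learning, logistic regression classifier, support vector machine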
[Abstract]: Information extraction is the task of extracting multi-level semantic information of specified types from text, such as entity words, relation words, times, places, and events, and converting that information into a structured output format. With the exponential growth of online information and the rapid development of artificial intelligence, information extraction has become an active research field. Entity relation extraction is both an important step and an important task within information extraction; its main goal is to extract entity relation types and entity relation values from text. It has theoretical and practical significance for knowledge graph construction and domain ontologies, question answering systems, text similarity computation, and deeper natural language processing problems such as semantic understanding and text summarization.
Research on entity relation extraction covers traditional and open entity relation extraction. Traditional entity relation extraction mainly targets restricted-domain text and restricted categories of entities and relations, and requires a language model built for one specific domain. Because Internet information grows exponentially and spans many domains, traditional extraction cannot meet the needs of web text extraction. Open information extraction has therefore become an important research area. Its main task is to extract multi-level semantic information such as entities, relations, and events from large-scale, heterogeneous, cross-domain text and to output it in a structured format, so that web text can be processed across domains and at scale. Open entity relation extraction for English text falls mainly into two stages: one in which entity words are extracted first, and one in which relation words are extracted first. Research on Chinese text has concentrated mainly on binary relation extraction and on methods that use shallow semantic features. This thesis therefore proposes an open entity relation extraction method for Chinese text based on dependency analysis. The method can extract n-ary relations, and it adds deep semantic features that improve extraction accuracy. An extraction system is designed and implemented on the basis of this method.
The proposed dependency-based open information extraction method targets large-scale, heterogeneous Chinese web text. First, the web text is preprocessed: main-text extraction from web pages, Chinese word segmentation, Chinese part-of-speech tagging, and dependency parsing. Next, heuristic rules identify base noun phrases, and heuristic rules based on inter-word dependency relations produce candidate entity relation tuples. A trained machine learning classifier then filters the candidates to obtain the final entity relation tuples. Finally, the filtered tuples are normalized and stored in a database. The resulting large collection of entity relation tuples can also be used for other natural language processing tasks.
The thesis uses the Language Technology Platform-Cloud (LTP-Cloud) for text preprocessing, and defines a set of part-of-speech combination rules for base noun phrases and a set of dependency-based rules for extracting entity relation tuples. In the filtering stage, a machine learning classifier is trained on features such as word count, part of speech, and inter-word distance, and judges whether each candidate tuple is correct. In an extraction experiment on the test corpus, the method reaches an accuracy of 81.25%. Finally, the proposed method is used to build a Chinese open n-ary entity relation extraction system, which extracts a large number of entity relation tuples.
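The candidate-generation and feature steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's rule set: the `Token` class, `NOUN_TAGS`, `candidate_tuples`, and `features` are hypothetical names introduced here, the single rule shown is only one plausible dependency-based heuristic, and the parse is written by hand using LTP-style labels (SBV, VOB, ADV, POB) rather than obtained from LTP-Cloud.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # idx of the governing token, 0 for the root
    rel: str   # dependency label toward the head


# Assumption: tags treated as entity-like nouns in this sketch.
NOUN_TAGS = {"n", "nh", "ns", "ni", "nz"}


def candidate_tuples(tokens):
    """Return (relation_token, argument_tokens) for each verb with >= 2 noun arguments."""
    by_head = {}
    for t in tokens:
        by_head.setdefault(t.head, []).append(t)

    results = []
    for verb in tokens:
        if verb.pos != "v":
            continue
        args = []
        for child in by_head.get(verb.idx, []):
            if child.rel in ("SBV", "VOB") and child.pos in NOUN_TAGS:
                # Direct subject or object of the verb.
                args.append(child)
            elif child.rel == "ADV" and child.pos == "p":
                # Prepositional adverbial: take the preposition's noun object.
                args.extend(
                    g for g in by_head.get(child.idx, [])
                    if g.rel == "POB" and g.pos in NOUN_TAGS
                )
        if len(args) >= 2:
            results.append((verb, sorted(args, key=lambda t: t.idx)))
    return results


def features(verb, args):
    """Simple numeric features (counts and word distances) for a candidate tuple."""
    positions = [a.idx for a in args] + [verb.idx]
    return {
        "num_args": len(args),
        "span_words": max(positions) - min(positions) + 1,
        "max_arg_distance": max(abs(a.idx - verb.idx) for a in args),
        "has_subject": int(any(a.rel == "SBV" for a in args)),
    }


# Hand-written parse of "小明 在 北京 创办 了 公司"
# ("Xiaoming founded a company in Beijing").
SENTENCE = [
    Token(1, "小明", "nh", 4, "SBV"),
    Token(2, "在", "p", 4, "ADV"),
    Token(3, "北京", "ns", 2, "POB"),
    Token(4, "创办", "v", 0, "HED"),
    Token(5, "了", "u", 4, "RAD"),
    Token(6, "公司", "n", 4, "VOB"),
]

for verb, args in candidate_tuples(SENTENCE):
    print(verb.word, [a.word for a in args], features(verb, args))
```

Here the verb 创办 yields one candidate with three arguments (小明, 北京, 公司), which is the kind of n-ary tuple a binary extractor would miss. In a full pipeline, feature dictionaries like these would be vectorized and passed to a trained classifier, such as the logistic regression or support vector machine models named in the keywords, to accept or reject each candidate.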
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1