Context-Based Multi-Feature Graph Model for Chinese Entity Linking
Published: 2018-08-25 15:21
[Abstract]: The growth of online information and the rising demand for semantic search have made knowledge base population an active topic in natural language processing. Entity linking is a core technique for knowledge base population: it maps each entity mention in a text to the correct entity in a knowledge base, and it has both theoretical and practical value. Most existing entity linking work targets English, while research on Chinese is still at an early stage. The main reasons are: (1) there is no unified, authoritative open-source Chinese knowledge base or corpus; (2) Chinese entity extraction depends on Chinese word segmentation, and because Chinese has rich semantics and a more flexible grammar, disambiguation is harder than in English. As a result, existing approaches remain at the surface form of named entities and do not capture entity semantics well.
To address these problems, this thesis builds on current mainstream English entity linking techniques, takes the state of Chinese research into account, and proposes a context-based multi-feature graph model. (1) Chinese Wikipedia is chosen as the supporting knowledge base for the entity linking task, and Chinese corpus material is extracted from the official evaluation data that NIST (National Institute of Standards and Technology) provides for the KBP (Knowledge Base Population) track of TAC (Text Analysis Conference) to build the corpus and the experimental data set. (2) Multiple features between entities are extracted from two sources, the context of the entity mention and the Wikipedia database, and are quantified as semantic similarity. The semantic similarity is then fused into the constructed graph model, and the topic coherence property of the graph model is used to rank the candidate entities and complete the linking, with the aim of improving Chinese word segmentation accuracy and enriching entity semantic information. To evaluate the method, a recent state-of-the-art Chinese entity linking method is reproduced for comparison. The experimental results show that the proposed method effectively improves the accuracy and efficiency of entity linking and achieves good overall performance.
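The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the concrete features or the ranking algorithm, so everything below is an assumption made for illustration: character-bigram Jaccard overlap stands in for the semantic similarity features, personalized PageRank stands in for the graph ranking, and the function name `link_entities`, the entity ids, and the toy descriptions are all hypothetical.

```python
from itertools import combinations


def bigrams(text):
    """Character bigrams: a crude stand-in for word segmentation (assumption)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_entities(context, candidates, descriptions, damping=0.85, iters=50):
    """Choose one candidate per mention with personalized PageRank.

    candidates:   mention -> list of candidate entity ids (ids assumed unique)
    descriptions: entity id -> knowledge-base description text
    """
    nodes = [e for ents in candidates.values() for e in ents]
    owner = {e: m for m, ents in candidates.items() for e in ents}
    grams = {e: bigrams(descriptions[e]) for e in nodes}

    # Local feature: overlap between the mention context and each description.
    ctx = bigrams(context)
    raw = {e: jaccard(ctx, grams[e]) for e in nodes}
    total = sum(raw.values())
    if total > 0:
        prior = {e: r / total for e, r in raw.items()}
    else:
        prior = {e: 1.0 / len(nodes) for e in nodes}  # no evidence: uniform

    # Global feature: edges between candidates of different mentions,
    # weighted by how similar their descriptions are.
    weight = {e: {} for e in nodes}
    for a, b in combinations(nodes, 2):
        if owner[a] != owner[b]:
            w = jaccard(grams[a], grams[b])
            if w > 0:
                weight[a][b] = weight[b][a] = w

    # Power iteration; the restart vector is the normalized local similarity.
    score = dict(prior)
    for _ in range(iters):
        nxt = {e: (1 - damping) * prior[e] for e in nodes}
        for a in nodes:
            out = sum(weight[a].values())
            if out == 0:
                continue  # isolated node: its mass is simply dropped
            for b, w in weight[a].items():
                nxt[b] += damping * score[a] * w / out
        score = nxt

    return {m: max(ents, key=lambda e: score[e]) for m, ents in candidates.items()}


# Toy input (hypothetical data, not taken from the thesis).
context = "喬布斯發布了蘋果手機"
candidates = {
    "蘋果": ["Apple_Inc", "Apple_fruit"],
    "喬布斯": ["Steve_Jobs", "Jobs_film"],
}
descriptions = {
    "Apple_Inc": "生產手機的科技公司",
    "Apple_fruit": "一種常見的水果",
    "Steve_Jobs": "科技公司的創始人",
    "Jobs_film": "一部傳記電影",
}
result = link_entities(context, candidates, descriptions)
print(result)
```

In this toy input the description of `Steve_Jobs` shares no bigram with the context, so its local similarity is zero. It is still selected because it is connected to `Apple_Inc` in the graph, which is the kind of effect the topic coherence of a graph model is meant to provide. A real system would replace the bigram overlap with the richer context and Wikipedia features the abstract mentions.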
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1
Article ID: 2203297
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2203297.html