Research on Wikipedia-Based Entity Linking Algorithms and System Implementation

Published: 2018-10-14 16:41

[Abstract]: The Internet has entered an age of information explosion: the volume of information is huge, its forms are diverse, and its content is complex. Accurately retrieving the information a user needs from this mass of data is a pressing problem. Natural language, however, is pervasively ambiguous. Entity ambiguity is the linguistic phenomenon in which the same entity mention refers to different real-world entities in different contexts, and resolving it helps in understanding text. Entity linking is the task of correctly linking the names of people, places, and organizations that appear in web pages, microblog posts, or dialogues to the corresponding entities in a knowledge base. It mainly addresses entity disambiguation arising from synonymy and polysemy, and it is important for information retrieval, automatic question answering, and knowledge base completion. This thesis studies the core problem of entity linking, namely ranking the candidate entities of an entity mention. Its main work and contributions are summarized as follows.

1. Two candidate entity ranking algorithms are proposed: one combines LDA with random walk with restart, and the other combines Word2Vec with PageRank. Traditional candidate entity ranking algorithms tend to stop at feature extraction: they extract a large number of features and then train a model by supervised learning, which is cumbersome. Their features are also often shallow, such as string similarity, and ignore the semantic similarity between entities. To address these problems, the thesis exploits the link structure of Wikipedia, based on the observation that entities under the same topic tend to be linked together, as do entities that are semantically more related. Both algorithms use the graph structure of the Wikipedia pages in which the entities occur. Random walk with restart yields a vector for each candidate entity, whereas PageRank yields a PR value for each candidate entity. The former incorporates each entity's topic feature vector and the latter incorporates the semantic similarity between entities, so both add semantic features on top of a graph model. Experiments show that the two algorithms achieve higher entity linking accuracy than mainstream candidate entity ranking algorithms.

2. Combining the two candidate entity ranking algorithms, an entity linking system, LEL, is developed. The system links entities in text to the Wikipedia knowledge base and is highly interactive.
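The graph-based ranking idea in the abstract can be illustrated with a small sketch. It is not the thesis's implementation, whose details the abstract does not give. It assumes that each Wikipedia hyperlink is weighted by the cosine similarity of two entity vectors (which could come from LDA topic distributions or Word2Vec embeddings), and that random walk with restart and PageRank are run as the same power iteration with different restart distributions. The function names, the toy graph, and the two-dimensional vectors are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_transitions(links, vectors):
    """Weight each hyperlink u -> v by the similarity of the two entity vectors,
    then normalise every row so the outgoing weights of u sum to 1.
    Every link target must also appear as a key of `links`."""
    trans = {}
    for u, targets in links.items():
        weights = {v: max(cosine(vectors[u], vectors[v]), 0.0) for v in targets}
        total = sum(weights.values())
        if total == 0.0:  # no usable similarity: fall back to uniform link weights
            weights, total = {v: 1.0 for v in targets}, float(len(targets))
        trans[u] = {v: w / total for v, w in weights.items()}
    return trans


def walk(trans, restart_dist, restart=0.15, iters=100):
    """Power iteration for p = (1 - restart) * P^T p + restart * r.
    `restart_dist` (r) must sum to 1. Mass at a node without out-links
    is sent back to the restart distribution."""
    r = {n: restart_dist.get(n, 0.0) for n in trans}
    p = dict(r)
    for _ in range(iters):
        dangling = sum(p[u] for u in trans if not trans[u])
        nxt = {n: (restart + (1.0 - restart) * dangling) * r[n] for n in trans}
        for u, row in trans.items():
            for v, w in row.items():
                nxt[v] += (1.0 - restart) * p[u] * w
        p = nxt
    return p


# Toy link graph (assumption): one context entity linking to two candidates
# for the mention "Jordan".
links = {
    "Chicago_Bulls": ["Michael_Jordan", "Michael_I._Jordan"],
    "Michael_Jordan": ["Chicago_Bulls"],
    "Michael_I._Jordan": ["Chicago_Bulls"],
}
vectors = {  # toy 2-d vectors: (sports, science)
    "Chicago_Bulls": [1.0, 0.0],
    "Michael_Jordan": [0.9, 0.1],
    "Michael_I._Jordan": [0.1, 0.9],
}

trans = semantic_transitions(links, vectors)

# Random walk with restart: always restart at the context entity.
rwr_scores = walk(trans, {"Chicago_Bulls": 1.0})

# PageRank-style scores: restart uniformly over all nodes.
uniform = {n: 1.0 / len(trans) for n in trans}
pagerank_scores = walk(trans, uniform)

candidates = ["Michael_Jordan", "Michael_I._Jordan"]
best = max(candidates, key=rwr_scores.get)
print(best)
```

With unweighted links the two candidates would tie in this symmetric graph. The similarity weights are what separate them, which is the role the abstract assigns to the semantic features.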
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1

[Citation]: Luo Nian. Research on Wikipedia-Based Entity Linking Algorithms and System Implementation [D]. East China Normal University, 2016.


Article ID: 2271020


Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2271020.html
