A Multi-Strategy Study of Linking Weibo Entities to Encyclopedia Entries

Published: 2018-07-14 12:13
[Abstract]: In recent years, with the rise of Web 2.0 technology and the Internet industry, social networks have developed at an unprecedented pace. Weibo, a new type of social networking platform that emerged from this trend, has seen rapid growth in both its user base and the volume of data it generates. Web 2.0 technology has also driven the rapid development of online encyclopedias, and how to use social media and web content to build and expand knowledge bases has become an active research topic. The ambiguity of the entity entries to be expanded is a key difficulty in this area, and entity linking is an important way to address it. Targeting the short, informal, and non-standard language of Chinese Weibo posts, this thesis proposes a multi-strategy method for entity linking and disambiguation on Chinese Weibo.

Linking Chinese Weibo entities to encyclopedia entries means matching the named entities to be tested in Weibo posts against the entries of an encyclopedia knowledge base, so that each entity appearing in a post is linked to the correct entry. This work falls under named entity disambiguation (NED), a subtopic of named entity recognition (NER), and is an active and foundational research topic in natural language processing (NLP). Improving the accuracy of entity linking and disambiguation on Chinese Weibo helps build and expand online encyclopedia knowledge bases, and contributes to the generality and performance of natural language processing systems.

The main research content is the evaluation task of the CCF Conference on Natural Language Processing and Chinese Computing (NLPCC), organized by the China Computer Federation (CCF), in which the author participated. A web crawler was written to collect Weibo content and online encyclopedia pages, from which an encyclopedia entity mapping table was built and the knowledge base of encyclopedia entries was organized. Person-name entities are disambiguated with a topic-model-based algorithm that uses the LDA model. Chinese Weibo entities in general are disambiguated by combining four algorithms: matching against the entity mapping table, TF-IDF-based entity sense features, entity sense tags, and the Fast-Newman clustering model.

The main contributions of this thesis are: (1) building and organizing the knowledge base of encyclopedia entries and the entity mapping table; (2) proposing a topic-model-based algorithm for person-name disambiguation; (3) proposing a multi-level, multi-strategy entity disambiguation algorithm; (4) implementing a Chinese Weibo entity recognition system and an encyclopedia knowledge base program, and applying for software copyright.

The data come from the Chinese Weibo entity linking tasks of the second and third NLPCC conferences (NLPCC 2013 and 2014). In the 2013 evaluation, the knowledge base contained 44,492 entities and there were 1,274 entities to be tested. In the 2014 evaluation, the knowledge base contained 378,207 entities and there were 607 entities to be tested. In 2013 the accuracy was 84.99%, placing 6th and 7th among the 18 result sets submitted nationwide, with the team ranked 3rd. In 2014 the accuracy was 84.02%, with the team again ranked 3rd. After subsequent review and improvement, the models and algorithms described in this thesis reach an accuracy of 91.40%.
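The abstract names the strategies but gives no implementation details, so the following is only a minimal illustrative sketch of how two of them could be cascaded: a lookup in an entity mapping table first, then a TF-IDF cosine-similarity comparison between the post's context and each candidate entry when the lookup is ambiguous. The function names, the toy `MAPPING` and `KB` data, the smoothed IDF formula, and the use of pre-tokenized English words (instead of segmented Chinese text) are all assumptions made for illustration, not the thesis's actual system.

```python
import math
from collections import Counter

# Hypothetical toy data: mention -> candidate entries, entry -> tokenized description.
MAPPING = {
    "Apple": ["Apple (fruit)", "Apple Inc."],
    "SWU": ["Southwest University"],
}
KB = {
    "Apple (fruit)": ["fruit", "tree", "sweet", "red", "orchard"],
    "Apple Inc.": ["company", "iphone", "technology", "mac", "phone"],
    "Southwest University": ["university", "chongqing", "education"],
}

def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant) so weights stay positive.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link_entity(mention, context_tokens, mapping=MAPPING, kb=KB):
    """Two-level cascade: mapping-table lookup, then TF-IDF similarity."""
    candidates = mapping.get(mention, [])
    if not candidates:
        return None  # mention is not in the mapping table
    if len(candidates) == 1:
        return candidates[0]  # unambiguous: the table alone decides
    # Ambiguous: score every candidate description against the post context.
    vectors = tf_idf_vectors([kb[c] for c in candidates] + [context_tokens])
    context_vec = vectors[-1]
    scores = [cosine(context_vec, v) for v in vectors[:-1]]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

print(link_entity("Apple", ["new", "iphone", "release", "phone"]))
```

In this sketch the cheap table lookup resolves unambiguous mentions and the similarity step runs only for ambiguous ones, which is one plausible reading of "multi-level"; the tag-based, LDA-based, and clustering-based strategies mentioned in the abstract are not shown.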
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.1; TP393.092


Article ID: 2121609


Article link: http://sikaile.net/guanlilunwen/ydhl/2121609.html

