Research on Entity Linking Methods for Chinese Microblogs
Published: 2018-07-07 19:35
Topic: Weibo + entity linking; Source: master's thesis, Harbin Institute of Technology, 2013
【Abstract】: As web resources and the volume of online information keep growing, it becomes increasingly difficult for people to find valuable content, and the rise of microblogging makes it even harder to extract further information of interest from short texts. To address this problem, our research group developed a knowledge expansion and recommendation platform that supplies additional background information for content a user is interested in. The ambiguity of the knowledge items to be expanded, however, became the bottleneck of system performance. Entity linking is a key technique for resolving this problem: it lets a program automatically determine which real-world entity a mention appearing in context refers to, thereby achieving disambiguation. For the entity linking task on Chinese microblogs, a short-text domain, this thesis carries out work in the following areas.

To obtain a sufficient microblog corpus, we first implemented a web-page crawler for Weibo. Compared with acquisition through the API, it greatly improves collection efficiency; a large Weibo corpus was gathered and the corresponding preprocessing was performed.

Candidate entity generation is the key step of entity linking. For each mention to be disambiguated, candidate entities are gathered through several different channels, and each channel is assigned a different weight so that noise is suppressed and disambiguation accuracy improves. Candidate knowledge-base information comes mainly from Wikipedia and Baidu Baike; for terms absent from both encyclopedias, a meta-search component is invoked to aggregate information from the web. To alleviate the feature sparsity of microblog text, each post is first expanded with the author's profile, tags, and recent posts, and then further expanded with results retrieved from search engines such as Google, Baidu, and Bing using keywords extracted from the post.

We implemented an entity linking algorithm based on multi-channel candidate entities and another based on a domain lexicon. Comparative experiments show that the algorithms achieve satisfactory accuracy on the public NLPCC 2013 evaluation dataset. Finally, a knowledge expansion and recommendation application was built on the Sina Weibo open platform; the system's running results show that the output of the proposed algorithms achieves the expected effect.
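The multi-channel candidate generation described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the channel names, lookup functions, and weight values below are hypothetical, chosen only to show the idea of letting each acquisition channel (e.g. exact knowledge-base title match, alias/redirect lookup, meta-search) vote for candidates with a channel-specific weight, then linking the mention to the highest-scoring entity.

```python
# Illustrative sketch of weighted multi-channel candidate entity linking.
# Channel names and weights are assumptions for demonstration, not the
# thesis's tuned parameters.
CHANNEL_WEIGHTS = {"kb_title": 1.0, "alias": 0.7, "meta_search": 0.4}

def gather_candidates(mention, channels):
    """channels maps a channel name to a function: mention -> list of entity ids.

    Returns a dict of entity id -> accumulated weighted score, so that
    candidates proposed by noisier channels contribute less.
    """
    scores = {}
    for name, lookup in channels.items():
        weight = CHANNEL_WEIGHTS.get(name, 0.1)  # small default for unknown channels
        for entity in lookup(mention):
            scores[entity] = scores.get(entity, 0.0) + weight
    return scores

def link_entity(mention, channels):
    """Return the best-scoring candidate, or None (NIL) if no channel fires."""
    scores = gather_candidates(mention, channels)
    if not scores:
        return None
    return max(scores, key=scores.get)
```

A usage example with toy channels: for the mention "Apple", a knowledge-base title match plus an alias hit outweighs the meta-search vote for the fruit sense, so the company entity is selected.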
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year conferred】: 2013
【CLC classification】: TP393.092