Research on Microblog Retrieval Based on Semantic Similarity Computation and the Twitter Storm Platform
Published: 2018-07-03 18:22
Topic: microblog + semantic expansion; Source: Wuhan University of Technology, master's thesis, 2014
[Abstract]: With the rapid growth of the Internet in China and abroad, the microblog has become a social product used throughout the world and marks a new era in online communication. While offering users an open, centralized social service, it has gradually developed into a new medium with considerable influence. Because microblog data is large in scale and updated in real time, providing users with content that interests them from this massive, dynamically changing stream is particularly important.

The microblog retrieval discussed in this thesis, based on feature expansion and similarity computation, comprises three parts: (1) expanding the content of short microblog texts to enrich their semantic features, so that retrieval results are semantically relevant to the query keywords; (2) using the network structure of the WordNet machine-readable semantic dictionary to obtain more accurate semantic similarity values between microblogs; and (3) using the similarity value as the ranking criterion to simulate a real-time microblog retrieval process that retrieves microblogs for a keyword and supplies a list of related microblogs for each retrieved microblog.

To enrich microblog semantics, the thesis proposes a Wikipedia-based semantic feature expansion method. The method treats the nouns in a microblog as the keywords that express its topic and expands them with associated terms to enrich the information content. Specifically, Wikipedia serves as the expansion source: the categories listed under the "category" module of each noun's entry are added to the original microblog as expanded semantic features. Experiments show that this expansion method improves the quality of the similarity results to some extent.
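The category-based expansion step can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `expand_with_categories` and the tiny in-memory `TOY_CATEGORY_INDEX` (a stand-in for looking up Wikipedia entries) are hypothetical, and the nouns are assumed to have been extracted from the microblog already.

```python
from typing import Dict, List

# Hypothetical stand-in for Wikipedia: maps a noun to the categories
# listed on its entry. A real system would query Wikipedia instead.
TOY_CATEGORY_INDEX: Dict[str, List[str]] = {
    "storm": ["weather hazards", "meteorological phenomena"],
    "twitter": ["microblogging services", "social networking services"],
}


def expand_with_categories(nouns: List[str],
                           category_index: Dict[str, List[str]]) -> List[str]:
    """Return each noun followed by its categories, dropping duplicates.

    Nouns without an entry in the index are kept unchanged.
    """
    expanded: List[str] = []
    seen = set()
    for noun in nouns:
        # The noun itself comes first, then any categories found for it.
        for term in [noun] + category_index.get(noun.lower(), []):
            key = term.lower()
            if key not in seen:
                seen.add(key)
                expanded.append(term)
    return expanded


if __name__ == "__main__":
    print(expand_with_categories(["Twitter", "coffee"], TOY_CATEGORY_INDEX))
```

The expanded term list, rather than the original short text, would then be passed to the similarity computation.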
To obtain more accurate microblog similarity values, the thesis uses the network structure of WordNet, the English lexical database developed at Princeton University, to compute similarity based on microblog semantics. Specifically, we use the path-length-based method proposed in [37], which computes semantic similarity from the path lengths (depths) from the WordNet root node to the two words and to their nearest common node. In experiments, a comparison with the VSM-based cosine similarity method shows that this method improves, to some extent, the accuracy and recall of finding related microblogs.

To simulate real-time microblog retrieval, the thesis studies the architecture and application of Twitter Storm, an open-source real-time data processing platform, and uses its local mode to simulate real-time, distributed data processing. Specifically, the thesis defines its own microblog retrieval topology and implements the function of each node in it, including preprocessing of the Twitter data set, message passing between nodes, parallel similarity computation across multiple nodes with maintenance of a similarity table, ranking of retrieval results by similarity value, and supplying related microblogs for each retrieval result. In this way, microblog retrieval and ranking are embedded in the Twitter Storm platform.
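A depth-based similarity of the kind described above, followed by ranking on the resulting score, might look like the sketch below. The abstract does not give the formula from [37]; the code assumes the well-known Wu-Palmer form, 2 * depth(common) / (depth(a) + depth(b)), with the root at depth 1. The toy `PARENT` taxonomy replaces WordNet, and all names are hypothetical.

```python
from typing import Dict, List, Optional

# Toy taxonomy standing in for WordNet: each node maps to its parent,
# and the root maps to None.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """List the nodes from `node` up to the root (node first, root last)."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def depth(node: str, parent: Dict[str, Optional[str]]) -> int:
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(node, parent))


def nearest_common_node(a: str, b: str, parent: Dict[str, Optional[str]]) -> str:
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in ancestors_a:
            return node
    raise ValueError("nodes do not share a root")


def depth_similarity(a: str, b: str, parent: Dict[str, Optional[str]]) -> float:
    """Assumed Wu-Palmer form: 2 * depth(common) / (depth(a) + depth(b))."""
    common = nearest_common_node(a, b, parent)
    return 2.0 * depth(common, parent) / (depth(a, parent) + depth(b, parent))


def rank_by_similarity(query: str, candidates: List[str],
                       parent: Dict[str, Optional[str]]) -> List[str]:
    """Order candidates from most to least similar to the query."""
    return sorted(candidates,
                  key=lambda word: depth_similarity(query, word, parent),
                  reverse=True)


if __name__ == "__main__":
    print(rank_by_similarity("dog", ["car", "cat", "dog"], PARENT))
```

In the thesis, scores of this kind are computed in parallel across Storm nodes and kept in a similarity table. The sketch covers only the per-pair score and the final sort, on single words rather than whole microblogs.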
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP393.092; TP391.3
Article ID: 2094589
Article link: http://sikaile.net/guanlilunwen/ydhl/2094589.html