Research on a Weibo Hot Topic Detection Method Based on Maximal Tree Partitioning
Topic keywords: Weibo + hot topic detection; Source: Chongqing University, master's thesis, 2014
[Abstract]: With the rapid development of traditional and mobile Internet technologies, the speed and scale of online information dissemination have grown enormously, and the way people communicate has changed accordingly. Weibo, a rapidly rising new network medium, is attracting more and more attention. As a platform for spreading messages and interacting, Weibo generates a large amount of information in a short time, so users easily become absorbed in a local subset of posts and lose track of the latest developments across the whole Weibo space. Given this vast volume of posts, how to quickly and accurately identify the hot topics of the entire Weibo community has become an important research direction.
Traditional topic detection techniques are relatively mature and can help users quickly extract topics hidden in large collections of long texts. However, they have clear shortcomings when applied to massive collections of short Weibo posts. First, the computational complexity is too high: computing pairwise text similarity across a massive number of posts is prohibitive for a traditional topic detection system. Second, the semantic information of words is lost: traditional topic detection models judge document similarity only by the number of words that documents share and ignore the semantic associations between words.
To address these problems, this thesis studies the theory and algorithms of Weibo hot topic detection, analyzes the strengths and weaknesses of existing detection algorithms, and, taking the characteristics of Weibo into account, proposes a Weibo hot topic detection method based on maximal tree partitioning. Extensive experiments on a collected Weibo data set verify the effectiveness of the method. The main contributions are as follows:
(1) The thesis proposes detecting topics only within the Weibo data of a given time period. This matches what real Weibo systems require of hot topic detection and removes the influence of existing historical topics on the detection of new topics.
(2) The thesis improves the calculation of feature-term weights and of Weibo similarity. Incorporating semantic similarity between words into the existing calculation methods reduces the errors caused by polysemy (one word with several meanings) and synonymy (one meaning expressed by several words) in Chinese Weibo posts, and so improves accuracy.
(3) The thesis proposes a Weibo hot topic detection method based on maximal tree partitioning. Generating a maximal tree from the fuzzy similarity matrix effectively removes spurious, noisy similarity values between posts and reduces the scale of the computation. An improved K-means clustering algorithm determines the number of clusters automatically, which makes the clustering results more accurate. In addition, a method for calculating the heat of a Weibo topic is proposed; topics are ranked by heat to identify the hot ones.
(4) Overall execution efficiency and accuracy improve over other Weibo topic detection methods, which effectively alleviates the inefficiency of traditional topic detection algorithms on large-scale data.
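The maximal tree step named in the third contribution can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithm, so what follows is the textbook maximum-tree method from fuzzy clustering, not the author's implementation: build a maximum spanning tree over the similarity matrix with Prim's algorithm, cut every tree edge whose similarity falls below a threshold, and read off the connected components. The function name `maximal_tree_partition`, the threshold parameter `lam`, and the sample matrix are all hypothetical, and the improved K-means and topic heat stages are not shown.

```python
from typing import Dict, List, Tuple


def maximal_tree_partition(sim: List[List[float]], lam: float) -> List[List[int]]:
    """Group items by cutting a maximum spanning tree of `sim` at threshold `lam`.

    `sim` is assumed to be a symmetric similarity matrix with values in [0, 1].
    Both names are illustrative; they do not come from the thesis.
    """
    n = len(sim)
    if n == 0:
        return []

    # Prim's algorithm, but always attaching the outside node that has the
    # HIGHEST similarity to the tree built so far (a maximum spanning tree).
    in_tree = [False] * n
    best_sim = [float("-inf")] * n  # best similarity from each outside node to the tree
    best_from = [-1] * n            # tree node that provides that best similarity
    in_tree[0] = True
    for j in range(1, n):
        best_sim[j] = sim[0][j]
        best_from[j] = 0

    edges: List[Tuple[int, int, float]] = []
    for _ in range(n - 1):
        v = max((j for j in range(n) if not in_tree[j]), key=lambda j: best_sim[j])
        edges.append((best_from[v], v, best_sim[v]))
        in_tree[v] = True
        for j in range(n):
            if not in_tree[j] and sim[v][j] > best_sim[j]:
                best_sim[j] = sim[v][j]
                best_from[j] = v

    # Keep only tree edges with similarity >= lam, then collect the
    # connected components with a small union-find structure.
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, w in edges:
        if w >= lam:
            parent[find(u)] = find(v)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up similarities for five posts: 0-2 are mutually similar, as are 3-4.
example_sim = [
    [1.0, 0.8, 0.7, 0.1, 0.2],
    [0.8, 1.0, 0.6, 0.2, 0.1],
    [0.7, 0.6, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.9],
    [0.2, 0.1, 0.2, 0.9, 1.0],
]
print(maximal_tree_partition(example_sim, 0.5))  # → [[0, 1, 2], [3, 4]]
```

The tree keeps only n - 1 of the n(n - 1)/2 pairwise similarities, which is one way to read the abstract's claim that the maximal tree discards noisy similarity values and reduces the scale of the computation. Raising `lam` splits the posts into more, tighter groups; lowering it merges them.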
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092; TP391.1