Research and Implementation of Microblog Hot Topic Discovery Technology
Published: 2018-07-20 15:37
[Abstract]: With the rapid growth of Web 2.0 and social networking sites, the Internet has entered an entirely new era of "self-media". Microblogging sites such as Sina Weibo and Twitter have become a focus of attention, but the enormous volume of information they produce also causes problems: extracting the latest hot topics from a massive microblog stream has become an urgent need.
By analyzing the characteristics of microblog posts and drawing on topic detection and tracking methods from China and abroad, this work first improves the single-pass clustering algorithm. The improved algorithm computes the centroid of the microblog stream and filters out the large number of posts that lie too far from it, which effectively reduces computational complexity. This addresses the problem that clustering large sample sets is too expensive to run in real time, and it also mitigates a known weakness of single-pass clustering, namely that its accuracy depends heavily on the order in which samples arrive. Second, the naive Bayes classification technique is improved, and a method is proposed for raising classification accuracy when microblog texts are short and have few features. Finally, in text feature extraction, search engine technology is used to compute the mutual information needed when selecting feature terms, which solves the problem that mutual information is hard to compute for large-scale short texts.
A microblog hot topic discovery platform was built, and long-term use shows that the technique achieves good results. Compared with traditional algorithms, it is better suited to microblog platforms, offering high speed, high accuracy, and real-time computation over large data volumes, and it has considerable theoretical significance and practical value.
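
The abstract does not give the thesis's distance measure, term weighting, or thresholds, so the following is only a minimal sketch of the general idea of single-pass clustering preceded by a centroid-based filter. It assumes cosine similarity over raw term counts and whitespace tokenization (Chinese text would first need a word segmenter); the function names and both threshold parameters are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_with_filter(posts, filter_threshold, cluster_threshold):
    """Cluster posts in one pass after dropping posts far from the stream centroid.

    Returns (clusters, dropped): clusters is a list of lists of post indices,
    dropped is the list of indices removed by the centroid filter.
    Assumption: whitespace tokenization and raw term counts.
    """
    vectors = [Counter(p.split()) for p in posts]

    # Stream centroid. Cosine similarity ignores vector length, so the
    # summed term counts stand in for the mean vector.
    stream_sum = Counter()
    for v in vectors:
        stream_sum.update(v)

    kept, dropped = [], []
    for i, v in enumerate(vectors):
        if cosine(v, stream_sum) >= filter_threshold:
            kept.append(i)
        else:
            dropped.append(i)

    # Classic single-pass step: assign each remaining post to the most
    # similar existing cluster, or open a new cluster if none is close enough.
    cluster_sums, clusters = [], []
    for i in kept:
        v = vectors[i]
        best_idx, best_sim = -1, 0.0
        for c, csum in enumerate(cluster_sums):
            sim = cosine(v, csum)
            if sim > best_sim:
                best_idx, best_sim = c, sim
        if best_idx >= 0 and best_sim >= cluster_threshold:
            cluster_sums[best_idx].update(v)
            clusters[best_idx].append(i)
        else:
            cluster_sums.append(Counter(v))
            clusters.append([i])
    return clusters, dropped
```

Because the filter runs before clustering, posts unrelated to the bulk of the stream never enter the per-cluster comparison loop, which is where a single-pass algorithm spends most of its time. In a true streaming setting the stream centroid would have to be maintained incrementally rather than computed over the whole batch as done here.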
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.092
Article ID: 2133994
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2133994.html