Research and Implementation of Microblog Hot Topic Discovery Technology

Published: 2018-07-20 15:37

[Abstract]: With the rapid growth of Web 2.0 and social networking sites, the Internet has entered a new era of "self-media". Microblogging sites such as Sina Weibo and Twitter have become a focus of public attention, but the sheer volume of information they produce is also a burden, and extracting the latest hot topics from a massive stream of posts has become a pressing need. Based on an analysis of the characteristics of microblog posts and on existing topic detection and tracking methods from China and abroad, this thesis makes three contributions. First, it improves the single-pass clustering algorithm: the algorithm computes the centroid of the microblog stream and filters out the many posts that lie too far from it. This effectively reduces computational complexity, addresses the problem that clustering large sample sets is too expensive to run in real time, and lessens the dependence of single-pass clustering accuracy on the order in which samples arrive. Second, it improves naive Bayes classification, proposing a method that raises classification accuracy when posts are short and carry few features. Third, in text feature extraction, it uses search engine technology to compute the mutual information needed when extracting feature terms, which addresses the difficulty of computing mutual information over large-scale short texts. A microblog hot topic discovery platform was built, and long-term use indicates that the technique performs well: compared with traditional algorithms it is better suited to microblog platforms, offering high speed, high accuracy, and real-time computation over large data volumes, and it has considerable theoretical and practical value.
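The abstract gives no implementation details for the improved single-pass clustering, so the following is only a minimal sketch of the general idea, under stated assumptions: posts are already tokenized, vectors are raw term counts, similarity is cosine, and the function name `single_pass` and both threshold values are hypothetical choices for illustration rather than anything taken from the thesis.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    if not a or not b:
        return 0.0
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(posts, filter_threshold=0.3, cluster_threshold=0.5):
    """Cluster tokenized posts in one pass, skipping posts far from the stream centroid.

    Both thresholds are illustrative assumptions, not values from the thesis.
    Returns (clusters, skipped); clusters is a list of lists of post indices.
    """
    vectors = [Counter(tokens) for tokens in posts]

    # Centroid of the whole batch: summed term counts (cosine ignores scale).
    stream_centroid = Counter()
    for v in vectors:
        stream_centroid.update(v)

    centroids, clusters, skipped = [], [], []
    for i, v in enumerate(vectors):
        # Pre-filter: drop posts that are too dissimilar to the stream centroid.
        if cosine(v, stream_centroid) < filter_threshold:
            skipped.append(i)
            continue
        # Find the most similar existing cluster centroid.
        best, best_sim = -1, 0.0
        for c, centroid in enumerate(centroids):
            sim = cosine(v, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= cluster_threshold:
            clusters[best].append(i)
            centroids[best].update(v)  # summed counts serve as the centroid
        else:
            clusters.append([i])
            centroids.append(Counter(v))
    return clusters, skipped


posts = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "sichuan", "rescue", "team"],
    ["football", "match", "final"],
    ["football", "final", "goal"],
    ["lunch"],
]
clusters, skipped = single_pass(posts)
print(clusters, skipped)  # [[0, 1], [2, 3]] [4]
```

In this sketch the pre-filter costs one similarity computation per post, while each post that survives is compared against every existing cluster, so discarding outliers early is where the saving would come from. A streaming version would have to update the stream centroid incrementally instead of computing it over a full batch.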
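[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092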

Article ID: 2133994

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2133994.html