Hot Topic Extraction Based on Weibo
[Abstract]: With the rapid development of the Internet, Weibo has become a new force in the dissemination of information. With its huge user base and distinctive user relationship structure, it exerts great influence and plays an increasingly important role in people's social life. At present, Sina Weibo alone sees tens of millions, or even hundreds of millions, of posts published in a single day, so extracting the hot topics hidden in this mass of data by hand is very difficult. Processing Weibo information automatically and mining hot topics from it in a timely manner is therefore of great significance for understanding the latest focal points of public opinion and for tracking its trends.

The traditional TF-IDF approach to topic extraction suffers from high feature dimensionality and sparse data, and it cannot capture relationships between words at the semantic level. The probabilistic topic model LDA (Latent Dirichlet Allocation) holds that each document can contain more than one topic and that the probability of generating a given word differs from topic to topic. Compared with other text models, LDA is better suited to practical applications and has greater descriptive power.

This paper studies topic mining and extraction for Weibo. The specific work includes:
1. After a study of various text modeling methods, LDA is selected as the final model. The LDA model is solved by Gibbs sampling to obtain the topic distribution vector of each Weibo text. Using the topic distribution vector as the text feature effectively reduces the dimensionality of the data and provides low-dimensional, high-quality input for the subsequent clustering algorithm.
2. The Single-Pass clustering algorithm is improved so that clustering quality is preserved while the time efficiency of clustering increases.
3. Topic-word extraction for text clusters is studied, and a similarity measure based on a word co-occurrence model is proposed. The resulting similarity matrix is used for hierarchical clustering analysis, and the largest cluster is selected as the topic phrase that best represents the content of the Weibo texts.
4. A hot topic extraction system is completed. It combines a web crawler, a database module, a word segmentation module, a clustering module, and a topic-word extraction module, and extracts Weibo hot topics automatically.
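As an illustration of the clustering step, the following is a minimal sketch of the baseline Single-Pass algorithm applied to topic distribution vectors. It is not the improved variant proposed in the thesis, whose details the abstract does not give. The cosine similarity measure, the running-mean centroid update, the threshold value of 0.8, and the names `cosine` and `single_pass` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass(vectors, threshold=0.8):
    """Baseline Single-Pass clustering (assumed form, not the thesis's variant).

    Each vector is compared with the centroids of the existing clusters in
    arrival order. It joins the most similar cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Update the centroid as the running mean of its members.
            n = len(best["members"])
            best["centroid"] = [
                (c * n + x) / (n + 1) for c, x in zip(best["centroid"], vec)
            ]
            best["members"].append(idx)
        else:
            clusters.append({"centroid": list(vec), "members": [idx]})
    return clusters


# Hypothetical topic distribution vectors for four posts over three topics.
topic_vectors = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
]
clusters = single_pass(topic_vectors, threshold=0.8)
print([c["members"] for c in clusters])
```

Because each post is compared only with the current cluster centroids and the data is read once, the cost grows with the number of clusters rather than with the number of post pairs. The result does depend on the order in which posts arrive and on the chosen threshold.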
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1; TP393.092
Article link: http://sikaile.net/guanlilunwen/ydhl/2243474.html