
Hot Topic Extraction Based on Weibo

Published: 2018-09-14 18:09
[Abstract]: With the rapid development of the Internet, Weibo has become an influential information platform. Its huge user base and distinctive user-relationship structure give it an increasingly important role in people's social lives, and it has become a new force in information dissemination. Sina Weibo alone currently sees tens of millions, or even hundreds of millions, of posts published per day. At this scale, manual processing cannot extract the hot topics hidden in the data in a timely way. Using computers to process Weibo data automatically and to mine hot topics from it promptly is therefore of great value for understanding the latest focal points of public opinion and tracking its trends.

The traditional TF-IDF topic extraction method has high feature dimensionality and sparse data, and it cannot explain the relationships between words at the semantic level. The probabilistic topic model LDA (Latent Dirichlet Allocation) assumes that each document can contain multiple topics and that words have different generation probabilities under different topics. Compared with other text models, LDA better matches practical conditions and describes text better. This thesis studies the mining and extraction of Weibo topics. The specific work includes:

1. After studying various text modeling methods, LDA was selected as the final model. Solving the LDA model with Gibbs sampling yields the topic distribution vector of each Weibo text. Using topic distribution vectors as text features effectively reduces the dimensionality of the data and provides the subsequent clustering algorithm with low-dimensional, highly discriminative data.
2. The Single-Pass clustering algorithm was improved, raising the time efficiency of clustering while preserving clustering quality.
3. Topic word extraction algorithms for text clusters were studied, and a similarity measure based on a word co-occurrence model was proposed. Hierarchical clustering is performed on the similarity matrix, and the largest cluster is selected as the topic phrase that best represents the content of the Weibo text cluster.
4. A hot topic extraction system was completed that integrates a web crawler, a database module, a word segmentation module, a clustering module, and a topic word extraction module, achieving automatic extraction of Weibo hot topics.
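As an illustration of the clustering step in item 2, the following is a minimal sketch of the textbook Single-Pass algorithm applied to topic distribution vectors. It is not the improved variant developed in the thesis, whose details the abstract does not give. The cosine similarity measure, the running-mean centroid update, the `threshold` value of 0.8, and the toy three-topic vectors are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(vectors, threshold=0.8):
    """Textbook Single-Pass clustering (assumed baseline, not the thesis's variant).

    Each vector is visited once, in order. It joins the most similar
    existing cluster if the similarity reaches `threshold`; otherwise
    it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            # Update the centroid as the running mean of its members.
            best["centroid"] = [
                (c * n + x) / (n + 1) for c, x in zip(best["centroid"], vec)
            ]
            best["members"].append(idx)
        else:
            clusters.append({"centroid": list(vec), "members": [idx]})
    return clusters


# Hypothetical topic distribution vectors for five posts over three topics.
docs = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90],
]

clusters = single_pass(docs, threshold=0.8)
# Larger clusters are treated here as candidate hot topics.
hot_first = sorted(clusters, key=lambda c: len(c["members"]), reverse=True)
print([c["members"] for c in hot_first])
```

Because every post is compared against every existing centroid, the cost grows with the number of clusters. That is presumably the kind of overhead an efficiency-oriented improvement would target, though the abstract does not say how.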
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1; TP393.092

