
Weibo Topic Mining Based on Topic Models

Published: 2018-08-23 10:47
[Abstract]: With the steady growth in microblog users, Twitter abroad and Sina Weibo in China have become important platforms for media outlets and individuals to publish information. A Weibo post is an unusual kind of text: it is usually shorter than 140 characters, carries rich social information, and mixes topic-bearing text with redundant text that has no power to represent a topic. Traditional text mining algorithms therefore extract Weibo topics poorly. This thesis combines Chinese part-of-speech (POS) tagging with the LDA (Latent Dirichlet Allocation) topic model for Weibo topic extraction, and uses incremental clustering to determine the number of topics and to cluster the posts. POS tagging filters out the words in a post that have no power to represent a topic. The LDA topic model represents each text in a low-dimensional topic space, so that topics are mined at the semantic level. Incremental clustering discovers the number of topics without requiring it to be specified in advance. Experiments show that, compared with traditional text analysis methods, combining Chinese POS tagging, the LDA topic model, and incremental clustering improves the accuracy of topic discovery.

The main work of this thesis is as follows:

(1) It analyzes topic extraction methods based on traditional text models, and the experimental results show the strengths and weaknesses of those models. It then proposes a method for Weibo topic detection and extraction based on the LDA topic model.

(2) While performing Weibo topic detection with the LDA topic model, it finds that text preprocessing is critical to Weibo topic extraction: many posts contain large amounts of content unrelated to the topic, which interferes with extraction. It proposes combining LDA-based Weibo topic extraction with Chinese POS tagging to improve the precision and accuracy of topic extraction, and confirms experimentally that POS tagging improves accuracy.

(3) It analyzes a shortcoming of the clustering methods used in traditional topic extraction, namely that the number of topics must be specified, and therefore uses the incremental clustering method single-pass for topic clustering. Building on single-pass, it proposes a batch-processing idea to improve the algorithm. Experimental comparison shows that the improved single-pass clustering algorithm can effectively discover the number of topics.
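The two steps around the LDA model, POS-based filtering and single-pass clustering, can be illustrated with a minimal sketch. It is not the thesis's implementation. The names `KEEP_TAGS`, `filter_by_pos`, and `single_pass` are hypothetical, as are the choice of tags to keep, the use of cosine similarity, the running-mean centroid update, and the threshold of 0.8. The sketch assumes the tokens have already been POS-tagged and that each post already has a topic-distribution vector from an LDA model fitted elsewhere. The batch-processing improvement is not shown, because the abstract does not describe it.

```python
import math

# Hypothetical set of topic-bearing tags (nouns, proper nouns, verbs), written in
# the ICTCLAS-style tag names that jieba also uses. The source does not say which
# tags the thesis keeps.
KEEP_TAGS = {"n", "nr", "ns", "nt", "nz", "v", "vn"}


def filter_by_pos(tagged_tokens, keep_tags=KEEP_TAGS):
    """Keep only the words whose POS tag is in keep_tags."""
    return [word for word, tag in tagged_tokens if tag in keep_tags]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass(vectors, threshold=0.8):
    """Cluster vectors in one pass without fixing the number of clusters.

    Each vector joins the most similar existing cluster if the similarity to
    that cluster's centroid reaches the threshold; otherwise it opens a new
    cluster. Returns (labels, centroids).
    """
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of members in each cluster
    labels = []     # cluster index assigned to each input vector
    for vec in vectors:
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            # Fold the new member into the running mean of its cluster.
            n = sizes[best_idx]
            centroids[best_idx] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_idx], vec)
            ]
            sizes[best_idx] = n + 1
            labels.append(best_idx)
        else:
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids


# Toy input: (word, tag) pairs for one post, and made-up topic vectors for five posts.
tagged = [("北京", "ns"), ("的", "u"), ("地铁", "n"), ("很", "d"), ("拥挤", "a")]
kept_words = filter_by_pos(tagged)

topic_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
    [0.85, 0.15, 0.0],
]
labels, centroids = single_pass(topic_vectors, threshold=0.8)
```

The number of clusters falls out of the threshold rather than being passed in, which is the property the abstract attributes to incremental clustering. A lower threshold merges more posts into fewer topics, and the result also depends on the order in which posts arrive.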
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP393.092; TP391.1



Article ID: 2198849


Article link: http://sikaile.net/guanlilunwen/ydhl/2198849.html

