Research on Weibo Hot Topic Discovery Based on the VSM-BTM Topic Model
Topic of this thesis: Weibo + topic detection; Source: Southwest University, master's thesis, 2017
【Abstract】: With the rapid development of the Internet, Weibo has attracted wide attention as a social media platform. However, efficiently extracting useful information from massive, irregular Weibo data for topic discovery remains an urgent open problem, and this has motivated the use of topic models to mine Weibo data. Topic models have been studied extensively, but existing methods still have two main shortcomings. First, their computational complexity is too high, so they are inefficient on big-data-scale Weibo corpora. Second, some topic models (such as the traditional LDA model) give low accuracy when clustering short texts such as Weibo posts. To address this, this thesis proposes a Sina Weibo data mining method that combines an improved VSM model, the BTM topic model, and an improved K-Means clustering method suited to Weibo data, with the aim of improving mining accuracy while preserving computational efficiency.
The study of the VSM-BTM method is divided into three parts: preprocessing of Weibo data, VSM-BTM modeling, and a clustering method suited to Weibo. Preprocessing includes word segmentation, stop-word removal, and deletion of noisy data; the results are saved as txt files that serve as the input to topic modeling. In VSM-BTM modeling, the existing BTM topic model is first applied, iterating over the preprocessed data to obtain the "document-topic" matrix and the "topic-word" matrix. In addition, using the vocabulary table generated by BTM and the encoded form of the Weibo data, a method is proposed that combines JS distance and cosine distance to compute the similarity between Weibo posts. The clustering step applies an improved K-Means method to the modeling results; the traditional K-Means algorithm is improved in how the initial clusters are selected from the existing Weibo data and in how distances are computed. Finally, precision, recall, and F1 are used to evaluate the experimental results.
Modeling with VSM-BTM avoids the data sparsity problem of Weibo and requires no external information to expand the Weibo data, which reduces dependence on information outside the text. The experiments compare hot topic discovery with a plain LDA topic model, a plain BTM topic model, and the proposed VSM-BTM topic model, using precision, recall, and F1 as the basis for comparison. The proposed VSM-BTM model outperforms both plain LDA and plain BTM on every metric, which demonstrates the effectiveness of the modeling and clustering approach: without increasing computational complexity, its accuracy exceeds that of the other two existing Weibo data mining methods.
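The similarity step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the JS distance and the cosine distance are combined, so the linear weighting by `lam` is an assumption, as are all function and parameter names. `biterms` shows the unordered word pairs that BTM models in place of whole documents; the topic vectors stand in for rows of the "document-topic" matrix and the VSM vectors for term-weight vectors of two posts.

```python
import math
from itertools import combinations


def biterms(tokens):
    """Return every unordered pair of tokens (biterm) from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def combined_distance(topics_a, topics_b, vsm_a, vsm_b, lam=0.5):
    """Hypothetical combination: weighted sum of topic-level and word-level distance."""
    return (lam * js_divergence(topics_a, topics_b)
            + (1 - lam) * cosine_distance(vsm_a, vsm_b))
```

With non-negative VSM weights and `lam` between 0 and 1, both components lie in [0, 1], so the combined value does as well and could serve as the distance function in a K-Means-style assignment step.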
【Degree-granting institution】: Southwest University
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.1;TP393.092
Article ID: 1738822
Article link: http://sikaile.net/guanlilunwen/ydhl/1738822.html