Microblog User Topic Mining Based on Distributed MBUT-LDA
Published: 2018-05-20 14:07
Topics: microblog + user topics. Source: master's thesis, Chongqing University, 2014
[Abstract]: As one of today's most mainstream social networking platforms, microblogging has become an important way for users to publish and obtain real-time information. Microblog topic modeling can mine, from massive volumes of information, the topics and the other users that a given user is interested in. However, because microblog messages are short, information is updated quickly, and the data volume is huge, traditional topic modeling methods cannot effectively extract the information users are really interested in.

Building on a study of existing topic modeling methods, this thesis proposes MBUT-LDA, a modeling method based on microblog users and the time dimension, where MB stands for MicroBlog, U for User, and T for Time. The method has the following characteristics:

⑴ Building on an analysis of existing topic models, and exploiting the fact that the topics of microblog messages are clearly concentrated in time, each user's microblog messages are partitioned by time. This addresses the incomplete information caused by short microblog texts and improves the accuracy of the mined user topics.

⑵ Based on an analysis of the relationships between microblog users and their friends, the concept of "attention degree" is introduced. Combined with the TF-IDF algorithm, a new weighting formula, ATF-IDF, is proposed to measure how well a microblog term predicts a topic.

⑶ The number of microblog users has grown sharply, and microblog platforms let users post instant messages from a variety of mobile clients. As a result, the collection of microblog documents is very large, and a single node easily runs into performance bottlenecks when analyzing it. Taking advantage of distributed and virtualization technology, the thesis deploys the proposed topic modeling method on the distributed computing platform Hadoop, implementing a Hadoop-based MBUT-LDA method for mining microblog user topics.

Using the proposed distributed MBUT-LDA method, the thesis trains a microblog topic model on a large number of microblog messages and, on the basis of the trained topics, mines the topics that microblog users are interested in. Experiments show that MBUT-LDA optimized with ATF-IDF achieves better generalization and topic accuracy than MBUT-LDA alone and than U-LDA (topic modeling based on microblog users). Analysis of distributed MBUT-LDA experiments with different numbers of users and different numbers of nodes shows that adding nodes effectively reduces data processing time and that the method can handle large volumes of data.
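The time-based partitioning described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the fixed weekly window (`window_days=7`), and the sample posts are all assumptions made for illustration. The weighting step is ordinary TF-IDF, because the ATF-IDF formula and the definition of "attention degree" are not given on this page.

```python
from collections import Counter, defaultdict
from datetime import datetime
import math


def pool_by_user_and_window(posts, window_days=7):
    """Merge each user's posts into one pseudo-document per time window.

    posts: iterable of (user, timestamp, tokens) tuples.
    The fixed-width window is an assumption, not the thesis's scheme.
    """
    pooled = defaultdict(list)
    for user, ts, tokens in posts:
        window = ts.toordinal() // window_days  # index of the fixed-width window
        pooled[(user, window)].extend(tokens)
    return dict(pooled)


def tf_idf(docs):
    """Plain TF-IDF per pseudo-document (NOT the thesis's ATF-IDF)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for key, tokens in docs.items():
        term_freq = Counter(tokens)
        weights[key] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights


# Hypothetical, already-tokenized sample posts.
posts = [
    ("alice", datetime(2014, 3, 3), ["hadoop", "cluster"]),
    ("alice", datetime(2014, 3, 4), ["hadoop", "lda"]),
    ("alice", datetime(2014, 4, 20), ["football", "match"]),
    ("bob", datetime(2014, 3, 3), ["football", "goal"]),
]
docs = pool_by_user_and_window(posts)
weights = tf_idf(docs)
```

Here the two posts from early March by the same user land in one pseudo-document, giving a topic model more text per document than a single short message would. A distributed version would run the pooling as a grouping step keyed on (user, window), which is the kind of work a Hadoop job is suited to.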
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092; TP391.1
Article ID: 1914921
Article link: http://sikaile.net/guanlilunwen/ydhl/1914921.html