
A Hot Topic Discovery Method Based on an Improved H-K Clustering Algorithm

Published: 2019-04-04 17:24
[Abstract]: With the rapid development of social networks, microblogs have become one of the platforms for everyday communication and information dissemination. Within a very short time, a microblog platform can generate massive data sets whose information is scattered, and it is difficult for microblog users to pick out hot topics from this volume of text. How to mine hot topics from massive microblog text data sets quickly and accurately has therefore become an active research problem. Traditional topic discovery methods are usually based on feature-word matching and do not consider the latent semantics of microblog text, so the quality of the topics they discover is low. In view of the characteristics of microblogs, this thesis studies microblog hot topic discovery from a semantic perspective and proposes a topic discovery method based on an improved H-K clustering algorithm.

First, the thesis improves the H-K clustering algorithm used in hot topic discovery, based on the time-scale characteristics of microblog text data sets and the persistence of topics. To cope with massive microblog data sets, the algorithm is parallelized following the MapReduce programming model in Hadoop, which improves the efficiency of clustering. Second, the thesis analyzes microblog text at the semantic level: an LDA topic model is introduced to convert unstructured microblog text into a text-topic distribution and a topic-feature-word distribution, which reduces the dimensionality of the text, models microblogs semantically, and improves the accuracy of text similarity calculation. In this modeling stage, the LDA topic model is also parallelized with MapReduce to improve the capacity to process microblog data sets.

Experiments show that the improved H-K clustering algorithm clearly improves clustering quality as well as time efficiency, is better suited to clustering microblog text, and addresses the low efficiency of traditional clustering algorithms. Introducing a cloud computing platform improves the capacity to process massive microblog text data sets. The proposed hot topic discovery method can find hot topics in microblog data sets quickly and accurately from the latent semantics of the feature words in microblog text.
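
The abstract does not spell out the H-K algorithm or the thesis's modifications to it. As a rough illustration only, the sketch below assumes that "H-K" refers to the commonly described hybrid in which an agglomerative (hierarchical) pass over a sample supplies the initial centers for K-means. The input vectors stand in for per-document topic distributions such as those an LDA model would produce. The function names, the toy data, the centroid-distance merge rule, and the `sample_size` parameter are all hypothetical, and the time-scale and topic-persistence improvements and the MapReduce parallelization described in the abstract are not reproduced.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_seeds(points, k):
    """Hierarchical step (assumed): repeatedly merge the two clusters whose
    centroids are closest until k clusters remain; return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [mean(c) for c in clusters]


def kmeans(points, centers, max_iter=100):
    """K-means step: refine the seeded centers; returns (labels, centers)."""
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def hk_cluster(points, k, sample_size=50, seed=0):
    """Seed K-means with centroids from a hierarchical pass over a sample."""
    rng = random.Random(seed)
    if len(points) <= sample_size:
        sample = points
    else:
        sample = rng.sample(points, sample_size)
    return kmeans(points, hierarchical_seeds(sample, k))


# Hypothetical document-topic distributions over three topics.
docs = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
]
labels, centers = hk_cluster(docs, k=3)
print(labels)
```

Running the hierarchical pass on a sample rather than on the full data keeps its quadratic pairwise search affordable, while the K-means pass, whose assignment step is independent per document, is the part that would lend itself to a map-and-reduce split.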
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1; TP393.092
