
Research on Hot Topic Detection and Tracking Techniques for Weibo

Published: 2018-10-23 20:31
[Abstract]: Topic detection and tracking identifies the most-discussed topics in massive data and follows how those topics develop in subsequent information, helping people cope with an increasingly severe information explosion. It saves users time, keeps them up to date on how events unfold, and supplies data for public-opinion monitoring, so it has both practical value and security significance. As more and more users publish information and discuss topics on Weibo, displaying hot topics has become an important function of the platform. Because Weibo is highly immediate, breaking news spreads quickly on it, and high-impact news events draw large numbers of users who report, repost, and comment, so Weibo can often react before traditional news media. Based on these characteristics, this thesis filters out invalid posts and then designs and implements a method for detecting and tracking hot topics on Weibo. The main work is as follows:

1) The characteristics of Weibo are analyzed and invalid posts are filtered out. Weibo users form a complex population that covers a wide range and differs greatly, and the content is heterogeneous. User features, including follower count and the number of posts published per day, are analyzed to filter out advertising accounts and zombie accounts. Post content is analyzed to filter out merchant promotions, content shared by users, activities users take part in, and the many other posts that contribute nothing to a topic. Word-segmented post data is analyzed to filter out posts that contain too many or too few words, which removes meaningless very short texts and highly repetitive very long texts. Together these steps filter invalid posts effectively and reduce computational complexity.

2) A hot topic detection algorithm for Weibo based on temporal characteristics is designed and implemented. Posts are processed in chronological order. Preliminary topic detection uses an improved Single-Pass clustering algorithm, with improvements to the similarity computation and to the topic vector update, which now takes user influence into account. The FP-Growth frequent itemset mining algorithm is then used to mine frequent feature-word sets and correct the errors of the Single-Pass (SP) algorithm. Finally, an improved K-MEDOIDS algorithm clusters the frequent feature-word sets to extract the final topics. This improves both computational efficiency and the accuracy of topic detection.

3) A multi-query-vector adaptive topic tracking algorithm based on temporal characteristics is designed and implemented. Based on how post volume is distributed over time, posts are grouped by time period and processed in chronological order. The topics of each period are compared by similarity with every topic in every existing topic group. Depending on a threshold, each topic is either assigned to an existing topic group or used to create a new one, and the topic vector is adaptively updated when a topic joins a group. This tracks topic development effectively, improves accuracy, and reduces topic drift.
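The preliminary detection step in 2) can be illustrated with a minimal Single-Pass clustering sketch. It is not the thesis's improved algorithm: plain cosine similarity over raw term counts stands in for the improved similarity measure, and the influence-weighted vector update is only one plausible reading of "combining user influence". The names `cosine`, `Topic`, `single_pass`, the `influence` field, and the default threshold of 0.3 are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Topic:
    # Accumulated term weights of all posts assigned to this topic.
    vector: dict = field(default_factory=dict)
    # Ids of the posts assigned to this topic, in arrival order.
    posts: list = field(default_factory=list)


def single_pass(posts, threshold=0.3):
    """Cluster posts in one pass.

    posts: iterable of (post_id, tokens, influence), assumed to be
    already sorted by publication time. `influence` is a hypothetical
    per-user weight; the abstract does not say how it is computed.
    """
    topics = []
    for post_id, tokens, influence in posts:
        vec = Counter(tokens)
        # Find the most similar existing topic, if any.
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic.vector)
            if sim > best_sim:
                best, best_sim = topic, sim
        # Below the threshold, the post starts a new topic.
        if best is None or best_sim < threshold:
            best = Topic()
            topics.append(best)
        # Assumed update rule: add term counts scaled by user influence.
        for term, count in vec.items():
            best.vector[term] = best.vector.get(term, 0.0) + influence * count
        best.posts.append(post_id)
    return topics
```

Each post is compared only with the topics seen so far, so the result depends on arrival order. That order dependence is a known weakness of Single-Pass clustering and is consistent with the abstract's use of a later frequent-itemset step to correct its errors.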
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1; TP393.092

