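Research on Hot Topic Detection Techniques for Chinese Weibo
Topic of this thesis: Chinese Weibo + topic detection; Source: Chongqing University of Technology, 2014 master's thesis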
[Abstract]: With the rapid development of mobile Internet technology, Weibo, an emerging social networking platform, has risen quickly and become a new means of communication for a large user base. Through Weibo, users can conveniently publish opinions, exchange information, interact, and share resources. The platform's immediacy and informality let Weibo posts spread and diffuse rapidly and exert strong influence on society. Weibo text carries a large amount of valuable information, such as key points of current affairs and breaking events. Extracting and retrieving hot topics from Weibo text helps users quickly grasp real-time hot information, and is of practical importance for online public opinion monitoring and real-time information search. However, Weibo text has big-data characteristics and is difficult to identify and filter manually. Automatic detection of hot topics in Weibo text, supported by suitable information filtering methods, has therefore become an active research topic in information retrieval.
The thesis first introduces the background, research status, and related techniques of topic detection, and then analyzes the information and propagation characteristics of Chinese Weibo. To address information filtering for hot topic detection, it proposes a user role positioning method. The method computes user attention from a user's follower count and following count, computes Weibo influence from repost and comment counts, and combines user attention and Weibo influence to evaluate user influence. Positioning user roles in this way provides coarse-grained information filtering before hot topic detection.
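The abstract names the inputs to the user role positioning step (follower and following counts, repost and comment counts) but not the formulas. The following is only a minimal sketch of how such a score could be assembled: the log scaling, the follower-ratio damping, the mixing weight `alpha`, the `threshold`, and all class and function names are assumptions made for illustration, not the thesis's actual definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    reposts: int
    comments: int


@dataclass
class User:
    name: str
    followers: int
    following: int
    posts: list


def user_attention(user):
    """Assumed form: log-scaled follower count, damped by the follower ratio."""
    total = user.followers + user.following
    if total == 0:
        return 0.0
    return math.log1p(user.followers) * user.followers / total


def post_influence(user):
    """Assumed form: mean log-scaled (reposts + comments) over the user's posts."""
    if not user.posts:
        return 0.0
    scores = [math.log1p(p.reposts + p.comments) for p in user.posts]
    return sum(scores) / len(scores)


def user_influence(user, alpha=0.5):
    """Weighted combination of the two scores; alpha is a hypothetical weight."""
    return alpha * user_attention(user) + (1 - alpha) * post_influence(user)


def coarse_filter(users, threshold):
    """Keep only users whose influence reaches the (hypothetical) threshold."""
    return [u for u in users if user_influence(u) >= threshold]
```

Under these assumptions, posts from users that fall below the threshold would be dropped before clustering, which is one way to read the coarse-grained filtering described above.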
Then, an improved Single-Pass incremental clustering algorithm is used for preliminary topic detection on the Weibo data. Finally, topic heat is evaluated and ranked using influencing factors such as repost and comment counts, so as to find the hot topics within a given time period. The thesis also optimizes text preprocessing and text feature selection for Chinese Weibo topic detection, and computes feature weights with a TF-IDF function combined with semantic similarity. Based on these methods, experiments were carried out on a Sina Weibo corpus, and the results were analyzed and compared using the recall rate, miss rate, false alarm rate, and detection cost from the TDT evaluation specification as metrics. The experiments show that the proposed user role positioning method effectively divides Weibo users into categories and provides a basis for information filtering in hot topic detection. With the evaluation method based on user attention and Weibo influence, the miss rate and false alarm rate of hot topic extraction were reduced to 20.38% and 1.98% respectively, giving better efficiency and precision than traditional topic detection and demonstrating the effectiveness of the proposed method.
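For readers unfamiliar with Single-Pass incremental clustering, the sketch below shows the basic, unimproved form of the algorithm with plain TF-IDF weights and cosine similarity. It does not reproduce the thesis's improvements or its semantic-similarity weighting; the smoothed IDF formula, the `threshold` value, the function names, and the assumption that posts arrive already word-segmented as token lists are all illustrative choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed smoothing: log(n / df) + 1 keeps ubiquitous terms non-zero.
        vectors.append({t: (c / len(doc)) * (math.log(n / df[t]) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold=0.3):
    """Assign each vector, in arrival order, to the most similar existing
    cluster if the similarity reaches the threshold, else open a new one."""
    clusters = []  # each cluster: {"sum": summed vector, "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["sum"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # The summed vector points in the same direction as the centroid,
            # so it can stand in for it under cosine similarity.
            for term, weight in vec.items():
                best["sum"][term] = best["sum"].get(term, 0.0) + weight
            best["members"].append(idx)
        else:
            clusters.append({"sum": dict(vec), "members": [idx]})
    return [c["members"] for c in clusters]
```

Because each post is compared only against the clusters seen so far, the result depends on arrival order and on the threshold, which is a known property of Single-Pass methods in general.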
[Degree-granting institution]: Chongqing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092