Research on Weibo Hot Topic Discovery Based on Optimized TF-IDF and Word Co-occurrence
[Abstract]: Hot topic discovery on Weibo means mining topics from a large volume of Weibo posts and selecting the hot ones according to a topic-heat evaluation method. It helps users conveniently pick out the information they are interested in or need from a mass of content, and it is valuable in areas such as government guidance of public opinion, information security, and financial analysis. This thesis surveys the current state of Weibo hot topic discovery and identifies three problems: a high error rate in text segmentation, low accuracy in topic word extraction, and inconsistent ways of evaluating the heat of the selected topics. To address them, the thesis makes three contributions. First, it examines Chinese word segmentation and new word discovery in depth and finds that current segmentation tools produce many word fragments, especially when they split new words, so the segmented result departs significantly from the intended meaning. To reduce the segmentation error rate, the thesis proposes a new word discovery method based on rules and the N-Gram model. Word-formation rules are first used to build a fragment library; candidate strings are then extracted with Bi-Gram and Tri-Gram patterns, and candidates with high probability in both modes are selected as new words. Finally, the segmenter's output is combined with the discovered new words. Experimental results show that the algorithm effectively prevents the degradation of Weibo text segmentation caused by new words.
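The Bi-Gram and Tri-Gram candidate step can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function name `ngram_candidates`, the thresholds `min_count` and `min_prob`, and the use of conditional probability estimated from raw counts are all assumptions, and the rule-based fragment library is not modeled. The input is assumed to be posts already split into fragments by a segmenter.

```python
from collections import Counter


def ngram_candidates(token_lists, min_count=2, min_prob=0.6):
    """Return candidate new words built from adjacent segmenter fragments.

    token_lists: list of posts, each a list of fragments (assumed input).
    min_count, min_prob: hypothetical thresholds, not taken from the thesis.
    """
    uni, bi, tri = Counter(), Counter(), Counter()
    for tokens in token_lists:
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
        tri.update(zip(tokens, tokens[1:], tokens[2:]))

    candidates = {}
    # Bi-Gram mode: P(b | a) estimated as count(a, b) / count(a).
    for (a, b), n in bi.items():
        p = n / uni[a]
        if n >= min_count and p >= min_prob:
            candidates[a + b] = p
    # Tri-Gram mode: P(c | a, b) estimated as count(a, b, c) / count(a, b).
    for (a, b, c), n in tri.items():
        p = n / bi[(a, b)]
        if n >= min_count and p >= min_prob:
            candidates[a + b + c] = p
    return candidates


# Toy fragments invented for illustration only.
toy_posts = [
    ["蓝", "瘦", "香", "菇"],
    ["蓝", "瘦", "香", "菇", "了"],
    ["我", "蓝", "瘦"],
]
found = ngram_candidates(toy_posts)
```

Merging the surviving candidates back into the segmenter's vocabulary would be the remaining step; how the thesis performs that merge is not specified here.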
Second, to address the low accuracy of topic word extraction, the thesis proposes an extraction algorithm that combines an optimized TF-IDF with a word co-occurrence model, drawing on the strengths of both. The traditional TF-IDF algorithm does not reflect where a word appears. To capture word importance more faithfully, the thesis records in the data set whether each word comes from the Weibo body text, the title, or the comments, and assigns each position a different weight to optimize TF-IDF. On this basis, the word co-occurrence model is used to take the contextual and semantic relationships between words into account when extracting topic words. Experimental results show that the algorithm reduces the deviation in topic word extraction and makes the results more accurate. Third, by studying the structure of Weibo and the way topics spread, the thesis selects user features and topic word features as the factors that influence hot topics and uses them to design a topic heat formula. The heat value of each topic is computed, and hot topics are selected according to a heat threshold. Experimental results show that the hot topics obtained by this algorithm agree well with the actual situation.
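A position-weighted TF-IDF of the kind described above might look like the sketch below. The field weights in `FIELD_WEIGHTS`, the function names, and the smoothed IDF form are assumptions made for illustration; the abstract does not give the actual weights or formula. A simple same-post co-occurrence counter is included to show the kind of statistic a co-occurrence model starts from.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical field weights; the source does not state the values it uses.
FIELD_WEIGHTS = {"title": 3.0, "text": 1.0, "comment": 0.5}


def weighted_tfidf(posts, weights=FIELD_WEIGHTS):
    """Score words per post with position-weighted TF times a smoothed IDF."""
    weighted_counts = []
    doc_freq = Counter()
    for post in posts:
        counts = Counter()
        for field, tokens in post.items():
            for tok in tokens:
                # A word counts more when it appears in a heavier field.
                counts[tok] += weights.get(field, 1.0)
        weighted_counts.append(counts)
        doc_freq.update(counts.keys())

    n_posts = len(posts)
    scores = []
    for counts in weighted_counts:
        total = sum(counts.values()) or 1.0
        scores.append({
            w: (c / total) * (math.log((1 + n_posts) / (1 + doc_freq[w])) + 1.0)
            for w, c in counts.items()
        })
    return scores


def cooccurrence(posts):
    """Count how often two distinct words appear in the same post."""
    pairs = Counter()
    for post in posts:
        vocab = sorted({tok for tokens in post.values() for tok in tokens})
        pairs.update(combinations(vocab, 2))
    return pairs


# Toy posts invented for illustration only.
toy_posts = [
    {"title": ["earthquake"], "text": ["earthquake", "rescue"], "comment": ["rescue"]},
    {"title": ["movie"], "text": ["movie", "ticket"], "comment": []},
]
scores = weighted_tfidf(toy_posts)
pairs = cooccurrence(toy_posts)
```

In this toy data the title occurrence lifts "earthquake" above "rescue" in the first post even though both appear twice, which is the effect the position weights are meant to produce.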
【Degree-granting institution】: Nanchang University
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: TP391.1