Research on Weibo Spam Comment Identification Based on AdaBoost-LC
Topic: Weibo + spam comment identification. Source: Chongqing University, master's thesis, 2014
[Abstract]: With the rapid development of Web 2.0 and the Internet, social networks have grown explosively. Weibo, a major representative of social networks, has been widely adopted and is now one of the main online activities of Internet users. Weibo's convenience, speed, broad reach, efficiency, and "back-to-face" style of interaction have also attracted spammers. For various purposes, spammers post large numbers of spam comments on Weibo. The flood of such comments disrupts communication among users, can lead users to be deceived, and hinders comment-oriented data mining. Identifying and filtering spam comments is therefore of great significance.

This thesis studies spam comment identification for Weibo. The main work and results are as follows:

(1) Weibo comments are short, so features tend to be sparse after word segmentation. To address this, each comment is represented as a feature-value vector of nine feature values that describe the comment's content from several different angles. On this basis, a Weibo spam comment identification method based on AdaBoost-LC is proposed. The method takes the single-threshold binary classifier, the simplest of the linear classifiers, as its base classifier, and then uses the AdaBoost ensemble learning algorithm to boost the base classifier's accuracy.

(2) AdaBoost-LC has two shortcomings. The weights of "difficult" samples expand sharply, which causes degradation, and in spam comment identification, misclassifying a normal comment is the more costly error. To address both, an improved algorithm, AdaBoost-Ex, is proposed for identifying spam comments.

(3) Spam comments may develop new features, and a classifier's performance may decline over time until it has to be retrained. For these cases, the thesis designs a modular incremental learning model for the algorithm. The model retains the rules already learned and only needs to learn the rules of the new samples. The resulting sub-classifiers are fused into the incremental learning system by linear weighting, which gives the algorithm progressive learning ability and makes it more practical.

Finally, the proposed methods are each evaluated on a real-world dataset of comments from popular Sina Weibo posts. The experiments show that the proposed methods identify Weibo spam comments well.
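The base-classifier-plus-boosting idea in contribution (1) can be illustrated with a minimal sketch of standard discrete AdaBoost over single-threshold binary classifiers (decision stumps). This is not the thesis's AdaBoost-LC or AdaBoost-Ex implementation: the abstract does not list the nine features or the exact update rules, so the function names, the label convention (+1 for spam, -1 for normal), and the one-feature toy data are all assumptions made for illustration.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Single-threshold rule: `polarity` if the feature exceeds the threshold, else the opposite label."""
    return polarity if row[feature] > threshold else -polarity


def train_stump(X, y, w):
    """Search every feature, threshold and polarity for the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        # Candidate thresholds: one below the minimum, then midpoints between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                err = sum(
                    wi
                    for row, yi, wi in zip(X, y, w)
                    if stump_predict(row, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, rounds=10):
    """Standard discrete AdaBoost; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n  # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = train_stump(X, y, w)
        if err >= 0.5:  # no stump beats chance, stop early
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feature, threshold, polarity))
        if err <= 1e-10:  # a perfect stump leaves nothing to reweight
            break
        # Raise the weight of misclassified samples, lower the rest, then renormalise.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: one made-up feature per comment, +1 = spam, -1 = normal.
X_toy = [[float(i)] for i in range(10)]
y_toy = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(X_toy, y_toy, rounds=3)
print([adaboost_predict(model, row) for row in X_toy])
```

No single threshold separates the toy labels, but the weighted vote of a few stumps can, which is the motivation for boosting such a weak base classifier. The sample reweighting step is also where the weights of "difficult" samples grow round after round, the behaviour that contribution (2) sets out to control.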
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092; TP391.1