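Research on Spam Blog Detection and Related Techniques
Keywords: feature association tree; combined features; spam blog classification; statistical features; feature selection. Source: Liaoning Normal University, master's thesis, 2012. Document type: degree thesis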
[Abstract]: In recent years, with the development of Internet technology, blogs (weblogs) have become a highly popular new-media mechanism for social communication, giving authors and readers an interactive platform and a dynamically updated social network. Research fields such as scientific research, statistical surveys, public construction, education, and social welfare reportedly all draw on the results of blog analysis, so the huge volume of information that blogs provide is extremely valuable. Spam blogs (splogs) have become rampant alongside them. They are produced mainly by stealing other people's content or by automatic machine generation, and their aim is to raise the ranking of target sites in search engines in order to link to profit-making advertisements. The problems they cause include: 1) seriously degrading the quality of blog retrieval; 2) visibly wasting network and storage resources. To protect a healthy blogosphere, spam blogs must therefore be filtered.

First, based on an analysis of the various features of blogs, this thesis extracts two efficient features, combines them with traditional content features, and classifies blogs by feature combination. Drawing on the statistical regularities of spam blogs and the analysis of spam blog author attributes by Yuuki Sato and Takehito Utsuro, it brings out the importance of author attributes in blog classification, which shows that they have significant research value. Ordinary bloggers tend to post irregularly, whereas a spam blog, in order to raise page click rates and thereby the site's ranking in ALEXA, must publish a large number of posts in a short time, and machine generation of spam posts is very fast. Normal blogs and spam blogs therefore differ substantially in their temporal self-similarity features. The thesis uses the differences in author attributes and self-similarity features to filter blog posts in a first pass, then combines these with the extracted content features to increase the complementarity among features, which greatly improves the efficiency of spam blog filtering.

Second, the thesis designs a feature association tree classification algorithm for selecting spam blog features. The algorithm builds a feature association tree from the correlations between features and uses it to select features: it prunes irrelevant and redundant features, retains strongly and weakly relevant ones, and then applies expected cross-entropy to filter the tree a second time [2]. Compared with traditional feature selection algorithms, it can eliminate the influence of class imbalance in the blog sample data and, according to feature similarity and the magnitude of the expected cross-entropy, adaptively adjust the size of the feature association tree and reduce the feature dimensionality. Comparative experiments on spam blog filtering show that the algorithm achieves good accuracy and recall.

Both spam blog detection algorithms proposed in this thesis are binary classification algorithms for dynamic text. Building on an analysis of traditional spam blog features, they mine efficient features for detecting spam blogs and the correlations among those features, which effectively reduces the feature dimensionality and improves detection speed. Comparative experiments on classical classifiers show that the proposed algorithms classify well.
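The abstract names expected cross-entropy as the score for the second selection pass but does not define it. The following is a minimal sketch, assuming the formulation commonly used in text classification, ECE(t) = P(t) · Σ_c P(c|t) · log(P(c|t) / P(c)), with each document represented as a set of tokens. The function name, the data layout, and the toy labels in the usage note are illustrative assumptions, not the thesis's implementation, and the feature association tree itself is not reproduced here.

```python
import math
from collections import Counter

def expected_cross_entropy(docs, labels):
    """Score each feature t by P(t) * sum_c P(c|t) * log(P(c|t) / P(c)).

    docs   -- iterable of token collections, one per document (assumed layout)
    labels -- class label of each document, e.g. "spam" or "ham"
    """
    n = len(docs)
    class_counts = Counter(labels)
    feature_counts = Counter()  # number of documents containing t
    joint_counts = Counter()    # number of documents of class c containing t
    for features, label in zip(docs, labels):
        for t in set(features):
            feature_counts[t] += 1
            joint_counts[(t, label)] += 1

    scores = {}
    for t, n_t in feature_counts.items():
        p_t = n_t / n
        total = 0.0
        for c, n_c in class_counts.items():
            p_c_given_t = joint_counts[(t, c)] / n_t
            if p_c_given_t > 0:  # the term 0 * log(0) is taken as 0
                total += p_c_given_t * math.log(p_c_given_t / (n_c / n))
        scores[t] = p_t * total
    return scores
```

On a toy corpus with two spam and two normal posts, a token that appears only in spam posts scores higher than one spread evenly over both classes, so a threshold on this score would keep the former and drop the latter. How the thesis combines the score with feature similarity when resizing the tree is not stated in the abstract.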
[Degree-granting institution]: Liaoning Normal University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP393.092
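[References]
Related journal articles (top 10)
1 He Haijiang; Ling Yun. Identifying spam comments on blog posts with a vector-space correlation model [J]. Journal of Changsha University, 2008, No. 02
2 Yan Chao; Wang Yuanqing; Li Jiuxue; Zhang Zhaoyang. Theoretical derivation of AdaBoost for classification problems [J]. Journal of Southeast University (Natural Science Edition), 2011, No. 04
3 Wang Yuan; Sun Tieli; Li Yang. Feature representation and feature extraction in Web text mining [J]. Computer Knowledge and Technology, 2006, No. 14
4 Su Dan; Zhou Mingquan; Wang Xuesong; Ren Yuzhi. A text feature extraction method based on least-occurrence document frequency [J]. Computer Engineering and Applications, 2012, No. 10
5 Yan Chao; Wang Yuanqing. Research on the continuous AdaBoost algorithm [J]. Computer Science, 2010, No. 09
6 Lan Jun; Shi Huaji; Li Xingyi; Xu Min. Associative web page classification based on compound weights of feature words [J]. Computer Science, 2011, No. 03
7 Zhong Jiang; Sun Qigan; Li Jing. A text classification algorithm based on normalized vectors [J]. Computer Engineering, 2011, No. 08
8 Wang Bo; Jia Yan; Yang Shuqiang; Han Weihong. Research on feature selection in multi-class text classification [J]. Computer Engineering & Science, 2010, No. 08
9 Cui Zifeng; Xu Baowen; Zhang Weifeng; Xu Junling. An approximate Markov Blanket optimal feature selection algorithm [J]. Chinese Journal of Computers, 2007, No. 12
10 Qin Jin; Chen Xiaorong; Wang Weijia; Lu Ruzhan. Feature extraction in text classification [J]. Journal of Computer Applications, 2003, No. 02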
Article ID: 1447506
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1447506.html