Research on Spam Filtering Techniques in Adversarial Environments
Topic: spam filtering. Focus: adversarial environment. Source: South China University of Technology, 2015 master's thesis. Document type: degree thesis
[Abstract]: With the growth and spread of the Internet, email has become one of the main channels of everyday communication, making daily life and work more convenient. Large volumes of spam have appeared alongside it, however, disrupting normal communication and causing substantial economic losses, so curbing the spread of spam has become an increasingly pressing problem. Thanks to rapid advances in artificial intelligence, many machine learning methods have been applied to spam filtering with good results. In an adversarial environment, though, spammers try to exploit the weaknesses of machine learning algorithms, disguising spam in various ways to degrade the performance of the email classifier. The study of classification in such adversarial environments is called adversarial learning. The evasion attack is one that spammers use frequently: by inserting good words and deleting bad words, it reduces a message's spam characteristics while preserving its original meaning, which lets the message evade the filter and lowers the classification performance of the filtering system.

This thesis systematically analyzes the emergence and recent development of spam and summarizes current research on spam filtering in adversarial environments. The traditional TFIDF method uses term frequency to represent feature weights; under a good word attack, the weights of bad words drop sharply, which degrades the classifier. The thesis therefore proposes an improved feature representation, SRTFIDF, to reduce the effect of good word attacks on feature weights. Experimental results show that the improved representation is more robust than traditional TFIDF under good word attacks.

Compared with a single-classifier system, a multiple classifier system can improve precision and robustness, but research shows that traditional multiple classifier systems perform poorly against evasion attacks. The thesis therefore proposes a segmented multiple-classifier spam filtering method based on multiple-instance learning to counter evasion attacks. The feature space is split evenly into two instances, and several sub-classifiers are trained for each instance to improve robustness. The proposed methods are validated on the CEAS2008 English corpus. The final experimental results show that, under both good word attacks and evasion attacks, the segmented multiple classifier system achieves better precision and robustness than the traditional multiple classifier system.
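The abstract's point that a good word attack lowers the weight of bad words under TFIDF can be illustrated with a small sketch. It assumes a length-normalized term frequency and a smoothed inverse document frequency, which is one common TFIDF variant and not necessarily the one used in the thesis; the toy corpus, the word lists, and the `tfidf` helper are all hypothetical. The thesis's SRTFIDF method is not reproduced here, since the abstract does not define it.

```python
import math
from collections import Counter


def tfidf(doc_tokens, corpus):
    """Return {term: weight} using length-normalized TF and smoothed IDF.

    Assumption: tf = count / document length, idf = ln((1 + N) / (1 + df)) + 1.
    This is a common textbook variant, chosen only for illustration.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = tf * idf
    return weights


# Hypothetical reference corpus used only to compute document frequencies.
corpus = [
    ["cheap", "pills", "buy", "now"],
    ["meeting", "agenda", "project", "report"],
    ["project", "deadline", "report", "team"],
]

spam = ["cheap", "pills", "buy", "now"]
good_words = ["meeting", "project", "report", "team"]

# Good word attack: pad the spam message with words typical of legitimate mail.
attacked = spam + good_words

before = tfidf(spam, corpus)
after = tfidf(attacked, corpus)

print("weight of 'pills' before attack:", before["pills"])
print("weight of 'pills' after attack: ", after["pills"])
```

In this sketch the message doubles in length, so the normalized term frequency of each bad word, and with it the TFIDF weight, is halved even though no bad word was removed. That dilution is the weakness the abstract attributes to the traditional representation.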
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.098
Article ID: 1597244
Article link: http://sikaile.net/guanlilunwen/ydhl/1597244.html