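Application of an Improved KNN Algorithm to Spam Filtering
Published: 2018-02-09 05:53
Keywords: spam; KNN algorithm; asymmetric-cost property; class center vector. Source: Hunan University, 2010 master's thesis. Document type: degree thesis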
[Abstract]: With the widespread adoption of the Internet, email has become the most convenient and economical means of communication in daily life. Along with that convenience, however, comes an unavoidable by-product: spam. Because sending email is simple and profitable, some companies and individuals use it as the cheapest channel for commercial advertising, and some attackers use it for illegal purposes such as stealing users' confidential data or attacking their computers. Email users receive dozens or even hundreds of spam messages almost every day and must spend time and effort identifying and deleting them. Spam troubles not only email users but also network service providers and network administrators. It consumes bandwidth, time, and storage, and in severe cases it can congest network traffic so that legitimate mail cannot be sent or received, which seriously hinders the healthy development of the Internet. Research on spam filtering therefore has considerable practical value and addresses a pressing problem.
This thesis analyzes the main characteristics of current spam and the state of spam filtering technology, and discusses the underlying theory, strengths, and weaknesses of the various anti-spam techniques. It then studies the KNN algorithm, which currently performs relatively well, and identifies three shortcomings. First, traditional KNN considers only the sum of similarities, or simply counts the similar samples, when making a decision. Second, when applied to spam filtering it ignores the asymmetric-cost property of the task: users would rather receive one more spam message than have the filter misclassify a legitimate message as spam and discard it. Third, classifying each test sample requires computing its similarity to every sample in the training set, which is too expensive for spam filtering systems with strict real-time requirements.
To address these shortcomings, the thesis proposes and designs a KNN algorithm based on average similarity and the number of similar samples that also accounts for the asymmetric-cost property. First, the class weight is represented by the average similarity rather than the sum of similarities, while the number of similar samples in each class is also taken into account. Second, two parameters are introduced to represent the asymmetric-cost property. Finally, to reduce the computational cost, the thesis borrows the idea of the class center vector method: the original samples are divided into small subclasses, and the center vector of each subclass is computed to stand in for the original training samples when building the classification model. This effectively turns a large sample set into a small one, reducing the number of comparisons and greatly lowering the computational cost of KNN classification. Experiments show that, compared with traditional KNN, the proposed APC-KNN algorithm achieves high accuracy and a low false-positive rate in spam filtering, filters spam more effectively, and helps protect email users and save bandwidth.
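The three ideas in the abstract (class-center reduction, a class weight built from average similarity plus neighbor count, and two cost parameters) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything concrete here is an assumption: the function names `class_centers` and `classify`, the fixed-size chunking used to form subclasses, the linear combination controlled by `count_weight`, and the multiplicative `spam_factor` and `ham_factor` parameters are all hypothetical and are not the APC-KNN definitions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centers(samples, labels, group_size):
    """Replace training samples with the mean vector of each small subclass.

    Assumption: subclasses are formed by chunking each class's samples in
    input order; the abstract does not say how the thesis forms them.
    """
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    centers, center_labels = [], []
    for y, xs in by_label.items():
        for i in range(0, len(xs), group_size):
            chunk = xs[i:i + group_size]
            centers.append([sum(col) / len(chunk) for col in zip(*chunk)])
            center_labels.append(y)
    return centers, center_labels


def classify(x, centers, center_labels, k=3, count_weight=0.5,
             spam_factor=1.0, ham_factor=1.0):
    """Score each class among the k most similar centers and pick the best.

    Assumed score: factor * (average similarity + count_weight * share of
    the k neighbors). Raising ham_factor above spam_factor makes the filter
    more reluctant to label a message as spam.
    """
    ranked = sorted(
        ((cosine(x, c), y) for c, y in zip(centers, center_labels)),
        key=lambda pair: pair[0], reverse=True)[:k]
    total, count = defaultdict(float), defaultdict(int)
    for sim, y in ranked:
        total[y] += sim
        count[y] += 1
    factor = {"spam": spam_factor, "ham": ham_factor}
    scores = {
        y: factor.get(y, 1.0)
        * (total[y] / count[y] + count_weight * count[y] / len(ranked))
        for y in total
    }
    return max(scores, key=scores.get)
```

In this sketch, a message is compared only against the subclass centers, so with `group_size=2` the number of similarity computations per message is roughly halved. A borderline message that scores slightly higher for spam under equal factors can be kept as legitimate mail by setting `ham_factor` above 1, which mirrors the stated preference for letting a spam message through over discarding a legitimate one.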
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree granted]: 2010
[Classification number]: TP393.098
Article ID: 1497230
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1497230.html