Research on Content-Based Spam Filtering Technology
Published: 2018-03-20 16:31
Topic: spam. Entry point: email filtering. Source: Southwest Jiaotong University, master's thesis, 2009. Document type: degree thesis
[Abstract]: With the rapid development of computer networks, email has become an indispensable means of communication in daily life. At the same time, large volumes of spam have emerged, carrying reactionary, fraudulent, promotional, and illegal-sales content. This spam seriously disrupts normal communication and poses a potential hazard to society. Recent surveys show that text remains the main form in which spam is spread, so content-based spam filtering has remained the main research direction in anti-spam work.

Content-based spam filtering consists mainly of four parts: word segmentation, text representation, feature selection, and classification. Researchers have done extensive work in all four areas and achieved many results. This thesis analyzes the principles of these four parts, focuses on feature selection algorithms, and improves the mutual information feature selection algorithm according to the characteristics of spam filtering.

The thesis briefly reviews the development, application, and current state of content-based spam filtering and describes the algorithmic principles of each stage. For word segmentation, based on an analysis of spam content, a preprocessing step is added to the traditional segmentation algorithm and a new segmentation workflow is given. For feature selection, the thesis concentrates on the application of the mutual information algorithm to spam filtering: it analyzes and improves the traditional mutual information algorithm in terms of frequency, dispersion, and concentration, adds a term-frequency factor, uses a category contribution ratio to measure how differently a feature contributes to each category, and runs simulation experiments in MATLAB on a real email corpus. For text classification, it analyzes the application of Bayesian classification to spam filtering and uses the naive Bayes classifier in the Weka environment for email classification experiments.

The experiments with the improved algorithm are compared with the traditional mutual information algorithm and with experiments reported in other literature. The results show that, at similar dimensionality compression ratios, the improved mutual information algorithm significantly improves precision and recall for spam, providing a better basis for the subsequent email classification stage.
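For orientation, the traditional mutual information criterion that the thesis takes as its baseline can be sketched as follows. This is a minimal illustration of the standard document-count estimate MI(t, c) = log(P(t, c) / (P(t) P(c))), not the thesis's improved algorithm: the abstract gives no formulas for the term-frequency factor or the category contribution ratio, so neither is reproduced here. The function names, the `"spam"` label, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information_scores(docs, labels, target="spam"):
    """Score terms by pointwise mutual information with the target class.

    docs:   list of token lists (already segmented into words)
    labels: parallel list of class names
    Terms that never occur in the target class are skipped, because
    their estimate would be log(0).
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # documents containing the term, any class
    df_target = Counter()  # documents containing the term, target class only
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_all[term] += 1
            if y == target:
                df_target[term] += 1

    scores = {}
    for term, a in df_target.items():
        # log( P(t, c) / (P(t) * P(c)) ) estimated from document counts
        scores[term] = math.log((a * n_docs) / (df_all[term] * n_target))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

Because this estimate depends only on whether a term occurs in a document, not how often, it tends to favor rare terms that happen to appear only in one class. That is consistent with the motivation the abstract gives for adding a term-frequency factor, although the exact form of that correction is not stated there.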
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2009
[Classification number]: TP393.098
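[Citing literature]
Related journal articles (top 1)
1 Wang Yuan; Gong Shangfu; Research on a mutual information text feature selection algorithm based on secondary TF*IDF [J]; Computer Applications and Software; 2011, issue 04
Related master's theses (top 4)
1 Xu Liping; Research on Chinese spam filtering technology based on content mining [D]; Dongbei University of Finance and Economics; 2010
2 Song Xingzu; Implementation of an improved TF-IDF algorithm and its application in spam identification [D]; Jilin University; 2012
3 Liang Ting; Research on content-based spam filtering technology [D]; East China Normal University; 2013
4 Zhu Bingyang; Research on SVM spam filtering with particle swarm optimization [D]; Zhengzhou University; 2013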
Article ID: 1639909
Article link: http://sikaile.net/guanlilunwen/ydhl/1639909.html