A Spam E-mail Identification Method Based on Rough Set Theory

Published: 2018-07-22 21:04

[Abstract]: E-mail makes communication between people more convenient, but it also brings problems. Some profit-seeking businesses send large volumes of advertising mail to mail users over the Internet, and some lawbreakers use e-mail to spread illegal, reactionary, and fraudulent spam. This not only congests mail servers but also causes real harm to society.
The mainstream anti-spam technology today is content-based mail identification. This technology requires a large number of matching operations and places a heavy load on CPU and memory. Moreover, because spammers keep changing how they disguise the content of their messages, the effectiveness of content-based spam identification gradually declines over time. This thesis therefore shifts the focus of study to the mail header. Header field features are rather fuzzy: mail of different classes may share the same header features, so the data are uncertain and inconsistent. In addition, not every message contains the fields that the defined attributes refer to, so some attribute values are missing. For these reasons, a spam identification method based on the theory of incomplete information systems in rough sets is proposed.
First, features are extracted from a training set of already classified mail. Because the e-mail header is semi-structured text, this thesis selects 9 header fields that reflect the characteristics of a message and defines 24 feature attributes: 23 condition attributes and 1 decision attribute. All condition attribute values are discrete, and the decision attribute value is assigned according to the class of the sample itself. Extracting features from the training mail according to the defined attributes yields a data table. Because some attribute values in this table cannot be obtained, rough set theory calls it an incomplete information system. Then, in the feature selection stage, rough set techniques for incomplete systems are used for discretization and knowledge reduction, finally producing a decision table that can be used for classification. Each row of the decision table is a rule. A sample to be identified is string-matched against the antecedents of the rules in the decision table; when a matching rule is found, its consequent is the final class of the message. Finally, the experiments designed in this thesis compute and compare the recall and accuracy of mail identification. As a way of handling incomplete systems, the extension of the equivalence relation used here is more effective than traditional methods that fill in the missing values. Compared with other header-based identification methods, namely the SVM, decision tree, Bayesian, and traditional rough set algorithms, the proposed algorithm achieves higher recall and accuracy. The main contents of this thesis are as follows:
(1) Defining the attributes used for feature extraction.
The header of an e-mail consists of a number of header fields. Analysis of the headers of a large number of spam and legitimate messages identified 9 frequently occurring header fields: From, Sender, Reply-to, To, Delivered-To, Return-Path, Received, Message-ID, and Date. By analyzing the relationships among these fields, 24 attributes are defined: 23 condition attributes and 1 decision attribute.
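The abstract does not list the 24 attributes, so the following is only a minimal sketch of how header fields might be turned into discrete attribute values with missing entries. It uses Python's standard `email` package; the function names and the single attribute `from_matches_return_path` are hypothetical illustrations, not attributes taken from the thesis.

```python
from email import message_from_string
from email.utils import parseaddr

# The nine header fields named in the abstract.
HEADER_FIELDS = ["From", "Sender", "Reply-to", "To", "Delivered-To",
                 "Return-Path", "Received", "Message-ID", "Date"]

MISSING = None  # marks an attribute value that cannot be obtained


def extract_fields(raw_message: str) -> dict:
    """Return the nine header fields, with MISSING for absent ones."""
    msg = message_from_string(raw_message)
    # Header lookup in the email package is case-insensitive.
    return {name: msg.get(name, MISSING) for name in HEADER_FIELDS}


def domain_of(address_header):
    """Lower-cased domain of the address in a header value, or MISSING."""
    if address_header is MISSING:
        return MISSING
    _, addr = parseaddr(address_header)
    if "@" not in addr:
        return MISSING
    return addr.rpartition("@")[2].lower()


def from_matches_return_path(fields: dict):
    """Hypothetical condition attribute: 1 if From and Return-Path share
    a domain, 0 if they differ, MISSING if either field is unusable."""
    a = domain_of(fields["From"])
    b = domain_of(fields["Return-Path"])
    if a is MISSING or b is MISSING:
        return MISSING
    return int(a == b)


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.org\n"
    "Return-Path: <bounce@mailer.example.net>\n"
    "Date: Sun, 22 Jul 2018 21:04:00 +0800\n"
    "\n"
    "body\n"
)
fields = extract_fields(raw)
print(fields["Sender"])                  # field absent from this message
print(from_matches_return_path(fields))  # the two domains differ
```

Keeping absent fields as an explicit `MISSING` marker, rather than substituting a default, is what makes the resulting table an incomplete information system in the sense used above.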
(2) Improving the asymmetric similarity relation for incomplete systems.
Feature extraction with the attributes defined in this thesis yields an incomplete information system, because some header fields are missing. Although some attribute values are currently absent, whether two samples belong to the same class can still be judged from whether their other attribute values agree. The equivalence relation of complete information systems is therefore extended: building on the existing asymmetric similarity relation, this thesis proposes an improved asymmetric similarity relation, which replaces the equivalence relation of complete systems when partitioning the samples.
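The abstract does not spell out the improved relation. As background, the sketch below shows the classical non-symmetric similarity relation for incomplete information systems, in which x is similar to y when every attribute value known for x also appears in y. This is the baseline relation, not the thesis's improvement, and the table and attribute names `a1` to `a3` are made up for illustration.

```python
MISSING = None  # an unknown attribute value (often written "*")


def similar(x: dict, y: dict, attributes) -> bool:
    """Classical non-symmetric similarity: x is similar to y when every
    attribute value known for x is matched by the same value in y."""
    return all(x[a] is MISSING or x[a] == y[a] for a in attributes)


def similar_to(x_name, table, attributes):
    """Names of all objects y such that x is similar to y."""
    return {name for name, row in table.items()
            if similar(table[x_name], row, attributes)}


# Toy incomplete information table with hypothetical attributes.
table = {
    "m1": {"a1": 1, "a2": MISSING, "a3": 0},
    "m2": {"a1": 1, "a2": 1, "a3": 0},
    "m3": {"a1": 0, "a2": 1, "a3": 0},
}
attrs = ["a1", "a2", "a3"]

print(similar(table["m1"], table["m2"], attrs))  # m1's known values agree with m2
print(similar(table["m2"], table["m1"], attrs))  # m2 knows a2, m1 does not
print(sorted(similar_to("m1", table, attrs)))
```

The two calls to `similar` give different answers for the same pair, which is exactly why the relation is called asymmetric: a sample with fewer known values can be similar to a more complete one, but not the other way round.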
(3) Feature selection and sample identification using the theory of incomplete information systems in rough sets.
First, a discretization algorithm based on attribute importance is applied to the decision table. Throughout discretization this algorithm preserves the classification ability of the decision table; the discretized table has fewer distinct attribute values, which effectively raises the recognition rate. Next, because the attributes retained by reduction should have as few missing values as possible, the concept of lower approximation complete importance is defined, and an attribute reduction algorithm based on it is proposed. During value reduction, the improved asymmetric similarity relation is used to simplify the rules: rules that satisfy the relation can be merged, and a confidence is computed for each rule. Finally, sample identification also matches rules according to the improved asymmetric similarity relation. If several rules match, the rule with the higher confidence is chosen; if no rule matches, the sample is added to the set of unidentified samples.
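The identification step described above can be sketched roughly as follows. The rules, attribute names, and confidence values are invented for illustration, and `matches` is a simple stand-in (a rule value left as `MISSING` is ignored) rather than the improved relation defined in the thesis.

```python
from typing import Optional

MISSING = None  # attribute value left unspecified in a rule

# Each rule: (antecedent, consequent class, confidence). Values are made up.
RULES = [
    ({"a1": 1, "a2": MISSING}, "spam", 0.90),
    ({"a1": 1, "a2": 0}, "ham", 0.75),
    ({"a1": 0, "a2": 1}, "ham", 0.80),
]


def matches(antecedent: dict, sample: dict) -> bool:
    """Stand-in matching test: every value the rule specifies must appear
    unchanged in the sample; MISSING values in the rule are ignored."""
    return all(v is MISSING or sample.get(a) == v
               for a, v in antecedent.items())


def classify(sample: dict, rules=RULES) -> Optional[str]:
    """Consequent of the highest-confidence matching rule, or None."""
    hits = [(conf, label) for ante, label, conf in rules
            if matches(ante, sample)]
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[0])[1]


unidentified = []  # samples that no rule covers
for sample in ({"a1": 1, "a2": 0}, {"a1": 0, "a2": 0}):
    label = classify(sample)
    if label is None:
        unidentified.append(sample)
    print(sample, "->", label)
```

In this toy run the first sample matches two rules and takes the class of the more confident one, while the second matches none and ends up in `unidentified`, mirroring the two cases the paragraph describes.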
(4) Experimental evaluation of the proposed algorithm.
First, experiments are run with training sets of different sizes, giving the attribute reduction results together with recall, accuracy, and recognition rate. The results show that the proposed algorithm is stable and identifies spam effectively. Second, the algorithm is compared with the SVM, Bayesian, decision tree, and traditional rough set algorithms; its recall and accuracy reach 87.10% and 89.01%, better than the other algorithms.
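The abstract does not define its metrics, so the sketch below only shows the usual definitions under stated assumptions: spam is the positive class, unidentified samples are excluded from recall, precision, and accuracy, and the recognition rate is taken to be the share of samples that matched some rule. Both precision and accuracy are computed because the abstract does not say which of the two its second figure denotes. The labels and counts are toy data, not results from the thesis.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def evaluate(pairs):
    """pairs: (true_label, predicted_label) with labels 'spam' or 'ham';
    predicted_label is None when no rule matched the sample."""
    decided = [(t, p) for t, p in pairs if p is not None]
    tp = sum(t == "spam" and p == "spam" for t, p in decided)
    fp = sum(t == "ham" and p == "spam" for t, p in decided)
    fn = sum(t == "spam" and p == "ham" for t, p in decided)
    tn = sum(t == "ham" and p == "ham" for t, p in decided)
    return {
        "recall": ratio(tp, tp + fn),          # spam that was caught
        "precision": ratio(tp, tp + fp),       # flagged mail that is spam
        "accuracy": ratio(tp + tn, len(decided)),
        "recognition_rate": ratio(len(decided), len(pairs)),
    }


# Toy data: 10 samples, one of which no rule could identify.
pairs = ([("spam", "spam")] * 3 + [("spam", "ham")]
         + [("ham", "ham")] * 4 + [("ham", "spam")]
         + [("spam", None)])
results = evaluate(pairs)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```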
In summary, this thesis is the first to apply the methods for handling incomplete information systems in rough set theory to header-based spam identification. It uses discretization, knowledge reduction, and identification methods for incomplete information systems to obtain the decision table and identify mail. Two experiments verify the effectiveness of the proposed method: in terms of both recall and accuracy it achieves satisfactory results, laying a foundation for further research on spam filtering.
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.098

