Spam Identification Method Based on Rough Set Theory
[Abstract]: E-mail makes communication between people more convenient, but it also brings problems. Some profit-seeking businesses send large volumes of mail to Internet mail users, and some senders use e-mail to spread illegal, reactionary, and fraudulent information. This not only congests mail servers; the illegal, reactionary, and fraudulent content also causes harm to society.
At present, mainstream anti-spam technology is based on recognizing mail content. This approach requires a large number of matching operations and has high CPU and memory usage. In addition, because spammers keep changing the way they disguise the content they send, the efficiency of content-based spam recognition gradually declines over time. This study therefore shifts its focus to the mail message header. The field features of the message header are fuzzy: different types of mail may share the same features, so the data are uncertain and inconsistent. Moreover, not every message contains all the fields that the defined attributes refer to, so some attribute values are missing. For these reasons, a spam identification method based on the theory of incomplete information systems in rough sets is proposed.
First, features are extracted from a training set of mail that has already been classified. Since the e-mail message header is semi-structured text, this paper selects 9 header fields that reflect the characteristics of a message and defines 24 feature attributes of its own: 23 condition attributes and 1 decision attribute. The attribute values are all discrete, and the decision attribute value is assigned according to the category of the sample itself. Extracting the defined feature attributes from the training set yields a data table. Because some attribute values in the data table cannot be obtained, the table is called an incomplete information system in rough set theory. Rough set theory is then applied in the feature selection stage: using the theory of incomplete information systems, the table is discretized and reduced, which gives a decision table. Each row of the decision table is a rule. A sample to be identified is matched against the rules in the decision table; when a matching rule is found, the consequent of that rule gives the category of the mail. Finally, the experiments designed in this paper compute and compare the recall and accuracy of mail recognition. Among the methods for handling incomplete systems, the extended equivalence relation used in this paper is more effective than the traditional method. Compared with other header-based recognition methods (the SVM algorithm, the decision tree algorithm, the Bayes algorithm, and the traditional rough set algorithm), the proposed method achieves higher recall and accuracy. The main contents of this paper are as follows:
(1) Define the attributes used for feature extraction.
An e-mail header is made up of several header fields. By analyzing a large number of spam and normal messages, 9 header fields that occur with high probability are identified: From, Sender, Reply-to, To, Delivered-To, Return-Path, Received, Message-ID, and Date. Analysis of these fields yields 24 attributes, comprising 23 condition attributes and 1 decision attribute.
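As an illustration of the first step, the nine header fields can be read from a raw message with Python's standard `email` package. This is only a minimal sketch: the abstract does not list the 24 attributes, so the code stops at collecting the raw field values, and the names `HEADER_FIELDS`, `MISSING`, and `extract_header_fields` are hypothetical. A field that is absent from the header is recorded as a missing value, which is what makes the resulting table incomplete.

```python
from email import message_from_string

# The nine header fields named in the abstract.
HEADER_FIELDS = ["From", "Sender", "Reply-to", "To", "Delivered-To",
                 "Return-Path", "Received", "Message-ID", "Date"]

MISSING = None  # hypothetical marker for a value that cannot be obtained


def extract_header_fields(raw_message):
    """Map each of the nine fields to its list of values, or MISSING if absent."""
    msg = message_from_string(raw_message)
    row = {}
    for name in HEADER_FIELDS:
        values = msg.get_all(name)  # header lookup is case-insensitive
        row[name] = values if values else MISSING
    return row


sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Mon, 02 Jan 2012 10:00:00 +0800\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "hello\n"
)
row = extract_header_fields(sample)
```

Here `row["Sender"]` is `MISSING` because the sample message has no Sender field; the attribute definitions of the thesis would then be computed from such rows.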
(2) Improve the asymmetric similarity relation for incomplete systems.
Because some header fields are missing, the information system obtained by extracting the attributes defined in this paper is an incomplete information system. Although some attribute values are currently unavailable, whether two samples belong to the same category can still be judged from the values they share on the other attributes. This paper therefore extends the equivalence relation of the complete information system: building on the original asymmetric similarity relation, it proposes an improved asymmetric similarity relation, which replaces the equivalence relation of the complete system when classifying the samples.
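The abstract does not define the improved relation, so the sketch below shows only the classical asymmetric (non-symmetric) similarity relation for incomplete information systems that it starts from: x is similar to y if every attribute value that is known for x is equal to the value of y. The function names and the `MISSING` marker are hypothetical.

```python
MISSING = None  # hypothetical marker for an unknown attribute value


def similar(x, y, attributes):
    """True if x is similar to y: each known value of x equals y's value."""
    return all(x[a] is MISSING or x[a] == y[a] for a in attributes)


def similarity_class(x, universe, attributes):
    """All objects in the universe that x is similar to."""
    return [y for y in universe if similar(x, y, attributes)]


attrs = ["a", "b"]
x = {"a": 1, "b": MISSING}
y = {"a": 1, "b": 0}
z = {"a": 2, "b": 0}

x_to_y = similar(x, y, attrs)  # x has no known value that contradicts y
y_to_x = similar(y, x, attrs)  # y knows b = 0, but b is unknown for x
```

The two results differ (`x_to_y` is true, `y_to_x` is false), which is the asymmetry that distinguishes this relation from an equivalence relation.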
(3) Feature selection and sample recognition based on the theory of incomplete information systems in rough sets.
A discretization algorithm based on attribute importance is used to discretize the decision table; throughout, this algorithm keeps the classification ability of the decision table unchanged. After discretization the attributes take fewer distinct values, which effectively increases the recognition rate. The resulting attribute values contain a small number of missing values. The concept of lower-approximation complete importance is defined, and an attribute reduction algorithm based on it is proposed. In value reduction, the improved asymmetric similarity relation is used to simplify the rules: rules that satisfy the improved asymmetric similarity relation are merged, and the credibility of each rule is calculated. Finally, when a sample is identified, rules are matched according to the improved asymmetric similarity relation. If several rules match, the rule with the highest credibility is selected. If no rule matches, the sample is added to the set of unidentified samples.
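The final matching step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the thesis: plain equality on the conditions a rule retains stands in for the improved asymmetric similarity relation, credibility values are taken as given, and the attribute names and the `classify` function are hypothetical.

```python
def classify(sample, rules):
    """Return the decision of the most credible matching rule, or None.

    Assumption: a rule matches when every condition it retains equals the
    sample's value; the thesis uses its improved similarity relation here.
    """
    matches = [
        rule for rule in rules
        if all(sample.get(attr) == value
               for attr, value in rule["conditions"].items())
    ]
    if not matches:
        return None  # caller adds the sample to the unidentified set
    return max(matches, key=lambda rule: rule["credibility"])["decision"]


# Hypothetical rules, as they might look after attribute and value reduction.
rules = [
    {"conditions": {"has_message_id": 0},
     "decision": "spam", "credibility": 0.9},
    {"conditions": {"has_message_id": 0, "from_equals_return_path": 1},
     "decision": "normal", "credibility": 0.6},
]

unidentified = []
results = []
for sample in [
    {"has_message_id": 0, "from_equals_return_path": 1},  # both rules match
    {"has_message_id": 1, "from_equals_return_path": 1},  # no rule matches
]:
    decision = classify(sample, rules)
    if decision is None:
        unidentified.append(sample)
    results.append(decision)
```

For the first sample both rules match and the rule with credibility 0.9 decides; the second sample matches no rule and ends up in `unidentified`.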
(4) Evaluate the proposed algorithm experimentally.
First, training sets of different sizes are used to obtain the attribute reduction results, recall, accuracy, and recognition rate. The experimental results show that the proposed algorithm is stable and identifies spam effectively. Second, the algorithm is compared with the SVM algorithm, the Bayes algorithm, the decision tree algorithm, and the traditional rough set algorithm. Its recall and accuracy reach 87.10% and 89.01%, better than those of the other algorithms.
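The abstract does not state how its metrics are computed. A minimal sketch of the usual definitions in spam filtering is given below, assuming that "accuracy rate" denotes precision (the share of mail flagged as spam that really is spam), which the source does not confirm. The labels and data are made up for illustration, and an unidentified sample (`None`) is counted as not flagged.

```python
def recall_and_precision(y_true, y_pred, positive="spam"):
    """Recall = TP / (TP + FN); precision = TP / (TP + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Illustrative labels only; None marks a sample left unidentified.
y_true = ["spam", "spam", "spam", "spam", "normal", "normal"]
y_pred = ["spam", "spam", None, None, "spam", "normal"]

recall, precision = recall_and_precision(y_true, y_pred)
```

With these made-up labels, 2 of the 4 spam messages are caught and 2 of the 3 flagged messages are spam.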
In summary, this paper applies the processing methods for incomplete information systems in rough set theory to spam recognition for the first time, using discretization, knowledge reduction, and recognition methods for incomplete information systems to obtain the decision table and identify mail. Two experiments verify the effectiveness of the proposed method. The experimental results are satisfactory in terms of both recall and accuracy, and they lay a foundation for further research on spam filtering.
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.098