A Spam Identification Method Based on Rough Set Theory

Published: 2018-07-22 21:04
[Abstract]: E-mail makes communication between people more convenient, but it also brings problems. Some profit-seeking businesses send large volumes of advertising mail to e-mail users over the Internet, and some lawbreakers use e-mail to spread illegal, reactionary, and fraudulent junk information. This not only congests mail servers but also does a certain amount of harm to society.
The mainstream anti-spam technology today is content-based mail recognition. This approach requires a large number of matching operations and makes very heavy use of CPU and memory. Moreover, because spammers keep changing how they disguise the content of the spam they send, the effectiveness of content-based spam recognition gradually declines over time. This thesis therefore shifts the focus of study to the mail header. The field features of mail headers are rather fuzzy: mail of different classes may share the same header features, so the data is uncertain and inconsistent. In addition, not every message contains the fields that the defined attributes refer to, so some attribute values are missing. For these reasons, the thesis proposes a spam identification method based on the theory of incomplete information systems in rough sets.
First, features are extracted from a training set of mail that has already been classified. Because the e-mail header is semi-structured text, this thesis selects 9 header fields that reflect the characteristics of a message and defines 24 feature attributes: 23 condition attributes and 1 decision attribute. The values of the condition attributes are all discrete, and the value of the decision attribute is assigned according to the class of the sample itself. Extracting features from the training mail according to the defined attributes yields a data table. Because some attribute values in this table cannot be obtained, rough set theory calls it an incomplete information system. Next, in the feature selection stage, rough set techniques for incomplete systems are used for discretization and knowledge reduction, finally producing a decision table that can be used for classification. Each row of the decision table is a rule. A sample to be identified is character-matched against the antecedents of the rules in the decision table; when a matching rule is found, the consequent of that rule is the final class of the message. Finally, the recall and accuracy of mail identification are computed and compared in the experiments designed in this thesis. As a way of handling incomplete systems, the extension of the equivalence relation used here is more effective than the traditional approach of filling in missing values. Compared with other header-based identification methods, namely the SVM algorithm, the decision tree algorithm, the Bayesian algorithm, and the traditional rough set algorithm, the proposed algorithm achieves higher recall and accuracy. The main contents of this thesis are as follows:
(1) Define the attributes used for feature extraction.
An e-mail header is made up of a number of header fields. By analyzing the headers of a large number of spam and legitimate messages, 9 header fields that occur with high probability were identified: From, Sender, Reply-to, To, Delivered-To, Return-Path, Received, Message-ID, and Date. By analyzing the relationships among these fields, 24 attributes were defined, comprising 23 condition attributes and 1 decision attribute.
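As a rough illustration of header-based feature extraction, the sketch below parses a raw message with Python's standard `email` package and derives a few discrete attributes from the header fields named above. The abstract does not list the thesis's 23 condition attributes, so the four attributes here (`from_matches_reply_to`, `from_matches_return_path`, `received_hops`, `has_message_id`) and the `*` marker for an unobtainable value are hypothetical stand-ins, not the author's definitions.

```python
from email import message_from_string
from email.utils import parseaddr

MISSING = "*"  # assumed marker for an attribute value that cannot be obtained


def domain_of(header_value):
    """Return the lower-cased domain of an address header, or MISSING."""
    if header_value is None:
        return MISSING
    _, addr = parseaddr(header_value)
    return addr.rpartition("@")[2].lower() if "@" in addr else MISSING


def same_domain(a, b):
    """1 if both domains are known and equal, 0 if known and different,
    MISSING if either one is unknown."""
    if MISSING in (a, b):
        return MISSING
    return 1 if a == b else 0


def extract_features(raw_message):
    """Map a raw message to a dict of discrete attribute values.

    All attribute names are hypothetical examples of relationships
    between header fields.
    """
    msg = message_from_string(raw_message)
    from_dom = domain_of(msg.get("From"))
    return {
        "from_matches_reply_to": same_domain(from_dom, domain_of(msg.get("Reply-To"))),
        "from_matches_return_path": same_domain(from_dom, domain_of(msg.get("Return-Path"))),
        "received_hops": len(msg.get_all("Received", [])),
        "has_message_id": 1 if msg.get("Message-ID") else 0,
    }


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.org\n"
    "Return-Path: <bounce@mailer.example.net>\n"
    "Received: from a.example.com by b.example.org\n"
    "Received: from c.example.com by a.example.com\n"
    "Date: Mon, 02 Jan 2012 10:00:00 +0800\n"
    "\n"
    "Hello\n"
)
features = extract_features(raw)
print(features)
```

Because the sample message has no Reply-To field, `from_matches_reply_to` comes out as `*`. Rows with such values are what make the resulting data table an incomplete information system.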
(2) Improve the asymmetric similarity relation for incomplete systems.
Because some header fields are missing, the information system obtained by extracting features with the attributes defined in this thesis is incomplete. Although some attribute values are currently absent, whether two samples belong to the same class can still be judged from whether their other attribute values are the same. The equivalence relation of complete information systems is therefore extended: building on the original asymmetric similarity relation, this thesis proposes an improved asymmetric similarity relation, which replaces the equivalence relation of complete systems in partitioning the samples.
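The abstract does not spell out the improved relation, so the sketch below shows only the classical non-symmetric similarity relation as it is usually stated in the rough set literature: x is similar to y when every attribute value that is known for x is equal to y's value for that attribute. The `*` marker, the attribute names, and the three sample rows are assumptions made for illustration.

```python
MISSING = "*"  # assumed marker for an unknown attribute value


def is_similar(x, y, attributes):
    """True if x is similar to y under the classical non-symmetric relation:
    wherever x has a known value, y must carry the same value."""
    return all(x[a] == MISSING or x[a] == y[a] for a in attributes)


def similar_to(name, table, attributes):
    """Names of all objects that the object `name` is similar to."""
    x = table[name]
    return {other for other, y in table.items() if is_similar(x, y, attributes)}


attributes = ["a1", "a2", "a3"]
table = {
    "m1": {"a1": 1, "a2": MISSING, "a3": 0},
    "m2": {"a1": 1, "a2": 2, "a3": 0},
    "m3": {"a1": 0, "a2": 2, "a3": 0},
}

print(is_similar(table["m1"], table["m2"], attributes))  # m1 is similar to m2
print(is_similar(table["m2"], table["m1"], attributes))  # the converse fails
print(similar_to("m1", table, attributes))
```

The relation is reflexive but not symmetric: `m1` has less information than `m2` and is consistent with it, while `m2` has a known value for `a2` that `m1` cannot confirm. This is why it can stand in for an equivalence relation only after the usual rough set definitions are adapted.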
(3) Use the theory of incomplete information systems in rough sets for feature selection and sample identification.
First, a discretization algorithm based on attribute importance is applied to the decision table. Throughout discretization this algorithm keeps the classification ability of the decision table unchanged; the discretized table has fewer distinct attribute values, which effectively raises the recognition rate. Next, because the attributes obtained by attribute reduction should have as few missing values as possible, the concept of lower-approximation complete importance is defined, and an attribute reduction algorithm based on it is proposed. In value reduction, the improved asymmetric similarity relation is used to simplify the rules: rules that satisfy the relation can be merged, and a confidence is computed for each rule. Finally, sample identification also matches rules according to the improved asymmetric similarity relation. If several rules match, the rule with the highest confidence is chosen; if no rule matches, the sample is added to the set of unidentified samples.
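The identification step described above can be sketched as follows. Since the abstract does not define the improved relation, the matching test here falls back on a simple assumption: a rule applies when every known value in its antecedent equals the sample's value for that attribute. The `Rule` type, the attribute names `a` and `b`, and the confidence figures are all hypothetical.

```python
from typing import NamedTuple, Optional

MISSING = "*"  # assumed marker for an unknown attribute value


class Rule(NamedTuple):
    antecedent: dict   # attribute name -> required value (or MISSING)
    consequent: str    # class label assigned when the rule fires
    confidence: float  # confidence computed during value reduction


def matches(rule: Rule, sample: dict) -> bool:
    """Assumed matching test: each known antecedent value equals the sample's."""
    return all(
        value == MISSING or sample.get(attr, MISSING) == value
        for attr, value in rule.antecedent.items()
    )


def classify(sample: dict, rules: list) -> Optional[str]:
    """Return the consequent of the matching rule with the highest confidence,
    or None when no rule matches (the sample stays unidentified)."""
    candidates = [r for r in rules if matches(r, sample)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.confidence).consequent


rules = [
    Rule({"a": 0, "b": MISSING}, "spam", 0.90),
    Rule({"a": 0, "b": 1}, "ham", 0.75),
    Rule({"a": 1, "b": 1}, "ham", 0.80),
]

unidentified = []
for sample in ({"a": 0, "b": 1}, {"a": 1, "b": 0}):
    label = classify(sample, rules)
    if label is None:
        unidentified.append(sample)
    print(sample, "->", label)
```

The first sample matches two rules and takes the label of the more confident one. The second matches none and is collected in `unidentified`, mirroring the set of unidentified samples mentioned in the text.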
(4) Evaluate the proposed algorithm experimentally.
First, experiments are run with training sets of different sizes, giving the attribute reduction results together with recall, accuracy, and recognition rate. The results show that the proposed algorithm is stable and identifies spam effectively. Second, the algorithm is compared with the SVM, Bayesian, decision tree, and traditional rough set algorithms: its recall and accuracy reach 87.10% and 89.01% respectively, better than the other algorithms.
In summary, this thesis applies the methods for handling incomplete information systems in rough set theory to spam identification from mail headers for the first time, using discretization, knowledge reduction, and identification methods for incomplete information systems to obtain the decision table and classify mail. Two experiments verify the effectiveness of the proposed method. Judged by both recall and accuracy, the method achieves satisfactory results and lays a foundation for further research on spam filtering.
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.098



