An Improved TF-IDF Algorithm Implementation and Its Application to Spam E-mail Recognition

Published: 2019-05-20 15:40

[Abstract]: Internet technology has brought the 21st century into the information age, making the generation and dissemination of information more convenient than ever before. It is also a double-edged sword: the same convenience has led to a flood of junk information. Identifying and filtering out junk information from this vast volume of data is increasingly a hot research topic in computer science. E-mail, one of the most important Internet services, is likewise constantly disrupted by junk messages. A practical method for identifying and separating spam is therefore needed to safeguard normal communication and work. This thesis proposes a spam recognition strategy based on an improved TF-IDF (term frequency-inverse document frequency) algorithm. The strategy builds on the TF-IDF algorithm that is widely used in search engines. To address that algorithm's incomplete selection of spam feature words and the insufficient discriminative power of the words it selects, the strategy takes into account the distribution of feature terms across classes as well as content and position weights. The main points of the work are as follows:
1. An information entropy coefficient is introduced into the TF-IDF weight to correct the feature weights.
2. Because the traditional TF-IDF algorithm gives too little consideration to content and position, position and content weights are introduced as corrections in the IDF calculation.
3. An independence coefficient is introduced as a parameter that measures the association between a feature term and the class it is assigned to.
4. The parameters of the IDF calculation are simplified to exploit the binary nature of spam classification.
5. Comparative experiments on corpus data show that the improved TF-IDF algorithm performs considerably better than the traditional TF-IDF algorithm in recall, error rate, and F1 score.
Furthermore, we draw on support vector machine theory from machine learning and use the improved TF-IDF algorithm to build a model that recognizes and classifies spam. The model consists of three main modules: a training module, a testing module, and a statistics module. Through word segmentation of the e-mail text, extraction and filtering of feature terms, and conversion of the data representation for similarity comparison, these modules respectively train the system, classify unknown e-mails, and compile statistics on the e-mail data. We tested the system on the test e-mail set in the corpus. The experiments show that the Chinese spam recognition system we implemented can identify and isolate most spam reasonably effectively. Compared with a system based on the traditional TF-IDF algorithm and with a spam recognition system formerly used by Tencent, it shows a significant improvement, and it largely meets the need to screen out users' spam and safeguard their normal communication and work.
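The abstract does not reproduce the thesis's formulas, so the following is only a minimal sketch of the general idea behind the first improvement: scaling a baseline TF-IDF weight by a factor derived from the entropy of a term's distribution across the two classes. The function names, the specific factor (1 minus the class entropy in bits), and the toy data are assumptions made for illustration, not the author's definitions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Baseline TF-IDF. `docs` is a list of token lists.

    Returns one {term: weight} dict per document, using
    tf = count / document length and idf = ln(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def class_entropy_factor(term, docs, labels):
    """Hypothetical correction factor (an assumption, not the thesis's formula).

    Computes the entropy, in bits, of the class labels of the documents that
    contain `term`, and returns 1 - entropy. With two classes this lies in
    [0, 1]: 1.0 if the term occurs in one class only, 0.0 if it is split evenly.
    """
    per_class = Counter(
        label for doc, label in zip(docs, labels) if term in doc
    )
    total = sum(per_class.values())
    if total == 0:
        return 0.0
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in per_class.values()
    )
    return 1.0 - entropy


def entropy_weighted_tf_idf(docs, labels):
    """Scale each baseline TF-IDF weight by the term's class-entropy factor."""
    base = tf_idf(docs)
    vocab = {t for doc in docs for t in doc}
    factor = {t: class_entropy_factor(t, docs, labels) for t in vocab}
    return [{t: w * factor[t] for t, w in doc_w.items()} for doc_w in base]


# Toy, already-segmented e-mails (invented for illustration only).
docs = [
    ["free", "prize", "click"],
    ["free", "meeting", "notes"],
    ["prize", "click", "now"],
    ["meeting", "agenda", "notes"],
]
labels = ["spam", "ham", "spam", "ham"]

weights = entropy_weighted_tf_idf(docs, labels)
for label, doc_weights in zip(labels, weights):
    print(label, doc_weights)
```

In this toy data "free" appears once in each class, so its factor suppresses the weight, while "prize" appears only in spam and keeps its full baseline weight. A real system would also need Chinese word segmentation and class-size normalization, which this sketch omits.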
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.098


Article ID: 2481749

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2481749.html
