An Improved TF-IDF Algorithm Implementation and Its Application in Spam Recognition
[Abstract]: Internet technology has brought the 21st century into the information age, making information easier to generate and disseminate than ever before. That same convenience, however, has led to a proliferation of junk information, and identifying and removing it from very large volumes of data has become an active topic in computer science research. E-mail, one of the most important Internet services, is constantly disrupted by spam, so a practical method for identifying and separating spam is needed to keep everyday communication and work running normally. This thesis proposes a spam recognition strategy based on an improved TF-IDF (term frequency / inverse document frequency) algorithm, building on the TF-IDF algorithm widely used in search engines. To address the incomplete selection of spam feature words and the insufficient discrimination among them, the improved algorithm takes into account the distribution of feature terms across categories as well as content and position weights. The main improvements are as follows: 1. An information entropy coefficient is introduced into the TF-IDF algorithm to correct the feature weights. 2. Because the traditional TF-IDF algorithm gives insufficient consideration to content and position, position and content weights are adjusted during the calculation of the IDF value. 3. An independence coefficient is introduced as a parameter that measures the correlation between a feature term and its category. 4. Because spam recognition is a binary classification problem, the corresponding parameters in the IDF calculation are simplified. 5. Comparative experiments on a corpus show that the improved TF-IDF algorithm improves considerably on the traditional TF-IDF algorithm in recall, error rate, and F1 value.
Furthermore, the thesis draws on support vector machine theory from machine learning and builds a spam recognition and classification model using the improved TF-IDF algorithm. The model consists of three main modules: a training module, a test module, and a statistics module. Through text segmentation, feature term extraction and filtering, and conversion of the data into a form suitable for similarity comparison, these modules respectively train the system, classify unknown mail, and compile mail statistics. The system is tested on the test mail set in the corpus. The experiments show that the Chinese spam recognition system can identify and isolate most spam. In comparison with the traditional TF-IDF algorithm and a spam identification system that Tencent has used, it largely achieves the screening and separation of users' spam and supports normal communication and work.
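The abstract does not give the thesis's formulas, so the following is only a minimal sketch of the general idea behind the first improvement: a baseline TF-IDF weight scaled by a factor derived from how a term is distributed across the two classes. The function names `tf_idf` and `class_entropy_factor`, and the choice of `1 - H` as the factor, are assumptions made for illustration and are not the author's definitions. Position and content weights and the independence coefficient are not modeled here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Baseline TF-IDF. `docs` is a list of token lists.

    Returns one {term: weight} dict per document, using
    tf = count / document length and idf = ln(N / document frequency).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def class_entropy_factor(term, docs, labels):
    """Hypothetical correction factor in [0, 1] for a two-class problem.

    Computes the entropy (in bits) of the class labels of the documents
    that contain `term`, then returns 1 - entropy. A term that occurs in
    only one class scores 1; a term split evenly between the two classes
    scores 0. This is an illustrative choice, not the thesis's formula.
    """
    hits = Counter(label for doc, label in zip(docs, labels) if term in doc)
    total = sum(hits.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in hits.values())
    return 1.0 - entropy


def corrected_weights(docs, labels):
    """Scale each baseline TF-IDF weight by the class-entropy factor."""
    factors = {}
    result = []
    for doc_weights in tf_idf(docs):
        scaled = {}
        for term, weight in doc_weights.items():
            if term not in factors:
                factors[term] = class_entropy_factor(term, docs, labels)
            scaled[term] = weight * factors[term]
        result.append(scaled)
    return result
```

Under this sketch, a word that appears equally often in spam and legitimate mail has its weight pushed to zero, while a word confined to one class keeps its full TF-IDF weight. The resulting vectors could then be passed to a classifier such as a support vector machine, as the abstract describes.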
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.098
Article ID: 2481749
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2481749.html