
An Improved TF-IDF Algorithm Implementation and Its Application to Spam E-mail Identification

Published: 2019-05-20 15:40
[Abstract]: Internet technology has brought the 21st century into the information age, making the creation and spread of information easier than ever before. It is also a double-edged sword: the same ease has led to a flood of junk information. Identifying and filtering junk information out of this mass of data has become an active research topic in computer science. E-mail, one of the most important Internet services, is likewise constantly disrupted by spam. A practical method for identifying and separating spam is therefore needed to protect normal communication and work.

This thesis proposes a spam identification strategy based on an improved TF-IDF (term frequency-inverse document frequency) algorithm. The strategy builds on the TF-IDF algorithm widely used in search engines. To address that algorithm's incomplete selection of spam feature words and the insufficient discriminative power of the words it selects, the thesis takes into account the distribution of feature terms across categories as well as content and position weights. The main points are:

1. An information entropy coefficient is introduced into the TF-IDF weights to correct the features.
2. Because the traditional TF-IDF algorithm gives insufficient consideration to content and position, position and content weights are introduced as corrections in the IDF calculation.
3. An independence coefficient is introduced as a parameter that measures the association between a feature term and its category.
4. The parameters of the IDF calculation are simplified to exploit the binary nature of spam classification.
5. Comparative experiments on corpus data show that the improved TF-IDF algorithm performs considerably better than the traditional TF-IDF algorithm on recall, error rate, and F1 value.

Building on this, the thesis applies support vector machine theory from machine learning and uses the improved TF-IDF algorithm to build a model that identifies and classifies spam. The model has three main modules: a training module, a test module, and a statistics module. Through word segmentation of the mail text, extraction and filtering of feature terms, and conversion of the data format for similarity comparison, these modules respectively train the system, classify unknown mail, and compile mail statistics. The system was tested on the test mail set in the corpus. The experiments show that the implemented Chinese spam identification system can effectively identify and isolate most spam. It shows a significant improvement over a system based on the traditional TF-IDF algorithm and over a spam identification system formerly used by Tencent, and it largely meets the need to filter out users' spam and protect their normal communication.
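For orientation, the baseline that the abstract sets out to improve can be sketched as follows. This is a minimal illustration of classic TF-IDF weighting and of recall, precision, and F1 for a binary spam/ham task. It is not the thesis's improved formula: the abstract does not give the entropy, position, content, or independence terms, so none of them are reproduced here. The function names, the `"spam"`/`"ham"` labels, and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: tf(t, d) * log(N / df(t)) for each tokenized document.

    `docs` is a list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def recall_precision_f1(y_true, y_pred, positive="spam"):
    """Recall, precision, and F1 with `positive` treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


if __name__ == "__main__":
    # Toy, already-segmented messages (hypothetical data).
    docs = [["free", "prize", "free"], ["meeting", "notes"], ["free", "meeting"]]
    for weights in tf_idf(docs):
        print(weights)
    print(recall_precision_f1(["spam", "spam", "ham", "ham"],
                              ["spam", "ham", "spam", "ham"]))
```

A term that occurs in every document gets weight zero under this baseline, which is one reason plain TF-IDF can discriminate poorly between classes: it ignores how a term is distributed across spam and non-spam mail, the gap the abstract's corrections are aimed at.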
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.098




Article ID: 2481749


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2481749.html

