Research and Implementation of an Email Classification Method Based on a WordNet Concept Vector Space Model

Published: 2019-02-22 09:39
[Abstract]: With advances in computer technology and informatization, and especially the spread of the Internet, email has become a widely used means of communication because it is fast and inexpensive. As a result, email often reflects current social issues and the focus of public opinion. However, as email is used more and more frequently, the flood of spam, advertisements, and mass mailings increases the time users spend handling mail and hampers their ability to organize and retrieve information. If email could be classified, people could obtain the content they care about accurately, comprehensively, and quickly, which would greatly improve work efficiency and reduce losses of manpower, money, and materials. Email classification has therefore attracted the interest of many researchers.
Existing email classification techniques fall into three groups: statistical, connectionist, and rule-based. Common statistical methods include Naive Bayes, KNN, class-centroid vectors, regression models, support vector machines, and maximum entropy models. The common connectionist method is the artificial neural network. Common rule-based methods include decision trees and association rules. These methods share one problem: none of them considers the semantic relationships between words in the email text. Yet the words in real email text are often related, for example as synonyms or through hypernym/hyponym relations between synonym sets. Ignoring these relationships tends to produce a high-dimensional vector space, which in turn degrades classification performance and accuracy.
To address this problem, the thesis proposes a feature extraction method based on the WordNet ontology. Synonym sets replace individual terms, the hypernym/hyponym relations between synonym sets are taken into account, and a concept vector space model of the email text is built to serve as its feature vector, so that high-level information usable as category features can be extracted during training. The thesis also designs a method for determining the threshold (the percentage threshold method), which lets the threshold be adjusted to meet different recall and precision requirements. Finally, the proposed method is implemented, and experiments demonstrate the effectiveness of the email classification method based on the WordNet concept vector space model.
The proposed method improves on existing email classification methods and achieves gains in classification performance and efficiency. These results make it possible to obtain useful information quickly and accurately, thereby greatly improving people's work efficiency.
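The idea of replacing terms with synonym sets and crediting their hypernyms can be illustrated with a minimal sketch. The abstract does not give the thesis's actual weighting scheme or data structures, so everything below is an assumption made for illustration: the toy lexicon `SYNSET_OF`, the toy hierarchy `HYPERNYM_OF`, the function name `concept_vector`, and the `hypernym_weight` parameter are all hypothetical, and the identifiers are only styled after WordNet names rather than read from WordNet itself.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet: surface word -> synonym-set id.
SYNSET_OF = {
    "car": "car.n.01",
    "automobile": "car.n.01",
    "truck": "truck.n.01",
    "lorry": "truck.n.01",
    "invoice": "bill.n.02",
    "bill": "bill.n.02",
}

# Hypothetical hypernym links: synonym-set id -> its direct parent concept.
HYPERNYM_OF = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
}


def concept_vector(tokens, hypernym_weight=0.5):
    """Map tokens to concept counts instead of raw term counts.

    Each known token adds 1.0 to its synonym set and `hypernym_weight`
    (an assumed value, not taken from the thesis) to the direct hypernym.
    Tokens missing from the toy lexicon are skipped.
    """
    vec = Counter()
    for tok in tokens:
        synset = SYNSET_OF.get(tok.lower())
        if synset is None:
            continue
        vec[synset] += 1.0
        parent = HYPERNYM_OF.get(synset)
        if parent is not None:
            vec[parent] += hypernym_weight
    return dict(vec)


vec = concept_vector("The car and the automobile passed a lorry".split())
for concept in sorted(vec):
    print(concept, vec[concept])
```

In this sketch "car" and "automobile" collapse into a single dimension, and "car" and "lorry" both contribute to the shared parent concept, which is the kind of higher-level category signal the abstract describes. A real implementation would look concepts up in WordNet and would need word-sense disambiguation, which the sketch leaves out.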
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree granted]: 2008
[Classification number]: TP393.098



Article ID: 2428089


Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2428089.html

