Research on Feature-Selection-Based Text Classification Methods and Their Applications

Published: 2018-03-31 12:57

Topic: text classification. Focus: feature selection. Source: Jiangnan University, master's thesis, 2017

[Abstract]: With the continuing development of computer technology, the volume of online information has grown explosively. This information enriches people's lives, but it also includes much that is useless or even harmful, which makes it difficult to use information sensibly and effectively. How to accurately find useful information within large amounts of data remains an open problem in information technology. Text classification offers an effective solution. Traditional manual classification based on expert knowledge consumes a great deal of labor and time and can no longer keep pace with the growth of modern data, which has led to automatic text classification methods.

Feature selection is an indispensable technique in text classification, and how well it selects features strongly affects the classification result. The traditional chi-square feature selection method does not fully account for within-class term frequency or the distribution of feature terms. To address this, the thesis proposes a feature selection method that refines the chi-square statistic using intra-class information.

Among classification methods, the support vector machine (SVM) is one of the most typical machine learning methods for automatic text classification. It is simple, efficient, and accurate, and it has attracted wide attention from researchers. The thesis uses an SVM for text classification. To further improve accuracy, and because SVM parameters are difficult to choose, it proposes optimizing the SVM model with an improved artificial bee colony algorithm, in which the search strategies of the employed bees and onlooker bees of the basic algorithm are modified. This effectively improves classification accuracy.

To broaden the application area of text classification, a secondary bioinformatics database of the human p53 cancer gene is constructed as the text classification corpus. The database mainly contains exon and intron sequence information for the p53 gene in a variety of cancers, providing a good platform for in-depth cancer research. The thesis also proposes a sequence alignment method based on a pseudo-alignment cellular neural network and uses it to align the p53 cancer gene sequences in the database. This effectively improves the similarity of the alignments and provides a theoretical basis for further research on cancer text classification.
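As a point of reference for the feature selection discussion, the following is a minimal sketch of the classic chi-square score for a term and a class. It is the traditional baseline, not the thesis's improved method, which the abstract does not specify. The function names `chi_square` and `top_k_terms` and the toy corpus are hypothetical. Note that the score uses only whether a term occurs in a document, so within-class term frequency plays no role, which is the kind of limitation the abstract describes.

```python
def chi_square(docs, labels, term, target):
    """Classic chi-square score of `term` with respect to class `target`.

    docs   : list of token lists, one per document
    labels : list of class labels, same length as docs
    """
    n = len(docs)
    # 2x2 contingency counts over documents:
    # a: term present, in class     b: term present, not in class
    # c: term absent,  in class     d: term absent,  not in class
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_class = label == target
        if present and in_class:
            a += 1
        elif present:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def top_k_terms(docs, labels, target, k):
    """Return the k highest-scoring terms for `target` (ties broken alphabetically)."""
    vocab = sorted({t for tokens in docs for t in tokens})
    scored = [(chi_square(docs, labels, t, target), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Toy corpus (hypothetical data, for illustration only).
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["market", "bank"]]
labels = ["sport", "sport", "finance", "finance"]
print(top_k_terms(docs, labels, "sport", 2))
```

On this toy corpus, "goal" (present in every sport document) and "market" (absent from every sport document) receive the same score, because the classic statistic measures the strength of the association and not its direction.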
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1


Article ID: 1690835

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1690835.html
