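Research on Text Classification Methods Based on Feature Selection and Their Application
Published: 2018-03-31 12:57
Topic: text classification. Focus: feature selection. Source: Jiangnan University, master's thesis, 2017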
[Abstract]: With the continuing development of computer technology, the volume of online information is growing explosively. This information enriches people's lives, but it also includes much that is useless or even harmful, which makes it difficult to use information sensibly and effectively. Accurately finding useful information in large volumes of data has become a problem that the information technology field still needs to solve. Text classification offers an effective solution. Traditional manual classification based on expert knowledge consumes a great deal of labor and time and can no longer keep up with the growth of data in modern society, which has led to automatic text classification methods. Feature selection is an indispensable technique in text classification, and how well it selects features strongly affects the classification result. Because the traditional chi-square statistic feature selection method does not adequately account for within-class term frequency or the distribution of feature terms, this thesis proposes a feature selection method that refines the chi-square statistic with within-class information. Among classification methods, the support vector machine (SVM) is one of the most typical machine learning methods for automatic text classification. It is simple, efficient and accurate, and has attracted wide attention from researchers. This thesis uses an SVM for text classification. To further improve its accuracy and to address the difficulty of choosing SVM parameters, the thesis proposes optimizing the SVM model with an improved artificial bee colony algorithm, in which the search strategies of the employed bees and onlooker bees of the basic algorithm are modified, effectively improving classification accuracy. To broaden the application of text classification, a secondary bioinformatics database of the human p53 cancer gene is constructed as the classification corpus. The database mainly contains exon and intron sequence information for the p53 gene in multiple cancers and provides a good platform for in-depth cancer research. The thesis also proposes a sequence alignment method based on a pseudo-alignment cellular neural network to align and analyze the cancer p53 gene sequences in the database, which effectively improves the similarity obtained from sequence alignment and provides a theoretical basis for further research on cancer text classification.
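For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of the classic (unmodified) chi-square statistic for scoring terms against one class. It is not the thesis's improved method, which the abstract does not specify. The function name `chi_square_scores` and the toy documents are assumptions made for illustration only.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score every term for the class `target` with the classic chi-square statistic.

    docs   : list of token lists (hypothetical, already tokenized)
    labels : list of class labels, one per document
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        bucket = df_pos if y == target else df_neg
        bucket.update(set(tokens))  # count each term once per document

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in class, contains term
        b = df_neg[term]   # not in class, contains term
        c = n_pos - a      # in class, lacks term
        d = n_neg - b      # not in class, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


# Toy corpus (invented for illustration).
docs = [
    ["gene", "cancer"],
    ["gene", "mutation"],
    ["stock", "market"],
    ["market", "price"],
]
labels = ["bio", "bio", "fin", "fin"]
scores = chi_square_scores(docs, labels, "bio")
```

Note that the statistic uses only document counts: a term that appears once in a document counts the same as one that appears many times, and a term concentrated outside the class ("market" here) scores as high as one concentrated inside it. These are the kinds of limitations, within-class term frequency and term distribution, that the abstract says the proposed method addresses.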
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
Article ID: 1690835
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1690835.html