Manifold Learning and Its Application in Text Classification

Published: 2018-12-11 23:06

[Abstract]: As computing power and storage capacity grow, large-scale data acquisition has become easier and more common, but it also raises new problems. In many fields, such as text mining, biometric authentication, image analysis and computer vision, text analysis in information retrieval, and computational biology, the data obtained are high-dimensional, which is very likely to cause the "curse of dimensionality". In recent years, manifold learning has become an active research direction in machine learning. It seeks the regularities and structure hidden in a high-dimensional data space and is widely used as a nonlinear method for reducing the dimensionality of high-dimensional data. Text classification is a foundational technology for information retrieval, search engines, text databases, digital libraries and related fields, and has broad application prospects. Because text data are unstructured, the feature vectors used to represent text can reach tens of thousands or even hundreds of thousands of dimensions. Such high dimensionality greatly increases redundant feature information, which lowers classification accuracy. Dimensionality reduction shrinks the text vectors so that the feature vectors better represent the text or its category. This thesis assumes that a latent text manifold exists in the text vector space and treats each text as a point sampled from that manifold. It applies manifold learning in the text preprocessing stage of text classification and proposes a Bagging text classification algorithm based on ISOMAP, describing the relevant theoretical foundations and the concrete steps of the algorithm in reasonable detail. It then makes an incremental improvement to the ISOMAP algorithm, proposes a Bagging text classification algorithm based on incremental manifold learning, and carries out experimental comparison and analysis. The experiments show that applying manifold learning to text classification can effectively improve classification performance.
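
The pipeline named in the abstract (ISOMAP for dimensionality reduction, then Bagging for classification) can be illustrated with a minimal sketch. This is not the thesis's implementation. The toy semicircle data, the function names, the choice of k, and the 1-nearest-neighbour base classifier are all assumptions made for illustration, and the incremental ISOMAP variant is not covered. The sketch also assumes the neighbourhood graph is connected.

```python
import math
import random
from collections import Counter


def geodesic_distances(points, k):
    """Approximate manifold distances: k-NN graph plus Floyd-Warshall shortest paths."""
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]
    graph = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: euclid[i][j])
        for j in others[:k]:
            graph[i][j] = graph[j][i] = euclid[i][j]  # keep the graph symmetric
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph


def classical_mds(dist, dims, iters=200, seed=0):
    """Embed a distance matrix with classical MDS (power iteration plus deflation)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (D^2 is symmetric here).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total) for j in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


def bagging_predict(train_x, train_y, query, n_models=11, seed=0):
    """Majority vote over 1-NN classifiers, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.randrange(len(train_x)) for _ in train_x]
        nearest = min(sample, key=lambda i: math.dist(train_x[i], query))
        votes.append(train_y[nearest])
    return Counter(votes).most_common(1)[0][0]


# Toy data (assumed): 20 points on a unit semicircle, a 1-D manifold in 2-D.
points = [(math.cos(i * math.pi / 19), math.sin(i * math.pi / 19)) for i in range(20)]
labels = [0] * 10 + [1] * 10
geo = geodesic_distances(points, k=2)
embedded = classical_mds(geo, dims=1)

held_out = [3, 16]
train = [i for i in range(20) if i not in held_out]
predictions = [
    bagging_predict([embedded[i] for i in train], [labels[i] for i in train], embedded[q])
    for q in held_out
]
```

In this sketch all points, including the held-out ones, are embedded together, because plain ISOMAP has no built-in way to place unseen points; that limitation is the usual motivation for an incremental variant. For real text vectors, a library implementation such as scikit-learn's `Isomap` and `BaggingClassifier` would likely be a more practical starting point than this pure-Python version, whose shortest-path step is cubic in the number of documents.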
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1

Article ID: 2373390

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2373390.html
