Manifold Learning and Its Application in Text Classification
[Abstract]: As computing power and storage capacity grow, large-scale data acquisition has become more convenient and widespread, but it also brings new problems. Many fields, such as text mining, biometric authentication, image analysis, computer vision, text analysis in information retrieval, and computational biology, produce high-dimensional data, which can lead to the "curse of dimensionality". In recent years, manifold learning has become an active research area in machine learning. It aims to uncover the hidden regularities and structure of data in a high-dimensional space and is widely used as a nonlinear dimensionality reduction method. Text classification, the technical foundation of information retrieval, search engines, text databases, digital libraries, and similar systems, has broad application prospects. Because text data are unstructured, feature vectors can reach tens or even hundreds of thousands of dimensions. Such high dimensionality greatly increases the amount of redundant feature information, which lowers classification accuracy. Dimensionality reduction shrinks the text vectors and lets the feature vectors better represent text or category characteristics. This thesis assumes that a latent text manifold exists in the text vector space, treats each text as a sample point on that manifold, applies manifold learning to the preprocessing stage of text classification, and proposes a Bagging text classification algorithm based on ISOMAP. It describes the relevant theory and the detailed flow of the algorithm, then extends ISOMAP to an incremental version, proposes a Bagging text classification algorithm based on incremental manifold learning, and compares and analyzes the algorithms experimentally. Experimental results show that manifold learning can effectively improve text classification performance.
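The abstract gives no implementation details, so the following is only a minimal sketch of the general idea (ISOMAP as a preprocessing step, followed by Bagging), not the thesis's algorithm. Everything concrete in it is an assumption made for illustration: the toy term-count vectors, the hypothetical labels "sports" and "finance", the 1-nearest-neighbour base learner, the parameter values, and the transductive shortcut of embedding labeled and unlabeled documents together. It uses only the Python standard library.

```python
import math
import random
from collections import Counter


def isomap(points, n_neighbors, n_components=2, iters=200):
    """Bare-bones ISOMAP: kNN graph -> geodesic distances -> classical MDS."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # 1. Symmetric k-nearest-neighbour graph; missing edges are infinite.
    geo = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:n_neighbors]:
            geo[i][j] = geo[j][i] = dist[i][j]

    # 2. Geodesic distances along the graph (Floyd-Warshall shortest paths).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                geo[i][j] = min(geo[i][j], geo[i][k] + geo[k][j])
    if any(math.isinf(d) for row in geo for d in row):
        raise ValueError("neighbourhood graph is disconnected; raise n_neighbors")

    # 3. Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in geo]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    # 4. Leading eigenpairs of B by power iteration with deflation.
    rng = random.Random(0)
    embedding = [[0.0] * n_components for _ in range(n)]
    for c in range(n_components):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))  # clamp: geodesic B may not be PSD
        for i in range(n):
            embedding[i][c] = v[i] * scale
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return embedding


def bagging_predict(train_x, train_y, query, n_estimators=15, seed=0):
    """Majority vote over 1-NN classifiers, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_estimators):
        sample = [rng.randrange(len(train_x)) for _ in train_x]
        nearest = min(sample, key=lambda i: math.dist(train_x[i], query))
        votes[train_y[nearest]] += 1
    return votes.most_common(1)[0][0]


# Toy term-count vectors over a hypothetical 4-word vocabulary.
docs = [
    [3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 1, 0], [2, 2, 0, 1],  # "sports"
    [0, 0, 3, 2], [0, 1, 2, 3], [1, 0, 3, 3], [0, 0, 2, 2],  # "finance"
    [3, 2, 1, 0], [0, 1, 3, 2],                              # unlabeled
]
labels = ["sports"] * 4 + ["finance"] * 4

# Transductive shortcut: embed labeled and unlabeled documents together.
embedded = isomap(docs, n_neighbors=5)
train_x, queries = embedded[:8], embedded[8:]
predictions = [bagging_predict(train_x, labels, q) for q in queries]
print(predictions)
```

Because this sketch recomputes the whole embedding whenever documents are added, it does not attempt the incremental ISOMAP variant the abstract mentions. Real text vectors would also be far larger and sparser than these toy examples.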
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 2373390
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2373390.html