

Research on Text Classification Models and Algorithms for Small Samples

Published: 2018-06-20 21:29

Topics: text classification + machine learning; Source: PhD dissertation, University of Electronic Science and Technology of China, 2017


【Abstract】: Text data is high-dimensional and sparse, and its volume is growing explosively. This poses two difficulties for traditional machine learning algorithms. First, classifiers with high accuracy, such as support vector machines and artificial neural networks, mostly cannot be applied to massive data mining or online classification because of bottlenecks in training efficiency and computational resource consumption. Second, classifiers with linear time complexity, such as centroid classifiers, naive Bayes, and logistic regression, usually achieve low accuracy. This thesis therefore studies two topics: methods for extracting small-sample datasets and classification methods for small-sample datasets. Here, a "small sample" is one that is both low-dimensional and small in number. First, the small-sample extraction methods studied in this thesis, namely feature selection and instance selection, can condense massive datasets and thus address the first problem. Second, the thesis studies linear classification models for small samples, aiming to learn high-accuracy classifiers from small-sample datasets and thus address the second problem. The main contributions are as follows.

A new statistical index, the LW-index, is proposed to evaluate feature subsets and, in turn, dimensionality-reduction algorithms. The proposed method is a "classical statistics" approach that scores the quality of a feature subset from an empirical estimate computed on the subset itself. Traditional feature-subset evaluation splits the data, restricted to a given subset, into a training set, used to estimate the parameters of a classification model, and a test set, used to estimate the model's predictive performance; the results of several such predictions are then averaged, i.e., cross-validation (CV). Cross-validation, however, is very time-consuming and computationally expensive. Experiments show that the proposed method agrees with five-fold cross-validation in its evaluation of dimensionality-reduction algorithms while costing only 1/10 and 1/2 of the time of evaluation with an SVM (Support Vector Machine) and a CBC (Centroid-Based Classifier), respectively.

A feature selection algorithm, SFS-LW, based on the sequential forward search (SFS) strategy is proposed. In text classification, wrapper feature selection algorithms pick features that are highly valuable for classification, but their evaluation step carries a very high time complexity. This thesis therefore combines the forward sequential search strategy commonly used in wrapper methods with the LW-index, yielding a new filter algorithm, SFS-LW. Experiments show that SFS-LW attains classification accuracy close to that of wrapper methods while running several times faster, with time consumption close to existing filter methods.

A linear adaptive support-vector selection algorithm, Shell Extraction (SE), is proposed. To address the inability of traditional classification algorithms to handle massive datasets, the thesis exploits the uneven density of samples in the vector space to identify support vectors, thereby achieving large-scale dataset reduction and noise filtering. Traditional instance selection algorithms are mostly based on nearest neighbors or clustering; their high time complexity likewise prevents application to massive datasets. Experiments show that SE not only exceeds existing algorithms in accuracy but also runs far faster than existing instance selection algorithms.

A new classification model, the Gravitation Model (GM), is proposed. Centroid-based classification, being simple and efficient, is among the most widely used approaches to text classification. However, its accuracy depends heavily on the distribution of the training samples: when that distribution is skewed, the centroid model cannot fit the training data well, and classification suffers. The proposed GM model alleviates this underfitting. In the training stage, GM learns for each class a mass factor characterizing the distribution of that class's samples; in the test stage, GM assigns an unknown sample to the class exerting the greatest gravitation on it. A GM learning algorithm, AAC-SLA, combining the arithmetical average centroid (AAC) with a stochastic mass-factor learning algorithm (Stochastic Learning Mass, SLA), is proposed. Experiments show that AAC-SLA consistently outperforms the original centroid classifier in accuracy, matches the performance of the best current centroid classifiers, and is more stable. A second GM learning algorithm, MEB-SLA, combines the minimum enclosing ball (MEB) with SLA; the MEB removes the influence that randomly distributed samples within a class exert on the position of the arithmetical average centroid. Experiments show that MEB-SLA outperforms AAC-SLA, and both exceed the support vector machine on small-sample datasets.

Finally, the proposed SFS-LW and SE algorithms are used to generate small-sample datasets whose feature dimensionality and sample count are each 1/10 of the original, and AAC-SLA, MEB-SLA, and SVM are trained on them. Experiments show that the accuracy of AAC-SLA and MEB-SLA drops only slightly on most datasets and consistently exceeds that of SVM. The conclusions are: (1) for learning tasks on small and medium datasets, MEB-SLA can be used directly; (2) for large-scale learning tasks, SE combined with AAC-SLA can be used.
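The gravitation model described above can be illustrated with a toy sketch. This is only an illustration, not the thesis's method: the thesis learns each class's mass factor stochastically (SLA), whereas the version below sets it from the class's spread, a simplifying assumption; all function names are hypothetical.

```python
import numpy as np

def train_gm(X, y):
    """Toy gravitation model: one centroid per class plus a scalar
    'mass' factor. Here the mass is derived from class spread (an
    assumption for demonstration; the thesis learns masses via SLA)."""
    centroids, masses = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        centroids[c] = Xc.mean(axis=0)  # arithmetical average centroid (AAC)
        spread = np.linalg.norm(Xc - centroids[c], axis=1).mean()
        masses[c] = 1.0 / (spread + 1e-9)  # tighter class -> larger mass
    return centroids, masses

def predict_gm(x, centroids, masses):
    """Assign x to the class exerting the largest 'gravitation':
    mass divided by squared distance to the class centroid."""
    def gravity(c):
        d2 = np.sum((x - centroids[c]) ** 2) + 1e-9
        return masses[c] / d2
    return max(centroids, key=gravity)
```

The mass factor is what lets the model compensate for skewed class distributions: a class whose samples are spread thinly can still attract test points that a plain nearest-centroid rule would misassign.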
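For contrast, the cross-validation baseline that the LW-index is designed to replace can be sketched generically. The nearest-centroid model below merely stands in for the classifiers named in the abstract; the function and its helpers are hypothetical illustrations, not the thesis's code.

```python
import numpy as np

def cv_score_subset(X, y, subset, k=5, seed=0):
    """Score a feature subset by k-fold cross-validation with a
    nearest-centroid classifier: restrict X to the chosen columns,
    train on k-1 folds, test on the held-out fold, average accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr, Xte = X[train][:, subset], X[test][:, subset]
        # nearest-centroid "model": one mean vector per class
        cents = {c: Xtr[y[train] == c].mean(axis=0)
                 for c in np.unique(y[train])}
        pred = [min(cents, key=lambda c: np.linalg.norm(x - cents[c]))
                for x in Xte]
        accs.append(np.mean(np.array(pred) == y[test]))
    return float(np.mean(accs))
```

Every candidate subset requires k model fits, which is the computational cost the LW-index avoids by scoring the subset from a single statistic instead.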
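The sequential forward search strategy underlying SFS-LW is a generic greedy procedure. Below is a minimal sketch, assuming an arbitrary subset-scoring function in place of the LW-index, whose exact formula is not reproduced here.

```python
def sequential_forward_search(n_features, score, k):
    """Greedy SFS: grow a feature subset one feature at a time, at
    each step adding the feature that maximizes score(subset).
    `score` is any subset-quality function, e.g. a cheap filter
    index used instead of expensive cross-validation."""
    selected = []
    remaining = set(range(n_features))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With n features and a target size k, the search calls `score` O(n·k) times, which is why pairing it with a fast filter index rather than a wrapper evaluation changes its practical cost so dramatically.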
【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Doctoral
【Year conferred】: 2017
【CLC number】: TP391.1





Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2045729.html


