Research on SVM-Based Imbalanced Data Classification Algorithms and Their Application
Topic: SVM + imbalanced data classification; Source: Huaqiao University, master's thesis, 2017
[Abstract]: With the development of computer and information technology, large volumes of data are generated every day in production and daily life. Effectively discovering the knowledge and patterns in these data, and using them for classification and prediction, has become an important research topic in artificial intelligence and machine learning. SVM is a classification algorithm based on statistical learning theory and the structural risk minimization principle. Its decision function is determined by only a small number of support vectors, so adding or removing some non-support-vector samples does not affect the model's performance. Compared with traditional classification algorithms, SVM generalizes well, is less prone to local minima, is well suited to high-dimensional, small-sample problems, and handles balanced datasets effectively. When the two classes are unevenly distributed, however, SVM shows two weaknesses. First, because SVM is based on soft-margin maximization, the separating hyperplane in the boundary region skews toward the minority class. Second, the imbalanced ratio of support vectors means that test samples are surrounded by more negative support vectors. To address these difficulties, this thesis studies imbalanced data classification at both the data level and the algorithm level, and applies the resulting algorithms to Weibo sentiment classification. The main work covers the following three aspects:
1) At the data level, a resampling method called BADASYN is proposed, based on adaptive synthesis of class-boundary samples. The algorithm first identifies the minority-class samples in the class-boundary region, then adaptively synthesizes additional minority-class samples according to their distribution, and adds the synthesized samples to the training set. When an SVM is trained on a BADASYN-resampled dataset, its support vectors consist mainly of the newly synthesized samples, which ultimately moves the separating hyperplane toward the majority class.
2) At the algorithm level, a selective ensemble learning method called NCAB-SVM is proposed, based on negative correlation learning and the AdaBoost SVM algorithm. Negative correlation learning is integrated into the AdaBoost SVM training process in order to train a set of strong, highly diverse SVM classifiers that together form an even stronger ensemble, that is, a combination of strong learners. The algorithm uses negative correlation learning to compute the correlation among base classifiers, adaptively adjusts the weight of each base classifier according to that correlation, and thereby obtains the weighted decision classifier.
3) To address the imbalance in both sample distribution and feature distribution that arises in Weibo sentiment classification, the data-level and algorithm-level methods are combined, and the SVM-based imbalanced data classification algorithm is used to classify the sentiment polarity of Weibo posts. First, BADASYN adaptively synthesizes minority-class samples to adjust the imbalance ratio of the training set. Next, NCAB-SVM trains a series of SVM base classifiers and selectively combines them into a decision system. Finally, the method is evaluated on crawled Sina Weibo datasets from different domains and on public evaluation datasets.
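The data-level idea in item 1 (find minority samples near the class boundary, synthesize more of them adaptively, add them to the training set) can be sketched roughly as follows. This is a minimal illustration only, not the thesis's BADASYN procedure, which the abstract does not specify. The function name `borderline_adaptive_oversample`, the neighbourhood size `k`, the "at least half but not all neighbours are majority" selection rule (borrowed from Borderline-SMOTE), and the ratio-proportional allocation (borrowed from ADASYN) are all assumptions made for the example.

```python
import math
import random


def borderline_adaptive_oversample(minority, majority, n_new, k=5, seed=0):
    """Synthesize `n_new` minority points concentrated near the class boundary.

    Hypothetical sketch: the selection and allocation rules are assumptions
    taken from Borderline-SMOTE and ADASYN, not the thesis's exact BADASYN.
    Points are tuples of floats; returns a list of new tuples.
    """
    rng = random.Random(seed)
    points = list(minority) + list(majority)  # minority occupies indices [0, n_min)
    n_min = len(minority)
    if n_min < 2:
        return []

    def neighbours(i, pool_size):
        """Indices of the (up to) k nearest points to points[i] in points[:pool_size]."""
        others = [j for j in range(pool_size) if j != i]
        return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

    # Step 1: keep minority points whose neighbourhood is mixed, i.e. at least
    # half (but not all) of their nearest neighbours belong to the majority.
    borderline, ratios = [], []
    for i in range(n_min):
        near = neighbours(i, len(points))
        ratio = sum(1 for j in near if j >= n_min) / len(near)
        if 0.5 <= ratio < 1.0:
            borderline.append(i)
            ratios.append(ratio)
    if not borderline:
        return []

    # Step 2: spend the synthesis budget in proportion to each point's majority
    # ratio, so points in harder regions are used as seeds more often.
    chosen = rng.choices(borderline, weights=ratios, k=n_new)

    # Step 3: interpolate between each seed and a random nearby minority point.
    synthetic = []
    for i in chosen:
        j = rng.choice(neighbours(i, n_min))
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(points[i], points[j]))
        )
    return synthetic
```

In use, the returned points would be appended to the minority class before training the SVM, for example `train_minority = minority + borderline_adaptive_oversample(minority, majority, n_new=len(majority) - len(minority))`. Plain lists and `math.dist` keep the sketch dependency-free; a practical version would likely use a vectorized nearest-neighbour search.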
[Degree-granting institution]: Huaqiao University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP18
Article ID: 2065441
Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2065441.html