Research on Data Mining Classification Algorithms Based on Imbalanced Datasets

Published: 2018-04-28 22:10

  Topics: data mining + imbalanced datasets; Source: Lanzhou University of Technology, 2017 master's thesis


[Abstract]: The 21st century is a highly information-driven era. Data carries a large amount of useful information that can be mined, and how to process data and extract valuable information has become an urgent problem. Classification is an important research branch of data mining and an important form of data analysis. In practice, the classes with the most research value are often those with very few samples; datasets containing such classes are called imbalanced datasets. How to identify the minority class effectively in an imbalanced dataset is the focus of this thesis. The main contributions are as follows.

(1) To address the low classification accuracy of the positive class in imbalanced datasets, an algorithm that integrates C4.5 with an improved naive Bayes classifier (C4.5-INB) is proposed. The improved naive Bayes result is obtained by multiplying the majority-class probability by a proportional coefficient, and the original data is also classified with C4.5. Based on the two classification results, the weights of the two base classifiers are determined by either the equal-weight method or the optimal-combiner priority method, and the final classification is obtained by average voting. The three algorithms were compared on UCI datasets; the results show that the proposed algorithm classifies more accurately and is more stable.

(2) To address the noisy data and low classification accuracy that arise when classifying imbalanced datasets, an active-learning SVM classification algorithm for imbalanced datasets based on an improved SMOTE is proposed. On the training set, the algorithm uses the membership values of the minority-class samples, selected by majority vote, to control the number of synthetic minority samples. A hyperplane is then drawn using a distance formula as the criterion, and equal numbers of the nearest majority-class samples are selected symmetrically on both sides of the classification hyperplane to form a balanced sampled dataset. A support vector machine (SVM) is trained on this dataset to obtain an optimized classifier. Active learning then applies the optimized classifier iteratively to the imbalanced dataset with the training samples removed, until no samples remain. Experiments on UCI datasets show that the proposed algorithm effectively reduces the influence of noisy data on classification and improves classification accuracy on imbalanced datasets.

(3) To address the poor classification performance on high-dimensional imbalanced datasets, a classification algorithm based on improved unsupervised linear differential projection (I-ULDP) is proposed. The algorithm first constructs all the local patches into which a sample is divided on the same manifold, so that each sample has its own manifold space. It then constructs the minimum local embedding and the maximum global variance of each submanifold, and solves an optimization objective to obtain the low-dimensional manifold embedded in the high-dimensional space. Finally, the SVM classification hyperplane is set according to manifold distance, and the final classifier is obtained by training the SVM. Validation on UCI datasets shows that the I-ULDP algorithm has a clear advantage on high-dimensional imbalanced datasets.
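The combination step described in item (1) of the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names `scale_majority` and `weighted_vote`, the coefficient value, and the example probabilities are all assumptions, and the per-class probabilities from naive Bayes and from the decision tree are taken as given inputs rather than computed here.

```python
def scale_majority(probs, majority_label, coeff):
    """Multiply the majority-class probability by coeff, then renormalize.

    probs maps class label -> probability from a naive Bayes model.
    coeff < 1 shifts probability mass toward the minority class.
    """
    scaled = {
        label: (p * coeff if label == majority_label else p)
        for label, p in probs.items()
    }
    total = sum(scaled.values())
    return {label: p / total for label, p in scaled.items()}


def weighted_vote(nb_probs, tree_probs, w_nb=0.5, w_tree=0.5):
    """Average the two base classifiers' probabilities with the given weights.

    Equal weights correspond to the equal-weight method named in the abstract;
    other weightings would be supplied by the caller.
    """
    combined = {
        label: w_nb * nb_probs[label] + w_tree * tree_probs[label]
        for label in nb_probs
    }
    predicted = max(combined, key=combined.get)
    return predicted, combined


# Hypothetical probabilities for one test sample (assumed values).
nb_raw = {"neg": 0.6, "pos": 0.4}      # plain naive Bayes output
tree_out = {"neg": 0.55, "pos": 0.45}  # decision-tree leaf frequencies

nb_improved = scale_majority(nb_raw, majority_label="neg", coeff=0.5)
label, combined = weighted_vote(nb_improved, tree_out)
```

With these assumed inputs, both base classifiers would individually favor the majority class, while the scaled and averaged result favors the minority class, which is the effect the abstract attributes to the proportional coefficient.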
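For context on item (2), the sketch below shows the standard SMOTE interpolation step, in which a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. It is not the improved variant proposed in the thesis: the abstract does not specify how membership values and majority voting control the number of synthetic samples, so that part is omitted. The function name `smote` and the parameters `n_new`, `k`, and `seed` are assumptions for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by standard SMOTE interpolation.

    minority is a list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(
            tuple(b + gap * (m - b) for b, m in zip(base, mate))
        )
    return synthetic


# Hypothetical minority-class samples (assumed values).
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_samples, n_new=5)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region the minority class already occupies; controlling how many such points to create is where the thesis's improvement applies.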
[Degree-granting institution]: Lanzhou University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

Article ID: 1817090

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1817090.html

