
Research on Ensemble Classification Methods for Class-Imbalanced and Cost-Sensitive Data

Published: 2018-05-28 21:29

  Topic: machine learning + class-imbalanced classification; Reference: University of Science and Technology of China, master's thesis, 2017


[Abstract]: With the arrival of the big data era, machine learning, as the theoretical cornerstone of modern data analysis, plays a vital role while facing challenges large and small. Classification, one of the most basic and central problems in machine learning, continues to attract close attention from the research community. Traditional classification algorithms generally rest on two assumptions: first, that the classes contain roughly the same number of samples; second, that misclassification costs are roughly equal across classes. Real-world datasets, however, are often class-imbalanced and cost-sensitive, which makes traditional accuracy-driven classification algorithms unsuitable. Class imbalance means that the number of samples differs greatly between classes; cost sensitivity means that the cost of a misclassification differs greatly between classes. On class-imbalanced data, traditional algorithms tend to misclassify minority-class samples in pursuit of high accuracy, although those samples are often the more important ones; on cost-sensitive data, they ignore misclassification costs and therefore cannot minimize the total misclassification cost.

Because both problems are common and important in practice, researchers in China and abroad have studied them extensively and proposed a wide range of solutions. These solutions work broadly at two levels: at the data level, by restructuring the training set to change the sample distribution, typically through resampling; and at the algorithm level, by redesigning existing algorithms to accommodate the two problems, typically through cost-sensitive learning and Boosting-based methods. Ensemble learning plays a pivotal role in these methods. After more than a decade of research the field has made notable progress, but problems such as overfitting and information loss remain, and they affect the stability and reliability of classification models.

This thesis addresses the class imbalance problem and the cost sensitivity problem and makes two contributions:
· It proposes two resampling-based ensemble classification methods, xEnsemble and RSEnsemble. It first introduces the theoretical foundations of the two methods, then improves on existing algorithms, and finally proves their effectiveness theoretically from the perspectives of bias-variance decomposition and error-ambiguity decomposition, respectively.
· It applies xEnsemble and RSEnsemble to a real diabetes diagnosis dataset that is large, highly class-imbalanced, and cost-sensitive. It first fixes the evaluation criteria and then preprocesses the dataset; the experimental results show that, compared with other similar methods, the two methods achieve better classification performance.
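The abstract's point that accuracy can hide poor minority-class performance and a high total misclassification cost can be illustrated with a small worked example. The sketch below is not taken from the thesis: the two confusion matrices, the class sizes, and the cost values (`COST_FN`, `COST_FP`) are hypothetical numbers chosen only for illustration.

```python
# Illustrative sketch only: all counts and costs below are assumed,
# not results from the thesis.

def accuracy(tp, fn, fp, tn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def minority_recall(tp, fn):
    """Fraction of minority (positive) samples that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def total_cost(fn, fp, cost_fn, cost_fp):
    """Total misclassification cost given per-error costs."""
    return fn * cost_fn + fp * cost_fp


# Hypothetical dataset: 990 majority (negative) and 10 minority (positive) samples.
# Classifier A labels every sample as negative.
clf_a = dict(tp=0, fn=10, fp=0, tn=990)
# Classifier B finds 8 of the 10 positives but raises 50 false alarms.
clf_b = dict(tp=8, fn=2, fp=50, tn=940)

# Assumed costs: missing a positive is 100 times worse than a false alarm.
COST_FN, COST_FP = 100, 1

for name, m in (("A", clf_a), ("B", clf_b)):
    acc = accuracy(**m)
    rec = minority_recall(m["tp"], m["fn"])
    cost = total_cost(m["fn"], m["fp"], COST_FN, COST_FP)
    print(f"classifier {name}: accuracy={acc:.3f} recall={rec:.2f} cost={cost}")
```

Under these assumed numbers, classifier A has the higher accuracy yet finds no minority samples and incurs the higher total cost, which is the failure mode the abstract attributes to accuracy-driven algorithms.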
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181

