Research on an Improved K-Nearest Neighbor Method for Imbalanced Dataset Classification

Published: 2018-09-19 15:34

[Abstract]: In today's information explosion, efficiently extracting the information people need from large volumes of complex, changing data is a problem in urgent need of a solution. Researchers in machine learning, artificial intelligence, and pattern recognition have studied this problem in depth, and classification is one of the important research directions. After years of work, many methods with good classification performance have been applied to classification problems. However, these methods mainly take the overall misclassification rate, accuracy, and recall as their objectives, and on imbalanced datasets optimizing these overall metrics tends to lower the recognition rate for samples of minority classes and sparsely distributed classes. Because practical applications place increasing weight on minority-class accuracy, improving the recognition rate of minority-class samples while preserving the overall classification quality on imbalanced datasets is an active research topic. This thesis studies the K-nearest neighbor method for classifying imbalanced datasets. The specific work is as follows: (1) To address the slow classification speed of the traditional K-nearest neighbor method, which is caused by the large number of similarity computations needed to find neighbors, representative samples and thresholds are introduced. The neighbors of each test sample are selected only from those classes whose representative sample has a similarity to the test sample that is no less than the corresponding threshold. This reduces the amount of computation and improves classification speed without affecting classification accuracy.
(2) To address the low classification accuracy of the traditional K-nearest neighbor method on imbalanced datasets, class representativeness and sample representativeness are proposed. Neighbor samples with high class representativeness and minority-class samples are given larger weights, which weakens the influence of majority classes and densely distributed classes on the decision and thereby improves the accuracy of the K-nearest neighbor method on imbalanced datasets. UCI classification datasets are used as the experimental data. A comparison of the performance metrics of the traditional and improved K-nearest neighbor methods shows that the improved method raises classification performance to some extent.
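The two ideas in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual definitions of representative sample, threshold, class representativeness, or sample representativeness, so everything below is an assumption made for illustration only: the representative sample of a class is taken to be its mean vector, similarity is taken to be 1 / (1 + Euclidean distance), the class weight is the inverse class frequency, and the sample weight is a neighbor's similarity to its own class representative. All function names, parameter names, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def similarity(a, b):
    # Assumed similarity measure in (0, 1]; the thesis may use a different one.
    return 1.0 / (1.0 + math.dist(a, b))


def class_representatives(X, y):
    # Assumption: the representative sample of a class is its mean vector.
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}


def candidate_classes(reps, query, thresholds=None, default_threshold=0.2):
    # Idea (1): keep only classes whose representative is similar enough.
    thresholds = thresholds or {}
    kept = {c for c, rep in reps.items()
            if similarity(query, rep) >= thresholds.get(c, default_threshold)}
    return kept or set(reps)  # assumed fallback: search every class


def plain_knn_predict(X, y, query, k=3):
    # Traditional K-nearest neighbor: unweighted majority vote.
    pool = sorted(zip(X, y), key=lambda p: similarity(query, p[0]), reverse=True)
    return Counter(label for _, label in pool[:k]).most_common(1)[0][0]


def improved_knn_predict(X, y, query, k=3, thresholds=None, default_threshold=0.2):
    reps = class_representatives(X, y)
    counts = Counter(y)
    allowed = candidate_classes(reps, query, thresholds, default_threshold)
    # Similarities are computed only for samples of the candidate classes.
    pool = [(similarity(query, x), x, label)
            for x, label in zip(X, y) if label in allowed]
    pool.sort(key=lambda t: t[0], reverse=True)
    # Idea (2): weighted vote that favors minority classes and typical samples.
    votes = defaultdict(float)
    for sim, x, label in pool[:k]:
        class_weight = len(y) / counts[label]       # assumed: inverse frequency
        sample_weight = similarity(x, reps[label])  # assumed: closeness to representative
        votes[label] += sim * class_weight * sample_weight
    return max(votes, key=votes.get)


# Toy imbalanced data: 8 majority ("neg") samples and 2 minority ("pos") samples.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
     [1.5, 1.5], [1.2, 1.6], [1.6, 1.2], [3, 3], [3.2, 2.8]]
y = ["neg"] * 8 + ["pos"] * 2

print(plain_knn_predict(X, y, [2.1, 2.1], k=5))     # neg
print(improved_knn_predict(X, y, [2.1, 2.1], k=5))  # pos
```

On this toy data the five nearest neighbors of the query are three majority samples and two minority samples, so the unweighted vote picks the majority class while the weighted vote picks the minority class. Whether the thesis's own weighting behaves the same way depends on definitions that the abstract does not state.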
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13


Article ID: 2250543


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2250543.html
