Research on Imbalanced Data Classification Methods Based on the Immune System

Published: 2018-10-25 06:27

[Abstract]: With the development of cloud computing and mobile technology, the Internet has entered the era of big data. Faced with rapidly growing volumes of multimedia information, people need effective content management and fast information retrieval. Classification algorithms build a model by learning from labeled data and then classify and label new data; they are widely used in computer vision, text recognition, speech recognition, document categorization, and other fields. Classification algorithms based on labeled data, such as naive Bayes, logistic regression, support vector machines, and decision trees, are now mature. However, these algorithms all depend on the size of the data set: according to learning theory, accuracy can exceed the critical point only when the sample size exceeds a prescribed lower bound. At the same time, imbalanced data sets are common in real life, where the minority class is usually the one of greater interest and misclassifying it carries a higher cost. To resolve this conflict, this thesis studies imbalanced data classification methods based on the immune system. Drawing on the principles and characteristics of the human immune system, it addresses binary imbalanced classification, multi-class imbalanced classification, imbalanced classification under sparse minority-class density, and imbalanced classification under within-class cluster imbalance. The main work and contributions are as follows.

(1) For the binary imbalanced setting, the thesis studies the theory and method of improving classifier performance through oversampling based on immune centroids. In binary learning, the majority (negative) class has more samples than the minority (positive) class, so standard learning algorithms tend to favor the majority class, and the misclassification rate of the minority class is markedly higher than that of the majority class. The proposed immune-centroid-based oversampling method (ICOTE) draws on immune network theory: through proliferation, mutation, and suppression it produces immune centroids that are used to augment the minority class until the class distribution is balanced. The immune centroids reflect the distribution of the minority class, so the augmented sample set keeps the shape of the original samples and no new clusters are created. ICOTE therefore avoids overfitting while also overcoming the weakness of random synthetic sampling methods, which ignore the spatial distribution of the samples.

(2) For the multi-class imbalanced setting, the thesis studies the theory and method of improving classifier performance through oversampling based on multiple immune sub-networks. Compared with binary learning, multi-class learning faces new problems such as a larger search space, higher algorithmic complexity, and overlapping class regions, so binary methods often cannot simply be carried over. The imbalance problem also becomes more pronounced: there is more than one minority class and class overlap is more common, which leads traditional classifiers to neglect the minority classes and concentrate on reducing the majority-class error rate. The proposed global oversampling method based on immune centroids (Global-IC) generates an immune sub-network in the region of each minority class and uses the network nodes to augment that class, until the whole sample distribution is class-balanced. This encourages the classifier to give every class equal weight when building its model and to predict unseen samples correctly.

(3) For the case where minority-class data are sparse, the thesis studies the theory and method of improving classifier performance through oversampling based on negative selection. Compared with the majority class, the minority class not only has fewer samples but is also sparser, forming many isolated points or small clusters, so classifiers are easily biased toward the majority class. Drawing on the negative selection mechanism of the human immune system, the thesis combines non-self antigen-type detectors with outlier detection to learn the distribution of the whole data space, generates synthetic samples that follow the density distribution of the minority class, and enlarges the decision region of the minority class. Because the sample data are exploited as fully as possible, and a larger or denser decision region is produced in the minority-class space, the decision tree algorithm has enough information to build a tree that classifies unlabeled samples correctly.

(4) For the case of within-class cluster imbalance, the thesis studies the theory and method of improving classifier performance through shape-based oversampling. Imbalance is not only a matter of imbalance between classes: a class may contain many "small clusters", and imbalance among these clusters lowers prediction accuracy. Drawing on immune network theory and outlier detection, the thesis proposes a shape-based oversampling method (SBO). SBO uses a clustering algorithm to identify the "clusters" within a class, then builds an immune sub-network inside each cluster and uses the network nodes to augment the minority class. The thesis also addresses the dependence of the CURE algorithm on its input parameters by using the immune network to generate representative points in place of the vector mean used previously. In addition, SBO checks for "false clusters" introduced by the clustering algorithm and augments only the true clusters, which avoids the overfitting caused by duplicated samples. Because the oversampled data set is balanced both between and within classes, and the augmented data set has a spatial distribution similar to that of the original, the resulting decision tree can classify unlabeled samples correctly.
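The proliferation, mutation, and suppression loop named in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's ICOTE implementation, which the abstract does not specify. It assumes Gaussian mutation and a simple distance threshold for suppression, and the function name `immune_oversample` and all of its parameters (`n_clones`, `sigma`, `suppress_radius`, `max_rounds`) are hypothetical choices made for illustration.

```python
import numpy as np


def immune_oversample(X_min, n_new, n_clones=5, sigma=0.2,
                      suppress_radius=0.05, max_rounds=100, seed=0):
    """Return up to n_new synthetic minority points (illustrative only).

    sigma and suppress_radius are expressed in units of the per-feature
    standard deviation of the minority class (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    scale = X_min.std(axis=0) + 1e-12  # per-feature spread of the minority class
    accepted = []
    for _ in range(max_rounds):
        if len(accepted) >= n_new:
            break
        # Proliferation: every minority sample spawns n_clones copies.
        clones = np.repeat(X_min, n_clones, axis=0)
        # Mutation: perturb each clone with Gaussian noise scaled to the class spread.
        clones = clones + rng.normal(0.0, sigma, clones.shape) * scale
        # Suppression: reject a clone that lies within suppress_radius of an
        # already accepted point, so the synthetic set stays diverse.
        for c in clones:
            if len(accepted) >= n_new:
                break
            if all(np.linalg.norm((c - a) / scale) > suppress_radius
                   for a in accepted):
                accepted.append(c)
    return np.array(accepted).reshape(-1, X_min.shape[1])


# Usage: balance 10 minority samples against 100 majority samples.
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_min = rng.normal(3.0, 0.5, size=(10, 2))
X_syn = immune_oversample(X_min, n_new=len(X_maj) - len(X_min))
X_min_balanced = np.vstack([X_min, X_syn])
```

Because every synthetic point is a small perturbation of an existing minority sample, the augmented set stays close to the original minority region, which loosely mirrors the abstract's claim that no new clusters are created. A multi-class variant in the spirit of contribution (2) could call the same function once per minority class.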
[Degree-granting institution]: Soochow University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP301.6

Article ID: 2292894


Article link: http://sikaile.net/shoufeilunwen/xxkjbs/2292894.html
