Research on Imbalanced Data Classification Methods Based on the Immune System
[Abstract]: With the development of cloud computing and mobile technology, the Internet has entered the age of big data. People face a rapid expansion of multimedia information, which demands effective content management and fast information retrieval. Classification algorithms are widely used in computer vision, text recognition, speech recognition, document classification and related fields. Classification algorithms that learn from labeled data, such as naive Bayes, logistic regression, support vector machines and decision trees, are mature. However, these algorithms depend on the size of the data set: according to learning theory, accuracy rises above the critical point only when the sample size exceeds a prescribed lower bound. Meanwhile, imbalanced data sets are common in real life, and people care more about the minority-class samples, whose misclassification carries a greater cost. To resolve this contradiction, this thesis studies imbalanced data classification based on the immune system. Drawing on the principles and characteristics of the human immune system, we study and address two-class imbalanced data classification, multi-class imbalanced data classification, imbalanced data classification under sparse density, and imbalanced data classification under imbalance among clusters. The main work and contributions are as follows: (1) For the two-class imbalanced setting, we study the theory and methods for improving classifier performance through oversampling based on immune centroids.
In two-class learning, the majority (negative) class has far more samples than the minority (positive) class, and standard classification learning algorithms tend to favor the majority class, so the misclassification rate of the minority class is significantly higher than that of the majority class. This thesis proposes an immune centroid-based oversampling technique (ICOTE). Following immune network principles such as proliferation, mutation and suppression, it generates immune centroids to expand the minority samples and thereby balance the class distribution. The immune centroids reflect the distribution characteristics of the minority class, and the expanded sample set preserves the shape of the original samples, which prevents new clusters from forming. ICOTE thus avoids overfitting while overcoming the failure of random synthetic sampling methods to take the distribution of the sample space into account. (2) For the multi-class imbalanced setting, we study the theory and methods for improving classifier performance through oversampling based on multiple immune sub-networks. Compared with two-class learning, multi-class learning faces new problems such as a large search space, high algorithmic complexity and overlapping class regions, so two-class methods cannot simply be carried over to multi-class problems. At the same time, the imbalance problem becomes more pronounced, and overlap among several minority-class regions is more common, which leads traditional classification algorithms to ignore the minority classes and favor lowering the error rate of the majority classes.
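As a rough illustration of the clone, mutate and suppress loop behind contribution (1), the sketch below generates synthetic minority points near existing minority samples. It is not the ICOTE algorithm itself: the function name `immune_centroid_oversample`, the affinity measure (mean distance to all minority samples), and the parameters `n_clones`, `sigma`, `suppress_radius` and `max_iter` are assumptions made for this example only.

```python
import numpy as np


def immune_centroid_oversample(X_min, n_new, n_clones=5, sigma=0.1,
                               suppress_radius=0.05, max_iter=10_000, seed=0):
    """Return up to n_new synthetic minority points (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Per-feature spread, used to scale the mutation noise.
    scale = X_min.std(axis=0) + 1e-12
    centroids = []
    for _ in range(max_iter):
        if len(centroids) >= n_new:
            break
        # Pick a minority sample to act as the antigen.
        antigen = X_min[rng.integers(len(X_min))]
        # Cloning and mutation: noisy copies of the antigen.
        noise = rng.normal(0.0, sigma, size=(n_clones, X_min.shape[1]))
        clones = antigen + noise * scale
        # Assumed affinity: a smaller mean distance to the minority
        # samples means a better clone.
        dists = np.linalg.norm(clones[:, None, :] - X_min[None, :, :], axis=2)
        best = clones[dists.mean(axis=1).argmin()]
        # Suppression: reject candidates too close to an accepted centroid.
        if centroids:
            gap = np.linalg.norm(np.asarray(centroids) - best, axis=1).min()
            if gap < suppress_radius:
                continue
        centroids.append(best)
    return np.asarray(centroids).reshape(-1, X_min.shape[1])
```

The returned points would be appended to the minority class before training. The suppression step keeps the synthetic points from piling up on a few locations, which loosely mirrors the stated goal of avoiding duplicated samples.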
This thesis proposes a global oversampling method based on immune centroids (Global-IC). It uses immune network principles to generate an immune sub-network in each small region, and the network nodes are used to expand the minority samples until the whole sample distribution is class-balanced. The classification algorithm can then build a model in which every class carries the same weight, so that unknown samples are predicted correctly. (3) For sparse, low-density data, we study the theory and methods for improving classifier performance through oversampling based on negative selection. Compared with the majority-class space, the minority-class space has few samples and sparse data, forming many isolated points or small clusters, so classification algorithms are easily biased toward the majority class. Based on the negative selection mechanism of the human immune system, this thesis proposes a method that combines non-self antigen detectors with outlier detection and studies the distribution characteristics of the whole data space. Because the sample data is used as fully as possible, larger or denser decision regions are generated in the minority-class space, giving the decision tree classification algorithm sufficient classification information, so the resulting decision tree can correctly classify unlabeled samples. (4) For imbalance among clusters within a class, we study the theory and methods for improving classifier performance through shape-based oversampling. Imbalance is not only an imbalance between classes: a class often contains several internal "clusters", and imbalance among these clusters lowers prediction accuracy. Based on immune network principles and outlier detection, this thesis proposes the shape-based oversampling method (SBO), which first identifies the "clusters" and then constructs an immune sub-network within each cluster, whose network nodes are used to expand the minority samples. We also study the dependence of the CURE algorithm on its input parameters, using the immune network to generate representative points that replace the former vector mean. In addition, the cluster-checking step of SBO introduces the notion of "false clusters" and expands the sample size only for real clusters, avoiding the overfitting caused by duplicated samples. Because the oversampled data set is balanced both between and within classes, and the expanded data set has a spatial distribution similar to that of the original, the generated decision tree can correctly classify unlabeled samples.
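The cluster-aware idea in contribution (4) can be sketched roughly as follows, under simplifying assumptions. A distance-threshold grouping stands in for CURE, linear interpolation between two members of the same cluster stands in for immune sub-network nodes, and any group smaller than `min_size` is treated as a "false cluster" and left unexpanded. The names `threshold_clusters` and `cluster_balanced_oversample` and all parameters are hypothetical.

```python
import numpy as np


def threshold_clusters(X, radius):
    """Label connected groups; points closer than `radius` are linked."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            # Unlabeled neighbours of point i join the current group.
            for j in np.flatnonzero((dist[i] < radius) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels


def cluster_balanced_oversample(X_min, radius, min_size=3,
                                target_per_cluster=None, seed=0):
    """Expand each real minority cluster up to a common size (sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    labels = threshold_clusters(X_min, radius)
    ids, counts = np.unique(labels, return_counts=True)
    size_floor = max(min_size, 2)  # interpolation needs two members
    real = ids[counts >= size_floor]  # smaller groups: "false clusters"
    if len(real) == 0:
        return np.empty((0, X_min.shape[1]))
    if target_per_cluster is None:
        target_per_cluster = counts[counts >= size_floor].max()
    synthetic = []
    for c in real:
        members = X_min[labels == c]
        for _ in range(max(0, int(target_per_cluster) - len(members))):
            a, b = members[rng.choice(len(members), size=2, replace=False)]
            # The new point lies on the segment between two cluster members.
            synthetic.append(a + rng.random() * (b - a))
    return np.asarray(synthetic).reshape(-1, X_min.shape[1])
```

Because every synthetic point is interpolated inside one real cluster, the expanded set stays close to the original spatial distribution, and isolated points do not seed new regions.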
[Degree-granting institution]: Soochow University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP301.6
Article ID: 2292894
Article link: http://sikaile.net/shoufeilunwen/xxkjbs/2292894.html