Research on Imbalanced Data Classification Techniques for Internet Applications
Published: 2018-06-28 22:07
Topics: Internet applications + imbalanced data; Source: National University of Defense Technology, 2016 doctoral dissertation
[Abstract]: The rapid development of the Internet, and especially of Internet applications such as online news, e-mail, and e-commerce, has made information easier to obtain but has also drowned people in a sea of information. Automatically classifying massive volumes of Internet application data can effectively improve the efficiency with which people obtain information, and in turn the efficiency of their decisions. In many Internet application datasets, however, one or more classes contain markedly fewer examples than the others, producing so-called imbalanced data: reactionary news versus normal news, spam versus normal mail, abnormal transactions versus normal transactions, and so on. Traditional classification methods and evaluation strategies, designed under the assumption of a uniform class distribution, usually take overall accuracy as the optimization objective and therefore tend to neglect the minority classes. In practice, people often care more about the minority classes: network supervision departments want to identify reactionary news, mail service providers want to identify spam more reliably, and e-commerce platforms want to detect abnormal transactions. The continuous arrival of Internet application data and the imbalance of its class distribution make accurate classification difficult. Research on imbalanced data classification techniques for Internet applications therefore has strong practical significance and social value.

Starting from the characteristics of Internet application data and the practical needs of the projects undertaken, this thesis proceeds from simple to complex and designs processing algorithms for different types of Internet application data. It begins with the common two-class imbalanced setting and, based on its characteristics and practical requirements, studies noise filtering strategies and data resampling methods for preprocessing imbalanced data. It then extends this work to the multi-class imbalanced setting (more than two classes, with each example belonging to exactly one class) and proposes a method that combines decomposition strategies with data resampling. Next, it extends these results to multi-label imbalanced data classification (unlike the multi-class case, one example may belong to several classes) and designs a new ensemble learning framework and base classification algorithm. Finally, in light of the continuous arrival of Internet application data, it studies a multi-window learning strategy on imbalanced data streams. The four contributions are as follows.

(1) Preprocessing of two-class imbalanced data. To handle noise that may be present in imbalanced datasets, an improved noise filtering method based on IPF is proposed, which aims to minimize the chance that minority-class examples are mistaken for noise during filtering. Then, according to the respective characteristics of minority-class and majority-class examples, a minority-class oversampling algorithm based on nearest-neighbor distribution and a majority-class undersampling algorithm based on distance ranking are designed. On this basis, and in response to practical requirements, an adaptive method for setting the sampling ratio between the minority and majority classes is designed, reducing the impact of resampling on subsequent processing. Tests on a large number of real datasets verify the effectiveness of the proposed methods, with a particularly clear improvement in minority-class classification.

(2) Multi-class imbalanced data classification. To address the multi-class nature of Internet application data, a divide-and-conquer learning strategy is proposed. First, the one-vs-all (OVA) method is used to decompose the training data and train multiple sub-classifiers. At this stage every sub-classifier is trained on data from all classes, which ensures the adaptability of the sub-classifiers. Next, the one-vs-one (OVO) method is used to further partition the example sets of the candidate classes; at this stage, the class distribution of each resulting subset determines whether data resampling is performed. Finally, finer-grained sub-classifiers are trained on the resampled subsets. In addition, according to practical requirements, separate handling strategies are designed for sub-classifiers with discrete outputs and with continuous outputs. Following a theoretical analysis, the proposed method is tested on several real datasets, and the results show that it can effectively handle imbalance in multi-class data.

(3) Multi-label imbalanced data classification. Existing methods emphasize multi-label decomposition and give little consideration to the imbalance of the label distribution. To address this, an ensemble learning framework for multi-label imbalanced data is proposed, together with a corresponding base classification algorithm. Built on AdaBoost, the framework incorporates the imbalance of the label distribution into the training of each sub-classifier. In addition, an improved algorithm for multi-label imbalanced data is designed on the basis of the multi-label neural network method BPMLL and used as the base classification algorithm of the ensemble framework. Tests of classification performance on several real application datasets demonstrate the effectiveness of the proposed method.

(4) Imbalanced data stream classification. To address the dynamic nature of Internet application data streams and the uncertainty in the arrival order of examples from each class, an ensemble learning method based on a multi-window mechanism is proposed. Based on the characteristics of imbalanced data streams, the method defines four windows that store, respectively, the data in the current sliding window, the most recent minority-class examples, the selected sub-classifiers, and the historical window data corresponding to those sub-classifiers. A different update strategy is designed for each window. The class label of a new test example is determined by weighted majority voting. Tests on several synthetic and real datasets indicate that the method is both more effective and more efficient.

In summary, this thesis proposes effective solutions for the different classification requirements of different types of data in Internet applications, in particular for the class imbalance they exhibit, and verifies the effectiveness of the proposed methods through experiments on real datasets from different domains and on synthetic datasets. The work has theoretical significance and practical value for advancing the classification of Internet application data.
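The abstract's point that methods tuned for overall accuracy tend to neglect minority classes can be illustrated with a minimal sketch. The 95/5 class split, the always-predict-majority baseline, and the helper names `accuracy` and `recall` are assumptions chosen for illustration; they are not data or methods from the thesis.

```python
# Illustrative only: a synthetic 95/5 label split and a trivial
# majority-class predictor; neither comes from the thesis.
y_true = [0] * 95 + [1] * 5      # 0 = majority class, 1 = minority class
y_pred = [0] * len(y_true)       # baseline that always predicts the majority class


def accuracy(y_true, y_pred):
    """Fraction of all examples that are labeled correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the examples of class `positive` that are recovered."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    total = sum(t == positive for t in y_true)
    return hits / total if total else 0.0


overall_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(f"accuracy = {overall_accuracy:.2f}, minority recall = {minority_recall:.2f}")
# prints: accuracy = 0.95, minority recall = 0.00
```

The baseline scores 95% accuracy while recovering none of the minority examples, which is why work on imbalanced data usually reports per-class measures such as minority-class recall alongside, or instead of, overall accuracy.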
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP393.09
Article ID: 2079537
Article link: http://sikaile.net/jingjilunwen/dianzishangwulunwen/2079537.html