
Research on Imbalanced Data Classification Techniques for Internet Applications

Published: 2018-06-28 22:07

  Topic: Internet applications + imbalanced data; Source: National University of Defense Technology, 2016 doctoral dissertation


[Abstract]: The rapid development of the Internet, and in particular of Internet applications such as online news, e-mail, and e-commerce, has made information easier to obtain but has also drowned people in a sea of information. Automatically classifying massive Internet application data can effectively improve the efficiency with which people obtain information and, in turn, the efficiency of their decisions. In much Internet application data, however, one or more classes have markedly fewer examples than the others, forming so-called imbalanced data: reactionary news versus normal news, spam versus legitimate e-mail, abnormal versus normal transactions, and so on. Traditional classification methods and evaluation strategies, designed under the assumption of a uniform class distribution, usually optimize overall accuracy and tend to neglect the minority classes. In practice, the minority classes are often what people care about most: network supervision authorities want to identify reactionary news, e-mail providers want to detect spam more reliably, and e-commerce platforms want to detect abnormal transactions. The continuous arrival of Internet application data and the imbalance of its class distribution make accurate classification difficult. Research on imbalanced data classification for Internet applications therefore has strong practical significance and social value.

Starting from the characteristics of Internet application data and the actual needs of the projects undertaken, and proceeding from simple to complex, this dissertation designs processing algorithms for different types of Internet application data. It first considers the common two-class imbalanced setting and, in view of its characteristics and practical requirements, studies noise filtering strategies and data resampling methods for preprocessing imbalanced data. It then extends this work to multi-class imbalanced data (more than two classes, with each example belonging to exactly one class) and proposes a method that combines decomposition strategies with data resampling. Next, it extends these results to multi-label imbalanced classification (unlike the multi-class case, one example may belong to several classes) and designs a new ensemble learning framework and base classification algorithm. Finally, because Internet application data arrives continuously, it studies a multi-window learning strategy for imbalanced data streams:

(1) For preprocessing two-class imbalanced data, an improved IPF-based noise filtering method is first proposed to address noise that may be present in imbalanced data sets, reducing as far as possible the chance that minority-class examples are misjudged as noise during filtering. Then, based on the respective characteristics of minority-class and majority-class examples, a minority-class oversampling algorithm based on nearest-neighbor distribution and a majority-class undersampling algorithm based on distance ranking are designed. On this basis, and driven by practical application needs, an adaptive method for setting the sampling ratio between the minority and majority classes is designed, which reduces the impact of resampling on subsequent processing. Finally, tests on a large number of real data sets verify the effectiveness of the proposed methods, with a particularly clear improvement in classification performance on the minority classes;

(2) For multi-class imbalanced classification, a divide-and-conquer learning strategy is proposed to handle the multi-class nature of Internet application data. The training data is first decomposed with the one-vs-all (OVA) method, and multiple sub-classifiers are trained. At this stage every sub-classifier is trained on data from all classes, which ensures the adaptability of the sub-classifiers. The one-vs-one (OVO) method is then used to further partition the example sets of the candidate classes; at this stage, whether to resample is decided by the class distribution of each resulting subset. Finally, finer-grained sub-classifiers are trained on the resampled subsets. In addition, according to practical application needs, separate handling strategies are designed for the cases where the sub-classifier outputs are discrete and where they are continuous. Building on a theoretical analysis, the proposed method is tested on multiple real data sets, and the results show that it effectively handles the imbalance present in multi-class data;

(3) For multi-label imbalanced classification, existing methods emphasize multi-label decomposition and give little consideration to imbalance in the label distribution. To address this, a multi-label imbalanced data ensemble learning framework is proposed and a corresponding base classification algorithm is designed. Based on AdaBoost, the framework integrates the imbalance of the label distribution into the training of each sub-classifier. In addition, building on the multi-label neural network method BPMLL, an improved algorithm for multi-label imbalanced data is designed and used as the base classification algorithm of the ensemble learning framework. Tests of classification performance on multiple real application data sets demonstrate the effectiveness of the proposed method;

(4) For imbalanced data stream classification, an ensemble learning method based on a multi-window mechanism is proposed to handle the dynamic nature of Internet application data streams and the uncertainty in the order in which examples of each class arrive. Based on the characteristics of imbalanced data streams, the method defines four windows, which hold the current sliding-window data, the most recent minority-class examples, the selected sub-classifiers, and the historical window data corresponding to each sub-classifier. A different update strategy is designed for each window. For a new test example, the class label is determined by weighted majority voting. Tests on multiple synthetic and real data sets show that the method is both more effective and more efficient.

In summary, this dissertation proposes effective solutions for the different classification needs of different types of data in Internet applications, especially the problem of imbalanced class distributions, and verifies the effectiveness of the proposed methods through experiments on real data sets from different domains and on synthetic data sets. The work has theoretical significance and practical value for advancing the classification of Internet application data.
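The abstract's point that optimizing overall accuracy tends to neglect the minority class can be illustrated with a small worked example. The following is a minimal sketch, not code from the dissertation: the function names, the spam/normal data, and the choice of recall, F1, and G-mean as minority-aware metrics are all assumptions made here for illustration.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive):
    """Count (tp, fp, fn, tn), treating `positive` as the minority class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def imbalance_report(y_true, y_pred, positive):
    """Return overall accuracy alongside minority-aware metrics."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # minority-class recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # majority-class recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = sqrt(recall * specificity)  # geometric mean of the two recalls
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "g_mean": g_mean}


# Hypothetical mailbox: 990 normal e-mails and 10 spam e-mails.
y_true = ["normal"] * 990 + ["spam"] * 10

# Predictor A labels everything "normal" and never finds any spam.
always_normal = ["normal"] * 1000
baseline = imbalance_report(y_true, always_normal, positive="spam")

# Predictor B raises 30 false alarms but catches 8 of the 10 spam e-mails.
flags_spam = ["spam"] * 30 + ["normal"] * 960 + ["spam"] * 8 + ["normal"] * 2
detector = imbalance_report(y_true, flags_spam, positive="spam")

print("always-normal:", baseline)
print("spam detector:", detector)
```

Predictor A reaches 0.99 accuracy while finding no spam at all, whereas predictor B scores a lower 0.968 accuracy but recovers 80% of the minority class. This is why metrics such as minority recall, F1, or G-mean are commonly reported for imbalanced problems instead of accuracy alone.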
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP393.09

Article ID: 2079537


Link to this article: http://sikaile.net/jingjilunwen/dianzishangwulunwen/2079537.html

