Imbalanced Big Data Classification Algorithm Based on Spark

Published: 2023-04-28 19:09

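As the information age advances, data is generated at an ever faster rate. To gain an advantage ahead of others or to avert a crisis early, people have begun mining existing data for hidden information they can put to use. However, some important information is not contained in the majority-class data; it exists only in the minority class, as in cancer diagnosis or credit fraud. Identifying very small data classes within large datasets has therefore become a focus of current research. On imbalanced datasets, traditional algorithms are strongly biased toward the majority class and struggle to classify the minority class accurately. Yet in practice, the minority class is often the more valuable and more representative one. Considerable research has been done in China and abroad on classifying imbalanced data. The methods fall into two main groups: those that process the data itself and those that improve the classification algorithm. Data-level methods mainly comprise over-sampling and under-sampling strategies, which respectively expand the minority class or remove part of the majority-class data so that the class sizes are balanced. Algorithm-level improvements mainly include changing the probability density, one-class learning, ensemble algorithms, and kernel methods. Data imbalance shows up in two main ways. The first is between-class imbalance: one class has markedly fewer samples than the others, but the boundaries between classes are fairly clear. The second is within-class imbalance: a single class contains several sub-classes that also overlap, which prevents the classifier from effectively distinguishing minority-class noise from minority-class subsets...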
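The two data-level strategies named in the abstract can be illustrated with a minimal sketch. It uses plain random over-sampling and random under-sampling in standard-library Python, not the thesis's improved SMOTE or its Spark implementation; the function names `random_oversample` and `random_undersample` and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Expand smaller classes by duplicating random members
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


def random_undersample(samples, labels, seed=0):
    """Remove random members of larger classes
    until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, target))
    return balanced


# Hypothetical toy data: 8 majority-class samples (0), 2 minority-class samples (1).
toy_samples = list(range(10))
toy_labels = [0] * 8 + [1] * 2
print(Counter(y for _, y in random_oversample(toy_samples, toy_labels)))
print(Counter(y for _, y in random_undersample(toy_samples, toy_labels)))
```

Over-sampling keeps every original sample but repeats minority samples, while under-sampling discards majority samples; SMOTE, listed in the table of contents below, instead synthesizes new minority samples by interpolating between neighbors.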

[Pages]: 64

[Degree level]: Master's

[Table of contents]:
Acknowledgements
Abstract
1 Introduction
    1.1 Research Background
    1.2 Research status at home and abroad
        1.2.1 Approach of processing data sets
        1.2.2 Algorithm level approach
        1.2.3 Computing framework
    1.3 Thesis innovation
    1.4 Structure of thesis
2 Related Work
    2.1 Distributed Computing framework
        2.1.1 Introduction to Spark
        2.1.2 Resilient Distributed Datasets
    2.2 Traditional imbalanced big data processing method and principle
        2.2.1 The nature of data imbalance
        2.2.2 Method of equalizing data
    2.3 Algorithm introduction
        2.3.1 SMOTE algorithm
        2.3.2 SimHash algorithm
3 Specific algorithm design and implementation
    3.1 Data description
    3.2 Algorithm improvement and implementation process
        3.2.1 SimHash algorithm improvement-dimensionality reduction
        3.2.2 Improved SMOTE algorithm
        3.2.3 KNN algorithm improvement
        3.2.4 Model evaluation criteria
        3.2.5 Implementation of KNN Algorithm Based on Hash Technology and Spark
4 Experimental results and analysis
    4.1 Data Description
    4.2 Experimental environment
    4.3 Complete steps of the experimental design
    4.4 Algorithm efficiency
    4.5 Algorithm accuracy
5 Conclusion and Future Work
    5.1 Conclusion
    5.2 Future Work
References
Appendix A Abstract (in Chinese)

Article ID: 3804301

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/3804301.html
