Imbalanced Big Data Classification Algorithm Based on Spark
Published: 2023-04-28 19:09
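As the information age advances, data is generated ever faster. To gain an advantage ahead of others or to avert a crisis early, people have begun to focus on mining hidden information from existing data. However, some important information is not contained in the majority-class data and exists only in the minority class, for example cancer diagnoses and credit fraud. Identifying very rare classes in large datasets has therefore become a focus of current research. On imbalanced datasets, traditional algorithms are strongly biased toward the majority class and struggle to classify the minority class accurately. In practice, minority classes are often the more valuable and more representative ones. Researchers in China and abroad have studied the classification of imbalanced data extensively. The approaches fall into two main groups: processing the data itself, and improving the classification algorithm. Data-level methods mainly comprise oversampling and undersampling strategies, which respectively expand the minority class or remove part of the majority-class data so that the class sizes are balanced. Algorithm-level improvements mainly include changing the probability density, one-class learning, ensemble algorithms, and kernel methods. Data imbalance appears in two main forms. The first is between-class imbalance, where one class has markedly fewer samples than the others but the boundaries between classes are fairly clear. The second is within-class imbalance, where a single class contains several sub-classes that also overlap; this prevents the classifier from effectively distinguishing minority-class noise from minority-class subsets...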
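The two data-level strategies named in the abstract can be illustrated with a minimal sketch. This is not the thesis's improved SMOTE or its Spark implementation; it is a plain single-machine illustration of random undersampling and of SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. The function names, the toy data, and the parameters (`target_size`, `n_new`, `k`) are all hypothetical choices made for this example.

```python
import random


def random_undersample(majority, target_size, rng):
    """Undersampling: keep a random subset of the majority class."""
    return rng.sample(majority, target_size)


def smote_like_oversample(minority, n_new, k, rng):
    """Oversampling: create n_new synthetic minority points by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance),
        # excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy data (assumed for illustration): 100 majority points, 4 minority points.
rng = random.Random(0)
majority = [(float(i), float(i % 7)) for i in range(100)]
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

balanced_by_undersampling = random_undersample(majority, len(minority), rng)
synthetic_minority = smote_like_oversample(minority, 96, 2, rng)
balanced_by_oversampling = minority + synthetic_minority
```

Undersampling shrinks the majority class to the minority size, while oversampling grows the minority class to the majority size. Because each synthetic point lies on a segment between two existing minority points, it stays inside the region the minority class already occupies.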
[Pages]: 64
[Degree level]: Master's
[Table of contents]:
Acknowledgements
Abstract
1 Introduction
1.1 Research Background
1.2 Research status at home and abroad
1.2.1 Approach of processing data sets
1.2.2 Algorithm level approach
1.2.3 Computing framework
1.3 Thesis innovation
1.4 Structure of thesis
2 Related Work
2.1 Distributed Computing framework
2.1.1 Introduction to Spark
2.1.2 Resilient Distributed Datasets
2.2 Traditional imbalanced big data processing method and principle
2.2.1 The nature of data imbalance
2.2.2 Method of equalizing data
2.3 Algorithm introduction
2.3.1 SMOTE algorithm
2.3.2 SimHash algorithm
3 Specific algorithm design and implementation
3.1 Data description
3.2 Algorithm improvement and implementation process
3.2.1 SimHash algorithm improvement-dimensionality reduction
3.2.2 Improved SMOTE algorithm
3.2.3 KNN algorithm improvement
3.2.4 Model evaluation criteria
3.2.5 Implementation of KNN Algorithm Based on Hash Technology and Spark
4 Experimental results and analysis
4.1 Data Description
4.2 Experimental environment
4.3 Complete steps of the experimental design
4.4 Algorithm efficiency
4.5 Algorithm accuracy
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
Appendix A Abstract (in Chinese)
Article ID: 3804301
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/3804301.html