Improvement and Parallelization of the Random Forest Algorithm for Imbalanced Data

Published: 2018-10-25 19:30

[Abstract]: Random forest builds a forest in a randomized way. The forest consists of many decision trees, and the trees are independent of one another. Each tree is built from a random sample drawn with replacement, and the trees then vote to classify or predict. The algorithm overcomes the performance bottleneck of a single classifier and is therefore widely used. It still has room for improvement. To address the low efficiency of random forest on imbalanced data sets, this thesis proposes a new method for handling class imbalance. In addition, because the amount of computation grows exponentially, prediction speed and running time become a concern, so the thesis proposes parallelizing the algorithm based on the way the forest is constructed.

Drawing on a detailed review of the literature in China and abroad, the thesis optimizes random forest in two respects: data preprocessing, and parallelization on the MapReduce framework.

In data preprocessing, a new method is proposed. Random forest handles imbalanced data sets poorly, and SMOTE selects samples somewhat blindly and tends to produce marginal samples. To address both problems, the thesis combines K-means with SMOTE and proposes the K_SMOTE algorithm. Its main idea is to first use K-means to find the center of the original negative class, then use SMOTE to generate a "new negative class", replace all negative-class samples in the original data set with the "new negative class", and finally apply SMOTE again to obtain the "new data set". Experimental results show that this method improves the classification performance of random forest.
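
The K_SMOTE steps named in the abstract can be illustrated with a small sketch. The abstract does not give the details, so everything below is an assumption made for illustration: a single cluster is used (so the K-means center reduces to the class mean), the "new negative class" is formed by interpolating between that center and each original negative sample, and the function names (`centroid`, `interpolate`, `k_smote_sketch`) and parameters are hypothetical. It also assumes the negative class is the under-represented one. It is not the thesis's implementation.

```python
import random


def centroid(points):
    """Mean of the points; equals the K-means center when k = 1 (assumption)."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def interpolate(a, b, rng):
    """SMOTE-style synthetic point at a random position on the segment a-b."""
    gap = rng.random()
    return [x + gap * (y - x) for x, y in zip(a, b)]


def nearest_neighbors(point, pool, k):
    """The k points in pool closest to point, excluding point itself."""
    others = [q for q in pool if q is not point]
    others.sort(key=lambda q: sum((x - y) ** 2 for x, y in zip(point, q)))
    return others[:k]


def k_smote_sketch(negatives, n_target, k=3, seed=0):
    """Return at least n_target negative samples built in two passes."""
    if len(negatives) < 2:
        raise ValueError("need at least two negative samples")
    rng = random.Random(seed)
    center = centroid(negatives)
    # Pass 1 (assumed reading of "new negative class"): replace every
    # original negative by a point between the class center and itself.
    new_negatives = [interpolate(center, p, rng) for p in negatives]
    # Pass 2: ordinary SMOTE among the replaced negatives until the
    # requested number of samples is reached.
    samples = list(new_negatives)
    while len(samples) < n_target:
        base = rng.choice(new_negatives)
        neighbor = rng.choice(nearest_neighbors(base, new_negatives, k))
        samples.append(interpolate(base, neighbor, rng))
    return samples
```

Under these assumptions every generated point is a convex combination of original negatives, so the synthetic samples stay inside the region spanned by the original class rather than drifting toward its edge.
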
On the parallelization side, data volumes in modern society are growing exponentially, so classifying with random forest takes a great deal of time and classification performance also suffers. Because the individual decision trees of a random forest are built independently of one another, the thesis combines this property with MapReduce, the distributed framework of the Hadoop platform, and proposes a parallel random forest based on MapReduce. The main idea of MapReduce is divide and conquer: a complex problem is split into a number of identical subproblems, each of which is much easier to solve. For random forest, divide and conquer means building the individual decision trees in parallel and then combining the finished trees to vote. Experimental results show that the parallel random forest improves both running time and efficiency.
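
The divide-and-conquer structure described above (independent tree construction, then a vote) can be sketched in a few lines. This is only an illustration of the map and reduce roles under stated assumptions: a local thread pool stands in for Hadoop MapReduce, each "tree" is a depth-1 stump on one randomly chosen feature, and the names `train_stump`, `train_forest`, and `forest_predict` are hypothetical. It does not reproduce the thesis's Hadoop implementation.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(task):
    """Map step: fit one depth-1 tree on a bootstrap sample."""
    X, y, seed = task
    rng = random.Random(seed)
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    feature = rng.randrange(len(X[0]))          # one random feature per stump
    best = None
    for i in idx:
        threshold = X[i][feature]
        left = [y[j] for j in idx if X[j][feature] <= threshold]
        right = [y[j] for j in idx if X[j][feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(X, y, n_trees=25, workers=4):
    """Build the stumps independently; each task needs no shared state."""
    tasks = [(X, y, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, tasks))


def forest_predict(stumps, row):
    """Reduce step: majority vote over the predictions of all stumps."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in stumps
    )
    return votes.most_common(1)[0][0]
```

Because each task carries its own seed and data reference, the same `train_stump` function could in principle be run by separate workers on separate machines. In CPython a thread pool gives no real speedup for this CPU-bound work, so the sketch shows only the structure, not the performance gain reported in the abstract.
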
[Degree-granting institution]: Guangdong University of Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13

Article ID: 2294610

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2294610.html
