Research on an Improved Random Forest Algorithm Based on Spark

Published: 2018-05-09 10:38

  Topics: random forest + classification accuracy; Source: Taiyuan University of Technology, 2017 master's thesis


[Abstract]: Random forest is a machine learning algorithm with strong classification performance. It handles large-scale data sets well, copes with data sets of several thousand attributes, has few parameters to tune, and is resistant to overfitting. As a result, it has been widely applied in many fields and has attracted a large number of researchers, whose work on improving it has been fruitful. However, the way the traditional random forest algorithm builds its model has two weaknesses. First, the decision trees it generates vary widely in classification performance. Second, the decision trees are correlated with one another. Trees with poor classification performance, and trees that are strongly correlated with each other, both degrade the overall classification performance of the forest.

To address these two weaknesses, this thesis proposes an improved random forest algorithm based on classification accuracy and similarity. The algorithm first uses the AUC value as the performance measure to evaluate each decision tree in the forest, and keeps the trees whose AUC is above a set threshold. It then computes the similarity between the retained trees to obtain a similarity matrix. Because highly similar trees are also highly correlated, the algorithm clusters the trees according to the similarity matrix and a similarity criterion. Finally, the tree with the highest AUC value in each cluster is chosen as that cluster's representative, and the representatives together form the new random forest model.

Tests on UCI data sets including heart disease, breast cancer, Pima Indians diabetes, and Indian liver disease show that the improved random forest algorithm, based on classification accuracy and correlation, achieves some improvement in classification accuracy over the traditional random forest algorithm.

The improved algorithm was first implemented on the MATLAB platform, and experiments on the four UCI data sets compared its classification accuracy with that of the traditional algorithm. Because the improved algorithm adds two optimization steps, its classification speed is lower than that of the traditional algorithm. In addition, a single-machine MATLAB platform processes and iterates over larger data sets very slowly. The improved algorithm was therefore also implemented on the Spark platform, which considerably increased its classification speed.
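
The selection procedure described in the abstract (filter by AUC, compute a similarity matrix, cluster, keep the best tree per cluster) can be illustrated with a minimal Python sketch. This is not the thesis implementation. It assumes each tree is represented only by its 0/1 predictions on a held-out validation set, uses the fraction of agreeing predictions as a stand-in similarity measure (the abstract does not say which measure is used), and treats clusters as groups of trees linked by similarity at or above a threshold. The names `prune_forest`, `agreement`, `auc_threshold`, and `sim_threshold` are hypothetical.

```python
from itertools import combinations


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def agreement(a, b):
    """Assumed similarity measure: fraction of validation samples on which two trees agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def prune_forest(tree_preds, labels, auc_threshold=0.7, sim_threshold=0.9):
    """Return the indices of the trees kept for the new forest.

    tree_preds[i] holds tree i's 0/1 predictions on a held-out validation set.
    """
    # Step 1: keep only trees whose AUC is above the threshold.
    aucs = {i: auc(p, labels) for i, p in enumerate(tree_preds)}
    kept = [i for i in aucs if aucs[i] > auc_threshold]

    # Step 2: pairwise similarity (the similarity matrix) among the kept trees.
    sim = {(i, j): agreement(tree_preds[i], tree_preds[j])
           for i, j in combinations(kept, 2)}

    # Step 3: cluster by linking trees that are similar enough (union-find).
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), s in sim.items():
        if s >= sim_threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in kept:
        clusters.setdefault(find(i), []).append(i)

    # Step 4: each cluster is represented by its highest-AUC tree.
    return sorted(max(members, key=lambda i: aucs[i]) for members in clusters.values())


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]
    preds = [
        [1, 1, 1, 0, 0, 0],  # perfect tree
        [1, 1, 1, 0, 0, 1],  # one false positive
        [1, 0, 1, 0, 1, 0],  # weak tree, removed by the AUC filter
        [0, 1, 1, 0, 0, 0],  # one false negative
    ]
    print(prune_forest(preds, labels, auc_threshold=0.7, sim_threshold=0.8))
```

Lowering `sim_threshold` merges more trees into each cluster and so yields a smaller forest, while raising `auc_threshold` discards more trees before clustering. Both values would need tuning per data set.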
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181

