Research on Random Forest Classification Algorithms Based on the Spark Distributed Platform

Published: 2018-03-25 23:15

  Topic: high-dimensional big data. Focus: classification. Source: Civil Aviation University of China, master's thesis, 2017


[Abstract]: The rapid development of information technology and networks has produced large volumes of high-dimensional, complex data. Classifying these data effectively in order to extract valuable information is a problem of great significance. Random forest is an important classification algorithm: it tolerates noise and outliers well and lends itself to parallelization. The original random forest algorithm and most of its improved variants run on a single machine, and when they face large volumes of high-dimensional, complex data, neither their running time nor their memory footprint meets practical needs. Spark is an efficient distributed computing framework that offers parallel computation combining performance with speed, which makes it an effective way to address this problem. In addition, many features of high-dimensional data carry little information and correlate only weakly with the class label, which lowers the classification accuracy of random forest. This thesis therefore improves the random forest algorithm on the Spark platform so that high-dimensional data can be classified more effectively in the big data era.

First, when random forest combines its decision trees and makes a classification decision, it cannot treat individual trees differently, so trees with weak classification ability degrade the overall performance of the algorithm. To address this, a weighted-tree random forest algorithm is proposed and implemented on Spark. Its weighted-tree ensemble strategy strengthens the influence of strong trees on the classification decision and weakens the influence of weak trees, improving the classification ability of the forest as a whole. Experimental results show that, compared with the original random forest algorithm, the proposed algorithm achieves higher classification accuracy, scales well, and classifies high-dimensional big data effectively.

Second, when random forest generates a feature subspace at a node, simple random sampling often yields a subspace containing many features with weak classification ability, which hurts classification performance. To address this, the way stratified subspaces are constructed is improved, and a stratified-subspace random forest algorithm is proposed and implemented on Spark. The improved construction both preserves the correctness of the feature stratification and reduces its computational cost, making it suitable for high-dimensional big data. Experimental results confirm that the proposed algorithm classifies high-dimensional big data effectively; compared with the original random forest algorithm, it achieves higher classification accuracy and better generalization, and it scales well.

Finally, the weighted-tree and stratified-subspace random forest algorithms are applied to flight delay prediction. After a detailed analysis of the features in the data set, the data are preprocessed with min-max normalization and a division of delays into levels. Experiments confirm that both algorithms can effectively classify and predict the delay level of flight delays.
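The two ideas in the abstract that are concrete enough to illustrate are weighted voting across trees and min-max normalization. The sketch below is plain Python, not Spark code, and it is not the thesis's implementation. The abstract does not say how tree weights are computed, so the weights here are simply assumed to be given (for example, each tree's accuracy on held-out data); the function names `weighted_vote` and `min_max_normalize` are hypothetical.

```python
from collections import defaultdict


def weighted_vote(tree_predictions, tree_weights):
    """Return the label whose supporting trees have the largest total weight.

    tree_predictions: one predicted label per tree.
    tree_weights: one non-negative weight per tree (assumed to be given,
    e.g. a per-tree accuracy estimate; the thesis's weighting rule is
    not stated in the abstract).
    """
    if len(tree_predictions) != len(tree_weights):
        raise ValueError("one weight per tree is required")
    totals = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] += weight
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(totals), key=lambda label: totals[label])


def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Three trees: one strong tree says "delayed", two weak trees say "on_time".
predictions = ["delayed", "on_time", "on_time"]
equal_weights = [1.0, 1.0, 1.0]      # ordinary majority voting
accuracy_weights = [0.9, 0.3, 0.4]   # assumed per-tree weights

majority_label = weighted_vote(predictions, equal_weights)
weighted_label = weighted_vote(predictions, accuracy_weights)
scaled = min_max_normalize([10, 20, 30])
```

With equal weights the two weak trees outvote the strong one, while with the assumed accuracy weights the strong tree's vote (0.9) outweighs the combined weak votes (0.7). That is the effect the weighted-tree ensemble strategy is described as aiming for.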
[Degree-granting institution]: Civil Aviation University of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181

