Double-Weighted Random Forest Prediction Algorithm and Its Parallelization
Published: 2018-12-21 20:03
【Abstract】: With the development of science and technology, the era of big data has arrived, and data volumes are growing explosively. Big data poses major challenges to traditional machine learning methods; the random forest algorithm has attracted wide attention for its strong performance. Because big data is massive, complex, diverse, and fast-changing, it raises two problems. First, machine learning algorithms run for a long time and cannot return results within an acceptable period. Second, data dimensionality is high and redundancy is large, so traditional random forest regression cannot achieve satisfactory results. To address these problems, this thesis studies improvements to traditional random forest regression and their parallelization.

For the problem of high dimensionality and redundancy, prior work has proposed replacing the uniform random feature sampling of the traditional random forest with weighted feature sampling. Our analysis shows, however, that most related studies target classification; regression is rarely discussed, and many classification-oriented methods cannot be applied to regression directly. Moreover, almost all feature-weighting methods implicitly assume the features are mutually independent, which is often not the case in practice. This thesis therefore adopts, for regression, a feature-weighting algorithm that takes relationships between features into account, and uses two methods for feature sampling. Further analysis shows that replacing uniform feature sampling with weighted sampling improves the accuracy of the individual classification and regression tree (CART) models but also increases the correlation between them, reducing ensemble diversity and thus potentially harming the overall performance of random forest regression.

To address this, the thesis proposes a double-weighted random forest regression algorithm: in addition to weighting features to improve CART accuracy, it also weights the generated CART models, aiming to balance tree accuracy and tree diversity through the two sets of weights and thereby improve the final predictive performance. To compute the model weights, the thesis proposes two new methods that jointly consider tree accuracy and inter-tree diversity: forward search with replacement, and a diversity-based computation. The two model-weighting methods are combined pairwise with the two feature-sampling methods into four double-weighted random forest regression algorithms, whose effectiveness is evaluated experimentally. Finally, to address the long running times of machine learning algorithms in big data environments, the thesis designs and implements a parallel version of the double-weighted random forest regression algorithm and evaluates the parallelization experimentally.
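The abstract's core idea, two layers of weights, can be illustrated with a minimal sketch. The details below are assumptions, not the thesis's actual method: feature weights are taken as absolute correlation with the target (the thesis's scheme additionally models relationships between features), the base learners are depth-1 regression stumps standing in for full CART models, and model weights come from inverse out-of-bag error (the thesis also folds in diversity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 300 samples, 8 features, only the first
# three informative (purely illustrative; not the thesis's datasets).
X = rng.normal(size=(300, 8))
y = X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=300)

def fit_stump(Xs, ys):
    """Depth-1 regression tree over the given feature columns."""
    best = None
    for j in range(Xs.shape[1]):
        thr = np.median(Xs[:, j])
        left = Xs[:, j] <= thr
        if left.all() or (~left).all():
            continue
        lm, rm = ys[left].mean(), ys[~left].mean()
        sse = ((ys[left] - lm) ** 2).sum() + ((ys[~left] - rm) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, j, thr, lm, rm)
    return best[1:]  # (feature index, threshold, left mean, right mean)

def stump_predict(stump, Xs):
    j, thr, lm, rm = stump
    return np.where(Xs[:, j] <= thr, lm, rm)

# Weight layer 1: feature weights from |correlation| with the target.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
feat_w = corr / corr.sum()

n_trees, m = 25, 3
models, feats_used, oob_mse = [], [], []
for _ in range(n_trees):
    # Weighted (instead of uniform) sampling of m candidate features.
    feats = rng.choice(X.shape[1], size=m, replace=False, p=feat_w)
    boot = rng.integers(0, len(X), size=len(X))
    oob = np.setdiff1d(np.arange(len(X)), boot)
    stump = fit_stump(X[boot][:, feats], y[boot])
    models.append(stump)
    feats_used.append(feats)
    oob_mse.append(np.mean((stump_predict(stump, X[oob][:, feats]) - y[oob]) ** 2))

# Weight layer 2: model weights from inverse out-of-bag MSE.
inv = 1.0 / np.array(oob_mse)
tree_w = inv / inv.sum()

def forest_predict(Xq):
    preds = np.stack([stump_predict(s, Xq[:, f]) for s, f in zip(models, feats_used)])
    return tree_w @ preds  # weighted, not uniform, average

print(f"training MSE: {np.mean((forest_predict(X) - y) ** 2):.3f}")
```

The key structural point is that the two weight vectors act at different stages: `feat_w` biases which features each tree sees before training, while `tree_w` reweights the trees' votes after training.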
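The abstract names "forward search with replacement" as one of the two model-weighting methods but gives no definition. One plausible reading, sketched below purely as an assumption, follows ensemble-selection-style greedy search: start from an empty bag, repeatedly add (with replacement) the model whose inclusion most reduces validation error, and take each model's normalized selection count as its weight:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in setup: predictions of 10 base models on a 100-sample
# validation set (in practice these would come from the trained trees).
y_val = rng.normal(size=100)
noise_scale = rng.uniform(0.1, 1.0, size=(10, 1))
preds = y_val + rng.normal(scale=noise_scale, size=(10, 100))

def forward_search_weights(preds, y_val, n_rounds=50):
    """Greedy forward selection WITH replacement: each round, add the
    model whose inclusion minimizes the bag average's validation MSE;
    normalized selection counts become the model weights."""
    counts = np.zeros(preds.shape[0])
    bag_sum = np.zeros_like(y_val)
    for r in range(1, n_rounds + 1):
        # Validation MSE of the bag average if model i were added now.
        cand_mse = [np.mean(((bag_sum + p) / r - y_val) ** 2) for p in preds]
        best = int(np.argmin(cand_mse))
        counts[best] += 1
        bag_sum += preds[best]
    return counts / counts.sum()

w = forward_search_weights(preds, y_val)
mse_weighted = np.mean((w @ preds - y_val) ** 2)
mse_uniform = np.mean((preds.mean(axis=0) - y_val) ** 2)
print(f"uniform MSE {mse_uniform:.4f} -> weighted MSE {mse_weighted:.4f}")
```

Selection "with replacement" matters here: a model may be picked many times, so the resulting weights are graded rather than binary, and accurate-but-redundant models naturally stop being selected once the bag already covers their contribution, which gives the accuracy/diversity trade-off the abstract describes.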
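On the parallelization point: the abstract does not say which framework the thesis targets, but the trees of a random forest are mutually independent given the data, so ensemble construction is embarrassingly parallel. A minimal single-machine sketch (a thread pool over simplified depth-1 base models; a process pool or a cluster framework would take its place at scale):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

def train_one(seed):
    """Train one base model on its own bootstrap sample.
    The model is a depth-1 split on a random feature -- a placeholder
    for a full CART regression tree."""
    r = np.random.default_rng(seed)
    boot = r.integers(0, len(X), size=len(X))
    Xb, yb = X[boot], y[boot]
    j = int(r.integers(0, X.shape[1]))
    thr = float(np.median(Xb[:, j]))
    left = Xb[:, j] <= thr
    return j, thr, float(yb[left].mean()), float(yb[~left].mean())

# Each task touches only its own bootstrap sample and returns its own
# model, so the 16 trainings can run concurrently with no coordination.
with ThreadPoolExecutor(max_workers=4) as ex:
    forest = list(ex.map(train_one, range(16)))

def predict(Xq):
    preds = [np.where(Xq[:, j] <= thr, lm, rm) for j, thr, lm, rm in forest]
    return np.mean(preds, axis=0)

print(f"forest size: {len(forest)}, MSE: {np.mean((predict(X) - y) ** 2):.3f}")
```

Seeding each worker independently (rather than sharing one generator) keeps the runs reproducible regardless of thread scheduling, which is the main pitfall when parallelizing bootstrap-based ensembles.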
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP18
Article ID: 2389388
【References】
Related journal articles (1):
1. He Qing; Li Ning; Luo Wenjuan; Shi Zhongzhi. A survey of machine learning algorithms for big data [J]. Pattern Recognition and Artificial Intelligence, 2014(04).
Link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2389388.html