A Feature Selection Method Based on the Correlation of the Original Data
Published: 2018-04-02 08:40
Topic: Lasso | Focus: least angle regression | Source: Lanzhou University, 2017 master's thesis
【Abstract】: In feature selection, Lasso, least angle regression (LARS), and stepwise regression (e.g., forward stepwise selection) can all describe the feature selection process, but each description is flawed. LARS and its Lasso modification give the solution only at the knots where a variable enters or leaves the model; the solutions between those knots are unknown, so the sparsification path produced by LARS is incomplete. In stepwise regression, too large a step size easily skips parts of the path, while too small a step size makes the computation prohibitive. Lasso yields a complete sparsification path when its tuning parameter ranges over all values, but because the parameter is continuous, the complete path can only be approximated by evaluating a large grid of parameter values, which is again computationally expensive; moreover, solving the Lasso problem is itself nontrivial. To address these problems, this thesis proposes a feature selection method based on the correlation of the original data. The method (the "formula method") applies the idea of the modified LARS algorithm to feature selection but does not center the response variable during the computation, which yields a correspondence between the predictor-response correlations and the Lasso tuning parameter. After a single pass of a modified-LARS-like algorithm, this correspondence gives an explicit Lasso solution for any value of the tuning parameter on the given data. The formula method not only improves the accuracy of the Lasso solution, it is also faster than other algorithms when Lasso must be evaluated over a large grid of parameter values. We apply the formula method to a diabetes data study, comparing it with coordinate descent and a quadratic approximation algorithm: the formula method attains the highest solution accuracy. We also compare the running times of the three algorithms under different dimensions, sample sizes, and numbers of parameter grid points; the formula method takes the least time, and its running time grows much more slowly than that of the other two methods as the dimension, sample size, and number of grid points increase. The idea behind the formula method can also be used to explain other Lasso solvers such as coordinate descent.
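The contrast the abstract draws, between LARS (exact solutions, but only at the knots of the path) and coordinate descent (one full optimization per parameter grid point), can be sketched on the diabetes data the thesis uses. This is a minimal illustration with scikit-learn of the two standard baselines, not the thesis's formula method; the grid size and tolerance are arbitrary choices for the example.

```python
# Sketch (assumes scikit-learn): the two standard ways of tracing the Lasso
# path that the abstract contrasts, on the classic diabetes data set.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, lars_path

X, y = load_diabetes(return_X_y=True)
n = len(y)

# LARS-based path: exact and piecewise linear, but reported only at the
# knots where a variable enters or leaves the active set.
alphas, _, coefs = lars_path(X, y, method="lasso")
print(f"LARS/Lasso path has {len(alphas)} knots for {X.shape[1]} predictors")

# The path starts where the tuning parameter equals the largest absolute
# predictor/response covariance -- the correlation/parameter correspondence
# that the formula method builds on.
print("first knot:", alphas[0], " max |X'y|/n:", np.max(np.abs(X.T @ y)) / n)

# Coordinate descent: one full optimization per grid value, so the cost
# scales with the number of grid points -- the drawback the abstract notes.
grid = np.logspace(np.log10(alphas[0]), -2, 50)
coefs_cd = np.array(
    [Lasso(alpha=a, max_iter=50_000).fit(X, y).coef_ for a in grid]
)
print("coordinate-descent fits run:", len(grid))
```

Between two consecutive knots the LARS coefficients can be linearly interpolated, but a plain grid search with coordinate descent recomputes each solution from scratch, which is the cost the formula method's explicit solution avoids.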
【Degree-granting institution】: Lanzhou University
【Degree level】: Master's
【Year conferred】: 2017
【Classification number】: C81
,本文編號(hào):1699605
本文鏈接:http://sikaile.net/shekelunwen/shgj/1699605.html
最近更新
教材專著