A Comparison of Lasso-Based Statistical Inference Methods for High-Dimensional Linear Regression Models

Published: 2018-07-27 15:23

[Abstract]:
Objective: This thesis introduces five Lasso-based statistical inference methods for high-dimensional linear regression models: the Lasso penalized score test (Lassoscore), multiple sample-splitting (MS-split), stability selection, the low-dimensional projection estimate (LDPE), and the covariance test (Covtest). It compares the five methods and analyzes how each performs under different high-dimensional data settings.

Methods: The basic principles of the Lasso penalized score test, multiple sample-splitting, stability selection, the low-dimensional projection estimate, and the covariance test are introduced in turn. Simulated data are generated by varying four parameters: seven sample sizes, n = 50, 75, 100, 150, 200, 300, 400; two numbers of predictors, p = 100 and 300; two correlation structures, one with mutually independent predictors and one with corr(X_i, X_j) = 0.5^{|i-j|}; and two coefficient magnitudes, one with β_1 = β_2 = β_3 = β_4 = β_5 = 5 and β_j = 0 for j > 5, the other with β_1 = β_2 = β_3 = β_4 = β_5 = 0.15 and β_j = 0 for j > 5. Combinations of these four parameters define the different high-dimensional data settings. The data are simulated in R and the five methods are applied to them. The methods are then compared across settings using expected false positives (EFP) and power as evaluation criteria.

Results: In the ideal high-dimensional setting, all methods perform well except the covariance test, whose inferences are conservative. Stability selection has the lowest EFP and the highest power and performs best of the five. The low-dimensional projection estimate, stability selection, and multiple sample-splitting all rely on the βmin condition (the nonzero coefficients must be sufficiently large in magnitude). Stability selection depends on this condition too heavily, so its power drops sharply in complex high-dimensional settings and it performs poorly there. In complex settings the low-dimensional projection estimate is conservative at both small and large sample sizes; its power is high at moderate sample sizes, but only at the cost of a very high number of false positives. The covariance test is conservative in every setting. In complex settings the Lasso penalized score test has the highest power of the five methods, followed by multiple sample-splitting; the Lasso penalized score test also has the highest EFP, whereas the EFP of multiple sample-splitting is close to 0.

Conclusion: In common complex high-dimensional settings, the Lasso penalized score test detects truly nonzero variables better than the other four methods and places only a weak requirement on βmin, but its EFP is high. The ability of multiple sample-splitting to detect truly nonzero variables depends on whether the data satisfy the βmin condition, yet when the condition fails it is second only to the Lasso penalized score test, and its EFP is very low. The Lasso penalized score test and multiple sample-splitting are therefore the two better inference methods for high-dimensional linear regression in common complex settings, the former being more liberal and the latter more conservative. Although in practice one cannot know whether real data satisfy the βmin condition, a suitable method can be chosen according to the needs of the application.
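The simulation design described in the Methods can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the thesis's own R code. Several details are assumptions that the abstract does not state: Gaussian predictors, standard normal noise, and power measured as the average fraction of the five true signals selected. The function names `simulate_design` and `efp_and_power` are hypothetical.

```python
import numpy as np


def simulate_design(n, p, correlated, beta_value, n_signal=5, noise_sd=1.0, seed=0):
    """Draw one data set (X, y) plus the true coefficient vector.

    Assumptions not given in the abstract: Gaussian predictors and
    Gaussian noise with standard deviation `noise_sd`.
    """
    rng = np.random.default_rng(seed)
    if correlated:
        # Toeplitz correlation: corr(X_i, X_j) = 0.5^{|i-j|}
        idx = np.arange(p)
        sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    else:
        # Mutually independent predictors
        sigma = np.eye(p)
    # Colour independent standard normals with the Cholesky factor of sigma
    chol = np.linalg.cholesky(sigma)
    X = rng.standard_normal((n, p)) @ chol.T
    # First n_signal coefficients are nonzero, the rest are zero
    beta = np.zeros(p)
    beta[:n_signal] = beta_value
    y = X @ beta + rng.normal(0.0, noise_sd, size=n)
    return X, y, beta


def efp_and_power(selections, beta):
    """Summarise repeated selections against the true support.

    `selections` holds one boolean vector of length p per replicate,
    True where a method declared the variable significant.
    EFP is the mean number of false positives per replicate; power is
    taken (as an assumption) to be the mean fraction of true signals found.
    """
    truth = beta != 0
    sel = np.asarray(selections, dtype=bool)
    false_pos = (sel & ~truth).sum(axis=1)
    true_pos = (sel & truth).sum(axis=1)
    return false_pos.mean(), (true_pos / truth.sum()).mean()


if __name__ == "__main__":
    # One cell of the design grid: n = 100, p = 300, correlated, weak signal
    X, y, beta = simulate_design(100, 300, correlated=True, beta_value=0.15)
    print(X.shape, y.shape, int((beta != 0).sum()))
```

Looping `simulate_design` over the seven sample sizes, two values of p, two correlation structures, and two coefficient magnitudes reproduces the grid of settings; each inference method would then supply the boolean selections passed to `efp_and_power`.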
[Degree-granting institution]: Shanxi Medical University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: R195.1



Article ID: 2148260


Article link: http://sikaile.net/kejilunwen/yysx/2148260.html
