A Comparison of Lasso-Based Statistical Inference Methods for High-Dimensional Linear Regression Models
Published: 2018-07-27 15:23
【Abstract】: Objective: This thesis introduces five Lasso-based methods for statistical inference in high-dimensional linear regression models: the Lasso penalized score test (Lassoscore), multiple sample-splitting (MS-split), stability selection, the low-dimensional projection estimate (LDPE), and the covariance test (Covtest). It compares the five methods and analyzes how they perform under different high-dimensional data settings.

Methods: The basic principles of the Lasso penalized score test, multiple sample-splitting, stability selection, low-dimensional projection, and the covariance test are introduced in turn. Simulated data were generated by varying four parameters: seven sample sizes, n = 50, 75, 100, 150, 200, 300, 400; two numbers of predictors, p = 100, 300; two correlation structures among the predictors, either mutually independent or correlated with corr(X_i, X_j) = 0.5^|i-j|; and two regression coefficient magnitudes, either β1 = β2 = β3 = β4 = β5 = 5 with βj = 0 for j > 5, or β1 = β2 = β3 = β4 = β5 = 0.15 with βj = 0 for j > 5. Combinations of these four parameters define the different high-dimensional data settings. The data were simulated in R and analyzed with all five methods. Expected false positives (Expected False Positives, EFP) and power were used as evaluation criteria to compare the five methods across settings.

Results: In the ideal high-dimensional setting, all methods performed well except the covariance test, whose inferences were conservative. Stability selection had the lowest EFP and the highest power, making it the best of the five. Low-dimensional projection, stability selection, and multiple sample-splitting all rely on the βmin condition. Stability selection depends on it too heavily, so in complex high-dimensional settings its power drops sharply and it performs poorly. In complex settings, low-dimensional projection is conservative at both large and small sample sizes; its power is high at moderate sample sizes, but only at the cost of a very large number of false positives. The covariance test is conservative in every setting. In complex settings, the Lasso penalized score test has the highest power of the five methods, followed by multiple sample-splitting. The Lasso penalized score test also has the highest EFP, whereas the EFP of multiple sample-splitting is close to 0.

Conclusion: In common, complex high-dimensional settings, the Lasso penalized score test is better than the other four methods at detecting truly nonzero variables and places little demand on βmin, but its expected false positives are high. The ability of multiple sample-splitting to detect truly nonzero variables depends on whether the data satisfy the βmin condition, yet when the condition fails it is second only to the Lasso penalized score test, and its expected false positives are very low. The Lasso penalized score test and multiple sample-splitting are therefore the two better inference methods for high-dimensional linear regression in common, complex data; of the two, the former is more liberal and the latter more conservative. In practice it is impossible to know whether real data satisfy the βmin condition, but the inference method can be chosen according to the needs of the application.
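The simulation design in the Methods paragraph can be sketched in a few lines. This is a minimal Python sketch, not the thesis's R code. The abstract does not state the predictor or error distributions, so Gaussian predictors with unit variance and standard normal noise are assumptions here, and the function names `simulate` and `false_positives_and_power` are hypothetical. The sketch reads EFP as the number of truly null predictors flagged in one replication, to be averaged over replications, and power as the fraction of the five truly nonzero predictors that are flagged.

```python
import numpy as np


def simulate(n, p, rho=0.5, beta_value=5.0, n_signal=5, seed=0):
    """Draw one data set (X, y, beta) from the design described in the abstract.

    Assumptions not stated in the source: Gaussian predictors with unit
    variance and standard normal noise.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # corr(X_i, X_j) = rho^|i-j|; rho = 0 gives mutually independent predictors.
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    # The first n_signal coefficients are nonzero, the rest are exactly zero.
    beta = np.zeros(p)
    beta[:n_signal] = beta_value
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta


def false_positives_and_power(selected, beta):
    """Score one replication given a boolean mask of flagged predictors."""
    selected = np.asarray(selected, dtype=bool)
    truth = beta != 0
    false_positives = int(np.sum(selected & ~truth))
    power = float(np.sum(selected & truth) / np.sum(truth))
    return false_positives, power


if __name__ == "__main__":
    # One of the "complex" settings: small n, correlated predictors, weak signal.
    X, y, beta = simulate(n=50, p=100, rho=0.5, beta_value=0.15)
    # Placeholder selection, only to show the scoring; a real run would use
    # the significant variables reported by one of the five methods.
    mask = np.zeros(100, dtype=bool)
    mask[[0, 1, 2, 3, 10]] = True
    print(false_positives_and_power(mask, beta))
```

Averaging the first returned value over many replications estimates EFP, and averaging the second estimates power, for any one of the five methods under a given (n, p, correlation, coefficient) setting.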
【Degree-granting institution】: Shanxi Medical University
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: R195.1
Article ID: 2148260
Article link: http://sikaile.net/kejilunwen/yysx/2148260.html