Bayesian Variable Selection Based on Non-local Priors and Its Application to Ultrahigh-Dimensional Data Analysis
Published: 2018-08-01 14:06
[Abstract]: Objective: To compare, through simulation studies, the performance of a Bayesian variable selection method based on non-local priors with that of ISIS-SCAD and ISIS-MCP in ultrahigh-dimensional data analysis, and to apply the methods to diffuse large B-cell lymphoma (DLBCL) gene expression data in order to identify genes associated with DLBCL subtype and to provide a basis for the clinical diagnosis and treatment of DLBCL. Methods: The basic principles of the product inverse moment (piMOM) prior, a non-local prior for Bayesian variable selection, are introduced, and piMOM, ISIS-SCAD, and ISIS-MCP are applied to binary logistic regression. In the simulation study, the correlation among covariates follows one of three covariance structures: independent, compound symmetric, or autoregressive. The sample size is n = 50, 100, 200, 400, or 600, and the number of predictors is p = 1000 or 3000. The three variable selection methods are evaluated under these ultrahigh-dimensional settings in terms of model selection consistency and predictive accuracy. In the real-data analysis, the DLBCL data, comprising 350 patients and 3237 genes, are split into a training set (n = 245) and a test set (n = 105); models are fitted and validated with piMOM, ISIS-SCAD, and ISIS-MCP, and the three models are compared by AUC. Results: In the simulations, for both p = 1000 and p = 3000, the mean number of true positives among the selected variables was roughly equal across the three methods, whereas the mean number of false positives, the prediction mean squared error, and the regression-coefficient mean squared error were clearly higher for ISIS-SCAD and ISIS-MCP than for the non-local prior method. The non-local prior method also fluctuated less as the dimension increased and was more stable than ISIS-SCAD and ISIS-MCP. On the DLBCL gene expression data, piMOM selected 4 genes (MYBL1, CYB5R2, MAML3, BTLA) with an AUC of 0.989; ISIS-SCAD selected 7 genes (MYBL1, CYB5R2, MAML3, TNFRSF13B, S1PR2, SLC25A27, GAB1) with an AUC of 0.981; ISIS-MCP selected 5 genes (MYBL1, CYB5R2, MAML3, CHST2, SUB1) with an AUC of 0.962. The genes selected by all three methods were MYBL1, CYB5R2, and MAML3. Conclusions: The Bayesian variable selection method based on non-local priors outperforms the traditional penalized methods in model selection and predictive accuracy and, to some extent, controls the false positive rate well. MYBL1, BTLA, CYB5R2, and MAML3 may be associated with DLBCL subtype.
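The simulation design described in the abstract (three covariance structures, a binary logistic outcome, true/false positive counts, and AUC) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's code: the correlation parameter `rho`, the positions and values of the nonzero coefficients, the random seed, and all function names are assumptions, since the abstract does not state them.

```python
import numpy as np

def make_covariance(p, structure, rho=0.5):
    """Build a p x p covariance matrix; rho is an assumed value."""
    if structure == "independent":
        return np.eye(p)
    if structure == "compound_symmetric":
        # 1 on the diagonal, rho everywhere else
        return (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    if structure == "autoregressive":
        # AR(1): entry (i, j) equals rho ** |i - j|
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(f"unknown structure: {structure}")

def simulate_logistic(n, p, structure, true_idx, true_beta, rho=0.5, seed=0):
    """Draw correlated Gaussian covariates and a binary logistic response."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(make_covariance(p, structure, rho))
    X = rng.standard_normal((n, p)) @ chol.T  # rows have the target covariance
    beta = np.zeros(p)
    beta[true_idx] = true_beta  # hypothetical sparse truth
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, prob)
    return X, y, beta

def selection_counts(selected, true_idx):
    """Return (true positives, false positives) for a selected index set."""
    selected, truth = set(selected), set(true_idx)
    return len(selected & truth), len(selected - truth)

def auc(y_true, score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos, neg = score[y_true == 1], score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

if __name__ == "__main__":
    # One small replicate; the abstract uses n up to 600 and p = 1000 or 3000.
    X, y, beta = simulate_logistic(
        n=100, p=200, structure="autoregressive",
        true_idx=[0, 1, 2], true_beta=[1.5, -1.5, 2.0],
    )
    print(X.shape, y.mean())
    print(selection_counts(selected=[0, 1, 7], true_idx=[0, 1, 2]))
```

A variable selection method (piMOM, ISIS-SCAD, or ISIS-MCP) would be fitted on `X, y` in place of the hard-coded `selected` list; averaging `selection_counts` over many replicates gives the mean true positive and false positive numbers reported in the results.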
Article ID: 2157819
Article link: http://sikaile.net/yixuelunwen/zlx/2157819.html