Application of Boosting Methods in Discriminant Analysis of Gene Microarray Data
Published: 2018-10-23 21:19
[Abstract]: The rapid development of high-throughput microarray technology has given statisticians large volumes of microarray data. Such "small-sample, high-dimensional" data (m >> n) pose an unprecedented challenge to traditional classification and discriminant methods. Boosting, one of the ensemble algorithms, has long attracted researchers and practitioners with its "perfect" classification ability. After systematically introducing the basic idea of Boosting and the basic procedures of two of its algorithms, AdaBoost and LogitBoost, this study uses both algorithms to build discriminant prediction models for simulated data and for lower-dimensional data, and compares their predictive performance with that of two other ensemble algorithms (Bagging and Random-Forest) and three traditional discriminant analysis methods (Fisher's linear discriminant, Fisher's quadratic discriminant, and logistic regression discriminant). In view of the particular nature of gene microarray data, this study analyzes two online datasets, a leukemia dataset and a breast cancer dataset, as follows: (1) correct the P values with an FDR control procedure and screen gene variables at P≤0.05 or P≤0.01 so that the dimension is smaller than the sample size, then build discriminant prediction models and compare Boosting with the two other ensemble algorithms and the three traditional methods; (2) select different numbers of gene predictors according to the ranking of P values, build a discriminant prediction model for each, and examine the relative advantage of Boosting (including prediction accuracy and sensitivity); (3) extract principal components, perform principal component discriminant analysis, and examine the advantage of Boosting. In all cases, cross-validation is used to assess the predictive performance of the models and the stability of the predictions. Main conclusions of this study: 1. The overall predictive performance of Boosting is generally better than that of Bagging, Random-Forest, and the traditional
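The gene-screening step in (1) can be illustrated with a minimal sketch. The abstract does not say which FDR control procedure was used, so the Benjamini-Hochberg adjustment below is an assumption, and the function names `bh_adjust` and `select_genes` as well as the sample P values are hypothetical, chosen only for illustration.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order.

    Assumption: the abstract only says "FDR control procedure"; BH is one
    common choice and may not be the procedure used in the thesis.
    """
    m = len(p_values)
    # Indices of the P values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(p_values, alpha=0.05):
    """Return indices of genes whose adjusted P value is at most alpha."""
    adjusted = bh_adjust(p_values)
    return [i for i, q in enumerate(adjusted) if q <= alpha]


# Hypothetical per-gene P values (for example, from two-group tests).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = select_genes(raw_p, alpha=0.05)
print(selected)
```

Only the genes returned by `select_genes` would then be passed to the classifiers, which is how the screening keeps the number of predictors below the sample size.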
Article ID: 2290491
Article link: http://sikaile.net/yixuelunwen/binglixuelunwen/2290491.html