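Feature Selection and Sample Selection for Cancer Classification and Drug Structure-Activity Relationship Studies
Published: 2018-04-21 17:24
Topics: high-dimensional feature selection + nearest-neighbor sample selection; Source: Hunan Agricultural University (湖南农业大学), 2014 doctoral dissertation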
[Abstract]: In big-data modeling, feature selection and sample selection can greatly improve a model's predictive performance and reduce modeling time, making them necessary steps and effective tools for building classification or regression models. This dissertation optimizes models from several angles (feature acquisition and screening, choice of learning machine, and sample selection) and applies the results to cancer gene microarray data analysis (classification) and to quantitative structure-activity relationship (QSAR) studies of drugs (regression).

First, the traditional F-test and the top-scoring-pair family of algorithms make only one-way comparisons and ignore interactions. To overcome these defects, multiple genes are compared in both directions using two-way analysis of variance with unequal replication, which accounts for the interaction between multiple genes and phenotype as a whole; informative genes are then obtained through comprehensive weighted ranking and redundancy removal. Combined with transductive inference, a direct classifier that requires no training is constructed. Multi-angle comparisons of informative-gene selection and independent prediction on 10 multi-class tumor expression datasets show that: 1) the new method reaches an average prediction accuracy of 92.06%, better than the reference models, while using fewer informative genes; 2) it outperforms the top-scoring series and correlation-based gene selection algorithms; 3) it is comparable to support vector classification and better than linear logistic regression and naive Bayes. For the leukemia and breast cancer data, multiple rounds of gene selection followed by Gene Ontology analysis of biological pathways identified several important pathways and disease-causing genes.

Second, because analysis of variance is not suitable for feature selection on regression data, the Binary Matrix Shuffling Filter (BMSF) is applied to QSAR studies of antitumor drugs for two cell lines, RPMI8402 and P388. A total of 2923 high-dimensional molecular descriptors were obtained with the quantum chemistry software PCLIENT, features were screened with BMSF, and Support Vector Regression (SVR) was used for modeling and prediction. The results show that the SVR model based on the literature descriptors outperforms multiple linear regression, stepwise linear regression, and partial least squares regression, and is comparable to artificial neural networks. For the high-dimensional descriptors, feature screening retained 11 features for each cell line; the SVR models built on the retained descriptors outperform the other reference models, the nonlinear regression is highly significant, and the single-factor importance of most retained descriptors is significant. The analysis of descriptor effects on drug activity offers ideas for designing highly active antitumor drugs.

Furthermore, feature screening and sample selection are considered together: BMSF and the geostatistical semivariogram are applied to QSAR modeling of angiotensin-converting enzyme inhibitors and of peptides binding to human leukocyte antigen class I molecules. Peptide sequences are characterized by 531 amino acid physicochemical properties, features are screened with BMSF, and a common range is determined by geostatistics. For each sample to be predicted, the K nearest neighbors whose distance is smaller than the common range are selected from the training set, and SVR is used to make an individualized prediction. The results show that, starting from 1593 and 4779 high-dimensional descriptors, an average of 15.4 and 15.8 features respectively were retained over 5 sample splits after feature screening, and the independent prediction accuracies Q2pred were 0.982 and 0.806 respectively, both better than the literature reference models and the models using only one kind of selection. The residue distributions and preferences of several descriptor subsets are analyzed, providing theoretical guidance for designing highly active peptides. The methods have fairly broad application prospects in biomarker screening, pattern classification, molecular activity prediction, and related fields.
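The per-sample neighbor selection described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the Euclidean distance, the fallback to the full training set when no neighbor lies within the range, and the mean-of-neighbors local model are all assumptions made for illustration. The abstract states that the local model is SVR and that the common range comes from a semivariogram fit; here the range is simply passed in as a number, and the `fit` argument is where an SVR trainer would be plugged in.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_neighbors(query: Vector, train_X: Sequence[Vector],
                     common_range: float, k: int) -> List[int]:
    """Indices of at most k training samples closer than common_range,
    ordered from nearest to farthest."""
    scored = sorted((euclidean(query, x), i) for i, x in enumerate(train_X))
    return [i for dist, i in scored if dist < common_range][:k]


def mean_learner(X: Sequence[Vector], y: Sequence[float]) -> Model:
    """Placeholder local model (assumption): predicts the mean neighbor target.
    The abstract uses SVR here; swap in any trainer with this signature."""
    average = sum(y) / len(y)
    return lambda query: average


def predict_local(query: Vector, train_X: Sequence[Vector],
                  train_y: Sequence[float], common_range: float, k: int,
                  fit: Callable[[Sequence[Vector], Sequence[float]], Model] = mean_learner
                  ) -> float:
    """Fit a model on the query's in-range neighbors and predict the query."""
    idx = select_neighbors(query, train_X, common_range, k)
    if not idx:
        # Assumed fallback: no sample within range, so use the whole training set.
        idx = list(range(len(train_X)))
    model = fit([train_X[i] for i in idx], [train_y[i] for i in idx])
    return model(query)


# Toy data (hypothetical): one feature, target equal to the feature value.
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]
neighbors = select_neighbors([0.9], train_X, common_range=1.5, k=2)
prediction = predict_local([0.9], train_X, train_y, common_range=1.5, k=2)
```

Keeping the local learner as a pluggable callable separates the sample-selection step, which is what the sketch demonstrates, from the choice of regressor. Each test sample gets its own small training set, so the cost of prediction grows with the number of test samples.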
[Degree-granting institution]: Hunan Agricultural University (湖南农业大学)
[Degree level]: Doctorate
[Year degree conferred]: 2014
[Classification number]: Q811.4; R96
Article ID: 1783372
Article link: http://sikaile.net/yixuelunwen/swyx/1783372.html