Research on Informative Gene Selection and Classification Methods for Tumors
[Abstract]: The emergence and rapid development of large-scale gene expression profiling provide a brand-new technology platform for tumor research. Data mining based on gene expression profiles is of great significance for discovering pathogenic genes, diagnosing tumors clinically, judging the curative effect of drugs, and elucidating mechanisms of pathogenesis. Tumor gene expression profile data are characterized by high feature dimensionality, small or relatively small sample sizes, large differences in sample background, high redundancy, nonlinearity, and interaction effects between genes, so traditional statistical and pattern recognition methods are of limited use. Based on these characteristics, this thesis studies methods for selecting informative genes and constructing classifiers. The main results are as follows: (1) A high-dimensional feature selection method, the binary matrix rearrangement filter BMSF (Binary Matrix Shift Filter), is developed based on the support vector machine. Most informative gene selection methods consider only the effect of a single gene or a pair of genes and ignore interactions among multiple genes. The BMSF algorithm proposed here takes interactions among multiple genes into account: by introducing a randomly generated intermediate (0,1) binary matrix, it transforms the classification problem into a regression problem and performs high-dimensional feature selection based on the support vector machine with optimized kernel function parameters. During gene selection, the subset of genes remaining in the model is recursively optimized and updated according to each gene's contribution, alongside the other genes, to tumor classification.
On 9 two-class tumor gene expression data sets, BMSF, using small subsets of informative genes, achieved prediction accuracy far superior to that reported in the literature, and the selected informative gene subsets improved the prediction accuracy of several classifiers at the same time. (2) A robust high-dimensional feature selection method based on the chi-square test, the training-free algorithm TSG (Top-scanning genes), is developed. Prediction accuracy depends not only on feature selection but also on the classifier, and training is the main cause of overfitting in most classifiers. The mainstream TSP family of algorithms serves as both a feature selection method and a classifier. The TSG algorithm is proposed to overcome defects such as those related to sample size, the fixed number of selected informative genes, and the cumbersome handling of multi-class problems. TSG introduces and implements direct classification based on transductive inference, with no training required. The decision process is as follows: first, assume that the sample to be tested belongs to the positive (+) class and combine it with the training samples to obtain the chi-square value Chi+; then assume that it belongs to the negative (-) class and combine it with the training samples to obtain the chi-square value Chi-; if Chi+ > Chi-, the sample is assigned to the positive class, and vice versa. Further cases follow by analogy.
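The training-free decision rule for TSG can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that each selected feature has already been discretized to 0/1 (the abstract does not say how expression values are discretized), that the per-feature chi-square values are simply summed, that the comparison is Chi+ > Chi-, and that ties go to the negative class. The names `chi2_2x2`, `total_chi2`, and `direct_classify` are hypothetical.

```python
from typing import List, Sequence


def chi2_2x2(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-square statistic of the 2x2 table (binary feature vs. binary class)."""
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: treated as carrying no association
    return n * (a * d - b * c) ** 2 / denom


def total_chi2(X: List[List[int]], y: List[int]) -> float:
    """Sum of chi-square values over all feature columns (an assumed aggregation)."""
    return sum(chi2_2x2([row[j] for row in X], y) for j in range(len(X[0])))


def direct_classify(train_X: List[List[int]], train_y: List[int],
                    sample: List[int]) -> int:
    """Label the sample 1 (+) or 0 (-) without fitting any model."""
    X = list(train_X) + [list(sample)]
    chi_plus = total_chi2(X, list(train_y) + [1])   # hypothesis: sample is positive
    chi_minus = total_chi2(X, list(train_y) + [0])  # hypothesis: sample is negative
    return 1 if chi_plus > chi_minus else 0         # ties go to negative (assumption)


# Toy data: feature 0 separates the classes, feature 1 is mostly noise.
train_X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
train_y = [1, 1, 1, 0, 0, 0]
print(direct_classify(train_X, train_y, [1, 1]))
print(direct_classify(train_X, train_y, [0, 0]))
```

The hypothesis that agrees with the training data strengthens the association between features and class, so it yields the larger chi-square total; no parameters are fitted at any point.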
The TSG feature selection process is as follows: the gene with the highest score is selected as the initial informative gene subset; then, one gene at a time, the remaining gene that combines best with the genes already selected is added to the subset; the final informative gene subset is determined automatically according to the leave-one-out accuracy on the training set. TSG achieved good independent prediction results on 9 two-class and 10 multi-class data sets; in particular, the leave-one-out accuracy on the training set is very close to the accuracy on the independent test set. On some data sets the independent test accuracy is even better than the training-set accuracy, which shows that TSG's distinctive training-free direct classification can effectively control overfitting. (3) A new informative gene selection method, χ²-IRG-DC (Chi-square test-based Integrated Rank Gene and Direct Classifier), is developed based on gene interactions and the chi-square test. The χ²-IRG-DC feature selection process comprises the following steps: first, the single-gene chi-square value and the pairwise gene interaction chi-square values are used to compute an integrated weighted score for each gene, which ranks the genes by importance; next, the ranked genes are introduced sequentially using the χ²-DC classifier, with the leave-one-out accuracy on the training set as the first criterion and the chi-square gain as the second criterion for removing redundancy, yielding a more robust informative gene subset; finally, independent prediction is carried out with χ²-DC and the selected informative genes. The integrated weighted score greatly reduces the complexity of the algorithm, and the second criterion, chi-square gain, enhances the robustness of feature selection.
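The greedy, leave-one-out-driven selection described for TSG at the start of this passage can be sketched in Python as follows. This is a sketch under stated assumptions, not the thesis code: features are assumed to be 0/1, the "score" of a gene is taken to be its 2x2 chi-square value, the "best combination effect" is approximated by leave-one-out accuracy of a chi-square direct classifier, and selection stops when no candidate improves that accuracy. All function names and the `max_genes` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def chi2_2x2(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-square statistic of the 2x2 table (binary feature vs. binary class)."""
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def classify(train_X: List[List[int]], train_y: List[int],
             sample: List[int], genes: List[int]) -> int:
    """Training-free decision: pick the label hypothesis with the larger chi-square sum."""
    rows = list(train_X) + [list(sample)]

    def total(labels: List[int]) -> float:
        return sum(chi2_2x2([r[g] for r in rows], labels) for g in genes)

    return 1 if total(list(train_y) + [1]) > total(list(train_y) + [0]) else 0


def loo_accuracy(X: List[List[int]], y: List[int], genes: List[int]) -> float:
    """Leave-one-out accuracy of the direct classifier restricted to `genes`."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += classify(rest_X, rest_y, X[i], genes) == y[i]
    return hits / len(X)


def forward_select(X: List[List[int]], y: List[int],
                   max_genes: int) -> Tuple[List[int], float]:
    """Seed with the top-scoring gene, then add one gene at a time while LOO improves."""
    n_genes = len(X[0])
    first = max(range(n_genes), key=lambda g: chi2_2x2([r[g] for r in X], y))
    selected, best_acc = [first], loo_accuracy(X, y, [first])
    while len(selected) < max_genes:
        candidates = [g for g in range(n_genes) if g not in selected]
        if not candidates:
            break
        # Ties in accuracy are broken by the larger gene index (arbitrary choice).
        acc, gene = max((loo_accuracy(X, y, selected + [g]), g) for g in candidates)
        if acc <= best_acc:
            break  # subset size is fixed automatically once LOO stops improving
        selected.append(gene)
        best_acc = acc
    return selected, best_acc


X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]
print(forward_select(X, y, max_genes=3))
```

Because the stopping rule is tied to leave-one-out accuracy, the number of selected genes is not fixed in advance, which mirrors the automatic determination of the final subset described in the text.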
Independent prediction accuracy on 9 two-class and 10 multi-class tumor gene expression profile data sets shows that the χ²-IRG-DC model outperforms the results reported in the literature. As a feature selection method, χ²-IRG-DC is clearly superior to the four reference feature selection methods mRMR, SVM-RFE, HC-K-TSP, and TSG; as a classifier, χ²-DC outperforms NB, KNN, and other reference classifiers. The methods in this thesis are of theoretical and practical value for advancing high-dimensional feature selection and tumor classification.
【Degree-granting institution】: Hunan Agricultural University
【Degree level】: Doctoral
【Year degree conferred】: 2015
【Classification number】: R730.2
【Citation】: Zhang Hongyan. Research on Informative Gene Selection and Classification Methods for Tumors [D]. Hunan Agricultural University, 2015.
Article ID: 2453761
Article link: http://sikaile.net/yixuelunwen/zlx/2453761.html