Research on Informative Gene Selection and Classification Methods for Tumors

Published: 2019-04-04 11:23

[Abstract]: Tumors result from the combined action of multiple genes and the environment. The emergence and rapid development of large-scale gene expression profiling has given tumor research a new technology platform. Data mining of gene expression profiles is of great significance for discovering pathogenic genes, diagnosing tumors clinically, judging drug efficacy, and elucidating pathogenesis. Tumor gene expression data typically have high feature dimensionality, small or relatively small sample sizes, large differences in sample background, non-random noise such as batch effects, high redundancy, nonlinearity, and interaction effects between genes, all of which limit the applicability of traditional statistical and pattern recognition methods. Addressing these characteristics, this thesis studies informative gene selection methods and classifier construction. The main results are as follows.

(1) BMSF (Binary Matrix Shift Filter), a new high-dimensional feature selection method, was developed based on the support vector machine (SVM). Most informative gene selection methods consider only the effect of single genes or gene pairs and ignore interactions among multiple genes. BMSF takes the interactions among multiple genes into account: by introducing a randomly generated intermediate (0,1) binary matrix, it converts the classification problem into a regression problem and achieves SVM-based high-dimensional feature selection with the kernel function parameters optimized. During gene selection, the subset of genes retained in the model is recursively optimized and repeatedly updated according to each gene's contribution to the other genes in tumor classification. On 9 binary-class cancer gene expression datasets, BMSF achieved leave-one-out prediction accuracy far better than that reported in the literature while using relatively small informative gene subsets, and the selected subsets improved the leave-one-out accuracy of several classifiers at the same time.

(2) TSG (Top-scoring genes), a new algorithm for robust high-dimensional feature selection and training-free direct classification, was developed based on the chi-square test. Prediction accuracy depends on both feature selection and the classifier, and training is the main cause of overfitting in most classifiers. The mainstream TSP (Top score pairs) family of algorithms is both a feature selection method and a classifier. TSG was proposed to overcome the shortcomings of TSP: it cannot reflect sample size, it always selects an even number of informative genes, and it becomes cumbersome for multi-class problems. TSG proposes and implements direct classification based on transductive inference, with no training step. Its decision procedure is as follows: first assume that the test sample belongs to the positive (+) class and merge it with the training samples to obtain the chi-square value Chi+; then assume that the test sample belongs to the negative (-) class and merge it with the training samples to obtain the chi-square value Chi-; if Chi+ > Chi-, the test sample is assigned to the positive class, otherwise to the negative class. Multi-class problems are handled by analogy. The feature selection procedure of TSG is as follows: the highest-scoring gene pair is selected as the initial informative gene subset; then, at each step, the remaining gene with the best joint effect with the already selected genes is added to the subset; the final informative gene subset is determined automatically by the leave-one-out accuracy on the training set. In independent prediction on 9 binary-class and 10 multi-class datasets, TSG obtained results clearly better than those reported in the literature. Notably, its leave-one-out accuracy on the training set was quite close to its accuracy on the independent test set, and on some datasets the independent test accuracy was even higher, which shows that TSG's distinctive training-free direct classification effectively controls overfitting.

(3) χ²-IRG-DC (Chi-square test-based Integrated Rank Gene and Direct Classifier), a new informative gene selection method, was developed based on gene interaction and the chi-square test. Its feature selection procedure is as follows: first, single-gene chi-square values and pairwise gene interaction chi-square values are used to compute an integrated weighted score for each gene, which gives a ranking of gene importance; next, the ranked genes are introduced sequentially using the χ²-DC classifier, and redundancy is removed with the leave-one-out accuracy on the training set as the first criterion and the chi-square gain as the second criterion, which yields a more robust informative gene subset; finally, independent prediction is carried out with χ²-DC and the selected informative genes. χ²-IRG-DC inherits the advantages of TSG, substantially reduces algorithmic complexity through the integrated weighted gene score, and strengthens the robustness of feature selection by introducing chi-square gain as the second criterion. Independent prediction accuracy on 9 binary-class and 10 multi-class tumor gene expression datasets shows that the χ²-IRG-DC model clearly outperforms results reported in the literature. As a feature selection method, χ²-IRG-DC clearly outperforms the four reference methods mRMR, SVM-RFE, HC-K-TSP, and TSG. As a classifier, χ²-DC clearly outperforms reference classifiers such as NB and KNN and is comparable in performance to the SVM classifier.

The methods in this thesis have important theoretical significance and practical value for advancing high-dimensional feature selection and tumor classification.
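The training-free decision rule described for TSG (assume a label for the test sample, merge it with the training samples, compute a chi-square value, and keep the label with the larger value) can be illustrated with a minimal Python sketch. The abstract does not say how the chi-square value is computed from expression data, so the sketch assumes one plausible choice: each selected gene pair (i, j) is turned into the binary indicator "expression of gene i exceeds expression of gene j", and the Pearson chi-square statistic of the indicator-by-class contingency table is summed over the pairs. The function names `chi_square`, `pair_score`, and `direct_classify` are hypothetical and are not taken from the thesis.

```python
import numpy as np


def chi_square(table):
    """Pearson chi-square statistic of a contingency table (no continuity correction)."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    mask = expected > 0  # ignore cells of empty rows/columns to avoid division by zero
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())


def pair_score(X, y, i, j, classes):
    """Assumed scoring: association between the indicator X[:, i] > X[:, j] and the class."""
    indicator = X[:, i] > X[:, j]
    table = [[np.sum(indicator & (y == c)), np.sum(~indicator & (y == c))]
             for c in classes]
    return chi_square(table)


def direct_classify(X_train, y_train, x_test, pairs):
    """Transductive decision: try every label for the test sample, keep the best one."""
    classes = np.unique(y_train)
    X_all = np.vstack([X_train, x_test])
    totals = []
    for c in classes:
        # Hypothesize that the test sample belongs to class c and merge it in.
        y_all = np.append(y_train, c)
        totals.append(sum(pair_score(X_all, y_all, i, j, classes) for i, j in pairs))
    # For two classes this reduces to comparing Chi+ with Chi-.
    return classes[int(np.argmax(totals))]
```

Calling `direct_classify(X_train, y_train, x_test, pairs)` with an expression matrix of shape (samples, genes), a label vector, one test profile, and a list of gene index pairs returns the hypothesized label that yields the largest summed chi-square value. No model parameters are fitted, and because the loop runs over all class labels, the same code covers the multi-class case.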
[Degree-granting institution]: Hunan Agricultural University
[Degree level]: Doctorate
[Year degree granted]: 2015
[Classification number]: R730.2


[Citation]: Zhang Hongyan (張紅燕). Research on Informative Gene Selection and Classification Methods for Tumors [D]. Hunan Agricultural University, 2015.


Article ID: 2453761


Article link: http://sikaile.net/yixuelunwen/zlx/2453761.html
