Data Mining and Feature Selection for High-Dimensional Biomedical Data Based on the TCGA and PubMed Databases
[Abstract]: With the rapid development of life-science technologies, especially sequencing, the volume of biomedical data has expanded dramatically. Biomedical data are not only large in volume but also high-dimensional: the number of features is far greater than the number of observations (the sample size). These data therefore bring researchers new challenges as well as new opportunities, and how to mine the chains of relationships hidden in massive data has become a focus of research. Feature selection means choosing a subset of the original features to represent the original data; a well-designed feature selection method makes the selected features usable for subsequent data mining. It is no exaggeration to say that feature selection is to data mining what sifting sand is to finding gold: almost no complete data mining effort can avoid this step. Taking feature selection as the entry point and two important biomedical questions as the vehicle, this paper explores bioinformatics methods for high-dimensional biomedical data. We propose different features and strategies at multiple levels and examine their descriptive and predictive ability on practical biomedical questions. The feature selection methods and results developed in this paper can serve as an important reference for processing and analyzing high-dimensional biomedical data. Feature selection arises mainly in machine learning and statistics, where it refers to selecting closely relevant variables from a large pool of variables for model construction. It has three main advantages: a simpler model that is easier to understand, shorter model training time, and better generalization through reduced overfitting.
In practice, most of the variables in a variable set carry redundant information, and removing them causes no loss of information. Feature selection is therefore an indispensable step in handling massive high-dimensional biomedical data. As the 14th-century philosopher William of Ockham put it in the principle known as Occam's razor: entities should not be multiplied beyond necessity. Feature screening and model simplification can be said to be the soul of massive data processing, which makes feature selection a key step in processing massive biomedical data and the starting point of this paper. At present there are two main kinds of feature selection methods: one screens for statistical signals using the topological structure of the data itself; the other introduces external knowledge, such as background knowledge from a specific field. In this paper, using data from The Cancer Genome Atlas (TCGA) database, both approaches are applied to predicting tumor prognosis. First, in terms of using the topological structure of the data itself, we focus on screening and discovering gene and small RNA diagnostic markers of hepatocellular carcinoma (HCC). In a network, a node with a relatively high degree is called a hub. By combining survival analysis with a study of the topological properties of survival-related molecules, we found that hub nodes are more enriched in genes associated with HCC prognosis, indicating that hub nodes in complex molecular networks are more likely to be potential features that determine HCC prognosis, i.e. molecular markers. Second, in terms of introducing domain knowledge, we focus on predicting drug response after chemotherapy in multiple tumor types. The main cause of chemotherapy failure is multidrug resistance (MDR).
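The hub definition above (a node with a relatively high degree) can be illustrated with a minimal sketch. It is not the thesis's actual pipeline: the gene names, the toy edge list, and the `top_fraction` cutoff are all hypothetical, and the abstract does not state which degree threshold was used.

```python
from collections import Counter


def find_hubs(edges, top_fraction=0.1):
    """Return the nodes whose degree ranks in the top `top_fraction` of the network.

    `edges` is an iterable of (node, node) pairs from an undirected network.
    The cutoff is an illustrative assumption, not a value from the thesis.
    """
    # Treat the network as undirected and simple: drop self-loops and duplicates.
    unique_edges = {frozenset((u, v)) for u, v in edges if u != v}

    degree = Counter()
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    if not degree:
        return []

    n_hubs = max(1, int(len(degree) * top_fraction))
    # Sort by descending degree; break ties by name so the result is deterministic.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:n_hubs]]


# Hypothetical toy network; G1 has the highest degree (3).
toy_edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
    ("G2", "G3"), ("G4", "G5"), ("G2", "G1"),  # last pair duplicates G1-G2
]
hubs = find_hubs(toy_edges, top_fraction=0.2)
print(hubs)
```

In a workflow like the one described, the resulting hub list would then be compared against survival-associated genes to test for enrichment; that step is omitted here.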
Drug resistance is a relatively complex process. It usually arises because the protein encoded by a drug-resistance gene is overexpressed and the chemotherapeutic agent is pumped out of the cell by an energy-dependent efflux pump, which reduces the accumulation of the agent within cells and leads to drug resistance. We therefore treat gene mutation as the exposure and tumor drug resistance as the outcome, and screen genes by combining the relative risk (RR) with the statistical significance (P-value), obtaining drug-resistance-related mutated genes for eight tumor types as the feature set of the prediction model. Using this feature set, we applied three machine learning methods to predict drug resistance in samples of the eight tumor types. In head and neck squamous cell carcinoma (HNSC) in particular, the area under the ROC curve (AUC) reaches 0.980, indicating that a model built on features derived from domain knowledge can distinguish drug-resistant from drug-sensitive patients after drug intervention and provide an important reference for choosing an appropriate treatment. Beyond drug intervention, a growing number of studies show that dietary intervention is also an important means of regulating human health. In addition to studying the prognosis of tumor therapy, we therefore also try to predict potentially health-beneficial carbohydrates, known as prebiotics, from massive text data in the PubMed database. We downloaded PubMed text for 15 known prebiotics and extracted features, then used this feature set to model and analyze candidate carbohydrates and produced a list of potential prebiotic names. This mining approach can serve as a reference for other data mining researchers and provides an important candidate list for researchers studying prebiotics.
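The mutation-screening step described above (gene mutation as exposure, drug resistance as outcome, RR combined with a P-value) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the abstract does not name the significance test or the cutoffs, so the two-sided Fisher's exact test, the thresholds `rr_min` and `p_max`, and the gene names and counts are all hypothetical.

```python
from math import comb


def relative_risk(a, b, c, d):
    """RR for a 2x2 table.

    a = mutated & resistant,   b = mutated & sensitive,
    c = wild-type & resistant, d = wild-type & sensitive.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return float("inf") if risk_unexposed == 0 else risk_exposed / risk_unexposed


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value (assumed test, not from the source)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x resistant samples among the mutated ones.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def screen_genes(tables, rr_min=2.0, p_max=0.05):
    """Keep genes whose RR and P-value both pass the (hypothetical) cutoffs."""
    selected = []
    for gene, (a, b, c, d) in tables.items():
        if a + b == 0 or c + d == 0:
            continue  # RR is undefined without both mutated and wild-type samples
        if (relative_risk(a, b, c, d) >= rr_min
                and fisher_exact_two_sided(a, b, c, d) <= p_max):
            selected.append(gene)
    return sorted(selected)


# Hypothetical counts per gene: (a, b, c, d) as defined in relative_risk.
toy_tables = {
    "GENE_A": (8, 2, 2, 8),    # high RR and significant
    "GENE_B": (5, 5, 5, 5),    # RR = 1, no association
    "GENE_C": (3, 1, 4, 12),   # high RR but not significant at this sample size
}
selected_genes = screen_genes(toy_tables)
print(selected_genes)
```

Requiring both criteria guards against genes such as the hypothetical `GENE_C`, whose large RR rests on too few mutated samples to be statistically reliable.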
Data mining is becoming more and more important as large-scale data in biomedicine are opened up. Data mining helps us understand life at the system level and is an important method for studying the life sciences, and feature selection is the soul of data mining. Building on this work, future research will consider full-text data together with biological expression data in an attempt to make meaningful contributions to improving human health.
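[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; R318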
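Shan Guangyu; Data Mining and Feature Selection for High-Dimensional Biomedical Data Based on the TCGA and PubMed Databases [D]; Academy of Military Medical Sciences of the Chinese People's Liberation Army; 2017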