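Data Mining and System Modeling of Biological Diseases
Published: 2018-03-05 15:18
Topic: dimensionality reduction. Entry point: model selection. Source: Shanghai Jiao Tong University, doctoral dissertation, 2014. Document type: degree thesis.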
[Abstract]: In the post-genomic era, processing biological data at every level is a central task of bioinformatics. Learning and selecting informative signals from massive data sets, in order to identify and analyze the molecular characteristics and patterns of specific diseases, is essential for diagnosis and prognosis. More importantly, studying disease from a systems biology perspective and building quantitative regulatory network models has become a key step in research on the molecular mechanisms of major diseases. However, existing learning algorithms are not tailored to the characteristics of disease-related data, and computational methods for learning from high-throughput data have not been designed for specific diseases, so they fail to capture all the key features of a disease. In particular, the lack of quantitative models means that some gene expression regulatory networks have not been effectively built and analyzed. One of the main causes is the "small sample problem": too many disease-related features and too little experimental data. This thesis focuses on learning the key features of a range of diseases and the quantitative molecular dynamic mechanisms related to them, and designs dedicated algorithms for different biomedical problems, with particular attention to the small sample problem. The work consists of three parts.

1. A "feature merging selection algorithm" is designed for metagenomic 16S rRNA data from pneumonia and dental caries, to learn and extract feature combinations describing microbial species. The algorithm substantially reduces the dimension of the feature space while retaining a sufficient number of original features, and the new feature combinations do not overlap, which makes them easier to interpret. Validated on metagenomic data from the two diseases, the algorithm achieves a higher recognition rate than other methods while keeping the dimension low, which makes the model more stable.

2. Biological experiments show that the genes Maff and Egr3 are both highly expressed in normal hematopoietic stem cells of leukemic mice and affect the cell cycle in opposite ways. Using online bioinformatics resources, the thesis applies an "exhaustive enumeration and model selection" procedure to select a quantitative model of cell cycle regulation by Maff and Egr3. The model is validated by simulating the expression levels of a series of key cell cycle molecules and by binding-site sequence scanning. Dynamic simulation then yields a series of conclusions: Egr3 strongly inhibits the cell cycle, whereas the promotion of the cell cycle by Maff is constrained by Egr3. The results also support a "carcinogenesis/self-protection" mechanism of normal cells in a leukemic environment.

3. For the gene expression regulatory network active during adipocyte differentiation, a parameter estimation algorithm for quantitative gene regulatory networks, the "small-sample iterative optimization algorithm", is designed to address the small sample problem in gene expression data. The algorithm estimates reasonable parameters correctly and accurately even when the sample size is clearly insufficient, which makes it possible to construct the quantitative regulatory network, and it is validated on the regulatory networks of two species, human and mouse. In addition, by searching for genes whose expression differs strongly before and after differentiation, comparative computation identifies a series of additional feedback structures, which are also validated. Based on the estimated quantitative networks, human and mouse adipocyte differentiation are compared in terms of parameter magnitudes, dynamic behavior, and statistical differences in regulatory strength. The comparison points to a potential relationship between the many differences in the details of gene expression regulation in the two species and the difference in efficiency between the human and mouse adipocyte differentiation systems.
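The abstract states only the properties of the feature merging step: fewer dimensions, every original feature still contributing, and no overlap between the new feature combinations. The sketch below is not the thesis's algorithm. It is a minimal illustration of those properties, under the assumptions that features are merged when their Pearson correlation reaches a hypothetical `threshold` and that a merged feature is the sum of its members (for example, summed abundances). The names `pearson` and `merge_features` are made up for this example.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def merge_features(samples, threshold=0.9):
    """Partition features into disjoint groups and sum each group per sample.

    Assumption: two features join the same group when their correlation is at
    least `threshold`; grouping is transitive (single linkage).
    """
    n_features = len(samples[0])
    columns = [[row[j] for row in samples] for j in range(n_features)]
    group_of = list(range(n_features))  # each feature starts in its own group
    for i, j in combinations(range(n_features), 2):
        if pearson(columns[i], columns[j]) >= threshold:
            old, new = group_of[j], group_of[i]
            group_of = [new if g == old else g for g in group_of]
    by_label = {}
    for idx, label in enumerate(group_of):
        by_label.setdefault(label, []).append(idx)
    groups = list(by_label.values())  # disjoint by construction
    merged = [[sum(row[k] for k in grp) for grp in groups] for row in samples]
    return groups, merged


# Toy data: 4 samples x 3 features; features 0 and 1 are perfectly correlated.
samples = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 0]]
groups, merged = merge_features(samples)
```

Because each original feature index appears in exactly one group, a merged feature can be read directly as "the combined signal of these original features", which is the interpretability property the abstract emphasizes.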
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: R318