
Research on Epistasis Detection Algorithms Based on Random Forest and Gradient Boosting Models

Published: 2018-06-21 02:11

  Topic: genome-wide association analysis + epistasis; Reference: Harbin Institute of Technology, 2016 master's thesis


[Abstract]: Over the past decade, genome-wide association studies (GWAS) have improved our understanding of disease genetics and played a key role in discovering genotype-phenotype relationships. In GWAS, geneticists rely on DNA polymorphism markers to detect these associations. Single nucleotide polymorphisms (SNPs) are among the most widely used genetic markers and can be used to investigate the causes of disease and the underlying biological mechanisms. To date, most genetic association studies have used a single-locus analysis strategy, in which each genetic variant is tested individually for association with a specific phenotype. This strategy has not been successful for complex diseases such as hypertension, diabetes, and asthma, because single-locus analysis ignores epistasis: some loci affect disease only through interactions with other genes, while their own main effects are very small or absent. This phenomenon is also referred to as "missing heritability". Studies have shown that epistasis is a ubiquitous component of the etiology of complex human diseases and plays a crucial role in the genetic control of many traits. High-throughput sequencing technology allows researchers to detect epistasis genome-wide and to better reveal the underlying genetic mechanisms of complex diseases. The first difficulty in genome-wide epistasis detection is the computational burden.

This thesis proposes a prescreening model based on a hybrid random forest framework to select the best candidate set, and then applies the multifactor dimensionality reduction (MDR) algorithm to detect epistasis within that candidate set. The hybrid random forest model can screen out both epistatic models with significant main effects and pure epistatic models whose main effects are weak but whose combined effect is significant. The algorithm was validated in experiments on four types of models: additive, multiplicative, threshold, and pure epistatic. The experimental results show that the algorithm has practical value.

In addition, we propose a permutation method based on a gradient boosting model to detect pure epistasis with weak main effects. The proposed permuted gradient boosting model, pGBM, detects the SNP pairs most likely to interact by removing the influence of the SNP interaction on the GBM classifier. We define the interaction by the average difference in AUC, which allows the model to be applied to imbalanced datasets. In the experiments, the detection power of the algorithm reaches 100% when the heritability is greater than 0.01, and it remains far higher than that of the pRF algorithm when the heritability is below 0.01. CPU parallelism is used to speed up the model and shorten computation time: with 6 CPUs running in parallel, pGBM is 4.78 times faster than pRF. The method shows great potential: detecting gene-gene interactions to study the underlying genetic architecture helps reveal the mechanisms of complex diseases.
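
The permutation idea behind pGBM can be illustrated with a minimal Python sketch. The abstract does not give the exact permutation scheme or model settings, so everything below is an assumption for illustration only: one SNP of a pair is shuffled separately within cases and within controls (this keeps that SNP's genotype counts in each class, and hence its main effect, but breaks its joint pattern with the other SNP), a simple case fraction per joint-genotype cell stands in for the GBM classifier, and AUC is computed in-sample. All function names are hypothetical.

```python
import random
from collections import defaultdict


def auc(scores, labels):
    """Rank-based AUC: fraction of (case, control) pairs ranked correctly, ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cell_scores(snp_a, snp_b, labels):
    """Stand-in for the GBM classifier (assumption): score each sample by the
    fraction of cases in its joint genotype cell (snp_a, snp_b)."""
    cases, total = defaultdict(int), defaultdict(int)
    for x, y, lab in zip(snp_a, snp_b, labels):
        total[(x, y)] += 1
        cases[(x, y)] += lab
    return [cases[(x, y)] / total[(x, y)] for x, y in zip(snp_a, snp_b)]


def permute_within_class(values, labels, rng):
    """Shuffle one SNP separately inside controls (0) and cases (1).
    Per-class genotype counts are unchanged; pairing with other SNPs is broken."""
    out = list(values)
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        shuffled = [values[i] for i in idx]
        rng.shuffle(shuffled)
        for i, v in zip(idx, shuffled):
            out[i] = v
    return out


def mean_auc_difference(snp_a, snp_b, labels, n_perm=20, seed=0):
    """Average drop in AUC after the pair's joint structure is permuted away.
    The scorer is refit on each permuted copy (a simplification)."""
    rng = random.Random(seed)
    base = auc(cell_scores(snp_a, snp_b, labels), labels)
    diffs = []
    for _ in range(n_perm):
        b_perm = permute_within_class(snp_b, labels, rng)
        diffs.append(base - auc(cell_scores(snp_a, b_perm, labels), labels))
    return sum(diffs) / n_perm
```

A pair with a large positive mean AUC difference is one whose predictive signal depends on the joint genotype, which is the signature of an interaction; a pair whose difference stays near zero shows no such dependence. Because AUC is rank-based, it is less sensitive to the case/control ratio than raw accuracy, which fits the abstract's remark about imbalanced datasets.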
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: R440

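[Similar literature]

Related master's theses (top 2)

1 Zhang Junwei; Research on Epistasis Detection Algorithms Based on Random Forest and Gradient Boosting Models [D]; Harbin Institute of Technology; 2016

2 Sun An; Research on Epistasis Detection Algorithms and Their Implementation under the MapReduce Framework [D]; Jilin University; 2014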


Article ID: 2046717


Article link: http://sikaile.net/linchuangyixuelunwen/2046717.html

