
Research on Epistasis Detection Algorithms Based on Random Forest and Gradient Boosting Models

Published: 2018-06-21 02:11

  Topics of this thesis: genome-wide association analysis + epistasis. Source: Harbin Institute of Technology, master's thesis, 2016


【Abstract】:Over the past decade, genome-wide association studies (GWAS) have improved our understanding of disease genetics and played a key role in discovering genotype-phenotype relationships. In a GWAS, geneticists rely on DNA polymorphism markers to detect these associations. Single nucleotide polymorphisms (SNPs) are the most widely used class of genetic markers and can be used to investigate the causes of disease and the underlying biological mechanisms. To date, most genetic association studies have used a single-locus analysis strategy, in which each genetic variant is tested individually for association with a specific phenotype. This strategy has been unsuccessful for complex diseases such as hypertension, diabetes and asthma, because single-locus analysis ignores epistasis: some loci affect disease only through interaction with other genes, while their own main effect is very small or absent, a phenomenon also known as "missing heritability". Studies have shown that epistasis is a ubiquitous component of the etiology of complex human diseases and plays a crucial role in the genetic control of many traits. The arrival of high-throughput sequencing technology allows researchers to detect epistasis on a genome-wide scale and so to better reveal the genetic mechanisms underlying complex diseases. The first difficulty in genome-wide epistasis detection is the computational burden.
This thesis proposes a prescreening model based on a hybrid random forest framework to select the best candidate set, and then applies the multifactor dimensionality reduction (MDR) algorithm within that candidate set to detect epistasis. The hybrid random forest model can screen out both epistatic models with significant main effects and pure epistatic models whose main effects are weak but whose combined effect is significant. The algorithm was validated in experiments on four types of model: additive, multiplicative, threshold and pure epistatic. The results indicate that the algorithm has practical value.
In addition, we propose a permutation method based on the gradient boosting model (GBM) to detect pure epistasis with weak main effects. The proposed permuted gradient boosting model, pGBM, identifies the SNP pairs most likely to interact by removing the effect of the SNP interaction on the GBM classifier. We define interaction by the mean difference in AUC, which allows the model to be applied to imbalanced datasets. In the experiments, the detection power of the algorithm reaches 100% when heritability is greater than 0.01, and when heritability is less than 0.01 its detection power is still much higher than that of the pRF algorithm. CPU parallel computing is also used to speed up the model and shorten the computing time: with 6 CPUs running in parallel, pGBM is 4.78 times faster than pRF. The method shows great potential, and studying the underlying genetic architecture by detecting gene-gene interactions helps to reveal complex disease mechanisms.
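The second stage described in the abstract, running MDR over a prescreened candidate set, can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy dataset and the candidate list are hypothetical, the prescreening step is assumed to have already produced `candidates`, and the score is training balanced accuracy without the cross-validation that MDR normally uses. The toy labels follow an XOR pattern between SNP 0 and SNP 1, so neither SNP has a main effect on its own.

```python
from itertools import combinations, product


def mdr_balanced_accuracy(genotypes, labels, pair):
    """Balanced accuracy of the MDR high/low-risk rule for one SNP pair.

    Assumes labels are 0 (control) or 1 (case) and both classes are present.
    """
    i, j = pair
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    # Count [controls, cases] in each two-locus genotype cell.
    cells = {}
    for row, y in zip(genotypes, labels):
        counts = cells.setdefault((row[i], row[j]), [0, 0])
        counts[y] += 1
    # A cell is "high risk" when its case:control ratio exceeds the overall ratio.
    true_pos = true_neg = 0
    for ctrl, case in cells.values():
        if case * n_ctrl > ctrl * n_case:
            true_pos += case   # cases falling in a high-risk cell
        else:
            true_neg += ctrl   # controls falling in a low-risk cell
    return 0.5 * (true_pos / n_case + true_neg / n_ctrl)


def rank_pairs(genotypes, labels, candidates):
    """Score every pair of candidate SNPs and sort from best to worst."""
    scored = [
        (mdr_balanced_accuracy(genotypes, labels, pair), pair)
        for pair in combinations(candidates, 2)
    ]
    return sorted(scored, reverse=True)


# Toy data (hypothetical): 3 binary-coded SNPs, every combination repeated 5 times.
genotypes = [list(g) for g in product((0, 1), repeat=3)] * 5
# Pure epistasis: disease status is the XOR of SNP 0 and SNP 1.
labels = [g[0] ^ g[1] for g in genotypes]

# Hypothetical candidate set, standing in for the output of the prescreening step.
candidates = [0, 1, 2]
ranked = rank_pairs(genotypes, labels, candidates)
print(ranked[0])
```

On this toy data the interacting pair (0, 1) is separated perfectly, while every pair involving SNP 2 scores at chance level. Restricting the exhaustive pair search to `candidates` is what keeps it affordable: the number of pairs grows quadratically with the number of SNPs considered.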
【Degree-granting institution】:Harbin Institute of Technology
【Degree level】:Master's
【Year degree conferred】:2016
【Classification number】:R440

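【Similar literature】

Related master's theses (top 2)

1 Zhang Junwei (張俊威); Research on Epistasis Detection Algorithms Based on Random Forest and Gradient Boosting Models [D]; Harbin Institute of Technology; 2016

2 Sun An (孫安); Research on Epistasis Detection Algorithms and Their Implementation under the MapReduce Framework [D]; Jilin University; 2014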



Article ID: 2046717


Article link: http://sikaile.net/linchuangyixuelunwen/2046717.html

