Design and Implementation of a Hadoop-Based Genome-Wide Association Study System
Topic: genome-wide association study. Angle: Hadoop. Source: Tianjin University, master's thesis, 2012.
[Abstract]: With the release of the fine-scale map of the human genome, genome-wide association studies (GWAS) have developed rapidly and become an important means of studying the genetic factors behind complex human diseases. Genotype imputation increases the density of single nucleotide polymorphisms (SNPs) in the study data and improves the power of GWAS to find disease-causing genes, so imputation-based GWAS methods have been widely adopted. In practice, however, this approach faces two problems: (1) there is no integrated system tool that covers the data processing and analysis of an entire GWAS; (2) current GWAS tools for genotype imputation and association testing cannot cope effectively with the sharp growth in data volume and computation caused by larger reference data.

Building on a study of imputation-based GWAS methods and the Hadoop platform, this thesis implements CloudAssoc, a genome-wide association study system on Hadoop. The system consists of three functional modules: data preprocessing, genotype imputation, and SNP association testing. The data preprocessing module provides common data conversion and quality control functions. The genotype imputation module is designed and implemented on Hadoop and uses public data to predict genotypes at SNP sites that were not typed in the study data. The association testing module is also implemented on Hadoop and performs SNP association testing on the imputed study data.

The key to CloudAssoc's improvement in GWAS efficiency is the parallel implementation of the imputation and association testing modules. Based on an analysis of the model and algorithms used by the imputation software IMPUTE2, the thesis splits the data analysis interval so that computation tasks with very large time and resource costs are divided into many small tasks executed in a distributed fashion on a Hadoop cluster, and it parallelizes genotype imputation on the Hadoop Streaming framework. The association testing module is parallelized with a similar method.

Finally, the system is tested. The first set of tests covers the scalability and efficiency of the parallelized software in CloudAssoc and the relationship between running time and the size of the data-splitting window. The results show that the parallelized software achieves a near-linear speedup with good scalability and efficiency. An overall test of CloudAssoc then shows that the system can efficiently complete an imputation-based GWAS analysis of whole-genome data.
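
The interval-splitting idea described in the abstract can be illustrated with a minimal Python sketch. This is not CloudAssoc's code: the function names, the tab-separated task format, the region bounds, and the window size are all hypothetical, and the mapper only echoes each interval where a real Hadoop Streaming mapper would run the imputation tool on it.

```python
import sys
from typing import Iterable, Iterator, List, Tuple


def split_into_windows(start: int, end: int, window: int) -> List[Tuple[int, int]]:
    """Split the closed region [start, end] into consecutive windows
    of at most `window` base positions each."""
    if window <= 0:
        raise ValueError("window must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    windows = []
    lower = start
    while lower <= end:
        upper = min(lower + window - 1, end)
        windows.append((lower, upper))
        lower = upper + 1
    return windows


def task_lines(chrom: str, start: int, end: int, window: int) -> List[str]:
    """Describe each window as one tab-separated line (assumed format).
    With a line-oriented input file, each line is one small task."""
    return [f"{chrom}\t{lo}\t{hi}"
            for lo, hi in split_into_windows(start, end, window)]


def parse_task(line: str) -> Tuple[str, int, int]:
    """Parse one task line back into (chromosome, lower, upper)."""
    chrom, lo, hi = line.rstrip("\n").split("\t")
    return chrom, int(lo), int(hi)


def map_tasks(lines: Iterable[str]) -> Iterator[str]:
    """Streaming-style mapper body: consume task lines, emit key/value records.
    Placeholder only: a real mapper would run the imputation tool on
    the interval here and emit its results instead of 'pending'."""
    for line in lines:
        if not line.strip():
            continue  # skip blank input lines
        chrom, lo, hi = parse_task(line)
        yield f"{chrom}:{lo}-{hi}\tpending"


if __name__ == "__main__":
    # Hypothetical region and window size, chosen only for illustration.
    for task in task_lines("chr22", 1, 1_000_000, 250_000):
        print(task)
```

In a streaming job, a script of this kind would pass `sys.stdin` to `map_tasks` and print each record to standard output. The window size is the tuning parameter that the thesis relates to running time: smaller windows give more, shorter tasks.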
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: R394;TP311.52
Article ID: 1729007
Article link: http://sikaile.net/xiyixuelunwen/1729007.html