Key Technologies and System Development for Variant Analysis of Omics Big Data
Published: 2018-04-27 06:05
Topics: genomics + variation; Source: Harbin Institute of Technology, master's thesis, 2017
[Abstract]: The rapid development of next-generation sequencing technology has brought profound changes to bioinformatics research. The volume of human genome data keeps growing and yields ever more variant information, which gives precision medicine an opportunity to explore the underlying causes of disease. At the same time, this growth puts unprecedented pressure on computing infrastructure: there is a large gap between how fast genomic data are generated and how fast they can be processed, and the accompanying data analysis, storage, and retrieval technologies lag behind. This has become the bottleneck for knowledge mining on omics big data. Cloud computing, designed to handle PB-scale data, offers a promising solution to these growing demands. This thesis examines how big data technology can be used to achieve efficient analysis, secure storage, and fast retrieval of omics variant data. Building on a study of the genomic variant detection and analysis workflow, we parallelized the variant detection tool GATK in a distributed fashion, producing GATK-Spark, which is based on an in-memory computing model. We then used the distributed database HBase to store the richly annotated VCF variant files generated by GATK-Spark, and applied Fisher's exact test to the stored variants for allele frequency analysis, forming a complete analysis pipeline for omics variant big data. The resulting genomic variant big data management and analysis platform integrates variant detection, query, and analysis modules. GATK-Spark performs substantially better than GATK: on a 28-core Spark cluster, the analysis time for an individual's whole-genome resequencing data drops from 3 days to 4 hours. Variants produced by GATK-Spark are stored directly in the query engine for subsequent variant analysis. The query engine provides a programmable, interactive query interface and supports integration with a range of widely used genome browsers and tools. To compensate for HBase supporting only a primary index on the Row Key, we use Elastic Search to provide a secondary indexing mechanism for HBase, which improves the performance of queries on non-Row Key fields nearly a hundredfold. In addition, this thesis presents an allele frequency analysis algorithm based on Fisher's exact test, which offers an approach for downstream analysis of the variant information stored in HBase. Good integration with existing tools and a scalable database make the system suited to the growing demands of storing, searching, and analyzing genomic big data. It greatly simplifies the variant analysis process and provides strong support for subsequent exploration of the links between variants and the causes of disease.
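The abstract does not say how the thesis implements its allele frequency analysis, so the following is only a minimal sketch of the general idea: Fisher's exact test applied to a 2x2 table of alternate and reference allele counts in two groups. The function names, the case/control grouping, and the genotype coding (0, 1, or 2 alternate alleles per diploid sample) are assumptions made for illustration and are not taken from the thesis. The two-sided p-value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def allele_counts(genotypes):
    """Return (alt, ref) allele counts for diploid genotypes.

    Each genotype is assumed to be coded as the number of alternate
    alleles it carries: 0, 1, or 2 (a hypothetical coding).
    """
    alt = sum(genotypes)
    ref = 2 * len(genotypes) - alt
    return alt, ref


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows are groups (e.g. cases, controls); columns are alt and ref counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x alt alleles in the first group,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = [prob(x) for x in range(lowest, highest + 1)]
    # Sum every table that is as likely as, or less likely than, the
    # observed one; the small factor guards against float rounding ties.
    p_value = sum(p for p in probs if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Made-up genotypes at one variant site, purely for illustration.
    cases = [1, 2, 0, 1]
    controls = [0, 0, 1, 0]
    case_alt, case_ref = allele_counts(cases)
    ctrl_alt, ctrl_ref = allele_counts(controls)
    p = fisher_exact_two_sided(case_alt, case_ref, ctrl_alt, ctrl_ref)
    print(f"two-sided p-value: {p:.4f}")
```

In a pipeline like the one described, the allele counts would come from the variant records stored in HBase rather than from in-memory lists, and the test would be repeated for each variant site of interest.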
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4; TP311.13
Article ID: 1809499
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1809499.html