
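Research on Parallel Processing and Analysis Methods for Genetic Information in an Omics Big Data Environment

Published: 2017-12-28 20:00

Keywords: Research on Parallel Processing and Analysis Methods for Genetic Information in an Omics Big Data Environment　Source: University of Science and Technology of China, 2017 master's thesis　Type: degree thesis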

[Abstract]: As next-generation sequencing technology has developed and matured, high-throughput sequencing has become a routine tool in biological and medical research. It is also about to see wide use in agriculture and healthcare, and has given rise to emerging industries such as precision medicine and molecular breeding. Unlike earlier low-throughput technologies, high-throughput sequencing produces several kinds of omics data (whole-genome, whole-exome, transcriptome, metagenome, and others) that are high in throughput, large in volume, complex, and heterogeneous. Processing and analyzing these data involves many tedious steps and places heavy demands on both software and hardware. Fast, efficient processing and analysis of high-throughput sequencing data has therefore become the bottleneck for wider adoption of the technology. For example, precision medicine, which is currently receiving wide attention, depends mainly on gene sequencing; efficiently processing and analyzing massive amounts of patient sequencing data to extract personalized information on cancer drivers is both the key to and the main difficulty of precise tumor diagnosis and treatment. Gene sequencing has progressed from the first generation to the current third generation, and sequencing throughput has grown explosively. First-generation throughput was only 0.2MB/run, whereas second-generation technology, represented by Illumina, reaches about 1500GB/run, and third-generation technology reaches 30-400bp/s. These advances strongly support related biological and medical research, but handling the resulting volume of sequencing data has become an urgent problem for both academia and industry.
To address these problems, this thesis designs and implements SeqReduce, an automated parallel processing system for high-throughput sequencing data built on Hadoop. Its main goal is to use computer clusters to provide an efficient, stable, and low-cost automated tool for analyzing massive sequencing data. The core design idea is to use the MapReduce parallel computing framework to split, align, and query the sequencing data, and finally to output a file of mutated-gene information or a transcript file. The system has the following advantages: (1) it is compatible with data from multiple sequencing platforms, including the mainstream Illumina and Roche 454; (2) it can process not only DNA-seq data but also RNA-seq data; (3) to suit different hardware environments, it offers two parallel processing modes, a low-performance mode and a high-performance mode, so that it can run on hardware with different configurations.
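The split, align, and aggregate flow described in the abstract can be illustrated with a minimal single-process sketch in Python. This is not SeqReduce's implementation and does not use Hadoop. Every name in it (`split_reads`, `align`, `map_chunk`, `reduce_counts`, `call_variants`) is hypothetical, and the aligner is a toy that tries each ungapped placement and keeps the one with the fewest mismatches. The sketch only shows how a map step can emit per-read mismatches and a reduce step can count the reads supporting each candidate mutation.

```python
from collections import Counter
from itertools import islice


def split_reads(reads, chunk_size):
    """Split step: cut the reads into fixed-size chunks, one per map task."""
    it = iter(reads)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def align(read, reference):
    """Toy aligner (assumption): ungapped placement with the fewest mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(
        offsets,
        key=lambda o: sum(a != b for a, b in zip(read, reference[o:o + len(read)])),
    )


def map_chunk(chunk, reference):
    """Map step: emit ((position, ref_base, alt_base), 1) for every mismatch."""
    for read in chunk:
        offset = align(read, reference)
        for i, base in enumerate(read):
            ref_base = reference[offset + i]
            if base != ref_base:
                yield (offset + i, ref_base, base), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each candidate mutation."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def call_variants(reads, reference, chunk_size=2, min_support=2):
    """Run split -> map -> reduce and keep mutations seen in enough reads."""
    pairs = (
        pair
        for chunk in split_reads(reads, chunk_size)
        for pair in map_chunk(chunk, reference)
    )
    counts = reduce_counts(pairs)
    return sorted(key for key, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCT"
    reads = ["GTTGAA", "TTGAAA", "ACGTTG", "CAAGGT"]
    print(call_variants(reads, reference))
```

On a real cluster, each chunk would be handled by a separate map task and the counts merged by reducers. Here the chunks are processed one after another, purely to make the data flow visible.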
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4;TP311.13

