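Building a Hybrid Second- and Third-Generation Genome Assembly Pipeline and Parallel Optimization Methods for Sequence Assembly
Published: 2018-07-05 18:26
Topic: bioinformatics + Rocks cluster; Source: Kunming University of Science and Technology, master's thesis, 2017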
[Abstract]: With the rapid development of bioinformatics, the world has entered an era of life science and information science. Third-generation sequencing, with its long reads, has transformed genomics. As sequencing technology advances, bioinformatics faces new challenges: the growing volume of sequencing data demands more computing resources for analysis, and reads with new characteristics demand new assembly techniques. Starting from these challenges, this thesis studies hybrid second- and third-generation assembly strategies and parallel optimization methods for sequence assembly, both to meet researchers' needs for analyzing second- and third-generation sequencing data and to save computing resources during assembly. The work has three parts.

First, biological data are large in volume and diverse in form, so processing them depends on substantial computing resources. To meet the needs of the project, a bioinformatics platform was required. This thesis builds a bioinformatics platform based on the Rocks cluster system (Rocks Cluster), using existing cluster computing technology to consolidate computing resources and to provide a convenient, fast, and powerful data-processing platform for bioinformatics research.

Second, sequencing technology is changing rapidly and driving the development of genomics. This thesis analyzes the characteristics of third-generation sequencing data (long reads, relatively high error rate) and of second-generation sequencing data (short reads, low error rate), and builds a hybrid second- and third-generation genome assembly pipeline on the bioinformatics platform. The pipeline exploits the long reads of third-generation sequencing and the low error rate of second-generation sequencing: second-generation data are used to correct errors in the third-generation reads, and the corrected third-generation reads are then assembled, with the aim of a better assembly result.

Finally, the error-correction step of hybrid assembly consumes a large amount of memory, and the existing platform cannot meet that demand when assembling species with large genomes. To address this, the thesis analyzes memory usage during assembly and designs solutions suited to the structure of the laboratory's bioinformatics platform. The first uses GlobalArray to virtualize and manage the memory of different nodes, so that data and computation run separately. The second is a process-level parallel optimization method that relieves memory pressure on a single node. In search of a better solution, the thesis also takes the algorithm behind hybrid error correction itself as the point of attack and adopts a hybrid assembly idea: first assemble the second-generation data into a high-accuracy assembly graph, then align the third-generation reads to that graph and use their length to decide which paths to follow, thereby simplifying the graph. This avoids the error-correction step altogether.
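The graph-based idea in the last part of the abstract can be illustrated with a toy sketch. This is not the thesis's implementation: the function names (`build_de_bruijn`, `walk_with_long_read`), the use of a de Bruijn graph as the "assembly graph", and the exact k-mer matching that stands in for aligning long reads to the graph are all assumptions made for illustration. Real third-generation reads contain errors, so an actual pipeline would use an error-tolerant aligner rather than exact matching.

```python
from collections import defaultdict


def build_de_bruijn(short_reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers from the short reads."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def kmers_of(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def walk_with_long_read(graph, start, long_read, k, max_steps=1000):
    """Walk the graph from `start`; at a branch, follow the only edge the long read supports.

    Assumption: exact k-mer matching stands in for long-read-to-graph alignment.
    The walk stops at a dead end, at a branch the long read cannot resolve,
    or after max_steps (a guard against cycles).
    """
    support = kmers_of(long_read, k)
    node, contig = start, start
    for _ in range(max_steps):
        successors = sorted(graph.get(node, ()))
        if not successors:
            break  # dead end
        if len(successors) == 1:
            nxt = successors[0]
        else:
            supported = [s for s in successors if node + s[-1] in support]
            if len(supported) != 1:
                break  # the long read does not single out one path
            nxt = supported[0]
        contig += nxt[-1]
        node = nxt
    return contig


# Hypothetical short reads covering two alternative paths that diverge after "ACGTT".
short_reads = ["ACGTTG", "GTTGCA", "ACGTTC", "GTTCAG"]
graph = build_de_bruijn(short_reads, k=4)

# A long read spanning the branch picks one path through the graph.
print(walk_with_long_read(graph, "ACG", "ACGTTGCA", k=4))  # ACGTTGCA
```

In this sketch the short reads alone leave a two-way branch at node `GTT`; the long read spans the branch and selects a single path, so no separate correction pass over the long read is needed. When the long read supports neither or both branches, the walk stops at the branch instead of guessing.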
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q811.4
Article ID: 2101287
Article link: http://sikaile.net/shoufeilunwen/benkebiyelunwen/2101287.html