基于Hadoop基因序列比對(duì)BWT索引的建立方法研究
發(fā)布時(shí)間:2018-02-04 12:12
本文關(guān)鍵詞: 基因序列 BWT索引 后綴數(shù)組 Hadoop MapReduce 出處:《內(nèi)蒙古農(nóng)業(yè)大學(xué)》2016年碩士論文 論文類型:學(xué)位論文
【摘要】:由于基因數(shù)據(jù)的增長(zhǎng)速度飛快,人工進(jìn)行序列比對(duì)已經(jīng)無法滿足科研人員的需求,那么機(jī)器比對(duì)已經(jīng)走上了舞臺(tái),基因序列比對(duì)是基因數(shù)據(jù)分析和處理的基礎(chǔ)。而現(xiàn)在的序列比對(duì)算法大致分為兩類,一類是精確比對(duì)算法,另一類是模糊比對(duì)算法。目前,大部分的基因序列比對(duì)方法都是啟發(fā)式算法,該類算法大致分為兩步:建立索引和序列比對(duì),所以無論是精確比對(duì)算法和非精確比對(duì)算法都離不開索引結(jié)構(gòu)。由此可見,建立索引是基因序列比對(duì)算法的重要步驟,常見的索引構(gòu)建算法大致分為兩類,一類是基于哈希表的算法,另一類是基于后綴樹或后綴數(shù)組的算法。而BWT(Burrows-Wheeler Transform)索引是基于后綴數(shù)組中比較重要的索引結(jié)構(gòu)。目前,構(gòu)建較大基因組序列(例如,人類基因組序列)的BWT索引需要幾個(gè)小時(shí)的串行計(jì)算。本文提出一種基于Hadoop的并行計(jì)算方法構(gòu)建后綴數(shù)組和BWT索引。算法使用MapReduce的數(shù)據(jù)處理功能,并且更改了原有的使用哈希方式的Partitioner,本文使用直接分配任務(wù)來建立索引。本文依次將基因鏈?zhǔn)椎囊粋(gè)堿基輪轉(zhuǎn)到基因鏈尾并與鏈尾的17個(gè)字符形成一個(gè)Key以及相應(yīng)的Map任務(wù),將這些Map任務(wù)根據(jù)新改寫Partitioner分配給Reduce。最終得到全序的后綴數(shù)組和BWT索引,減少建立索引的時(shí)間。通過實(shí)驗(yàn)數(shù)據(jù)表明,本文提出的方法可以節(jié)省索引構(gòu)建的時(shí)間,達(dá)到了預(yù)期目的,并驗(yàn)證了算法的有效性。
[Abstract]:Because of the rapid growth of genetic data, artificial sequence alignment has been unable to meet the needs of researchers, so machine alignment has been on the stage. Sequence alignment is the basis of gene data analysis and processing. Now, sequence alignment algorithms are divided into two categories, one is accurate alignment algorithm, the other is fuzzy alignment algorithm. Most gene sequence alignment methods are heuristic algorithms, which can be divided into two steps: indexing and sequence alignment. Therefore, both accurate alignment algorithm and imprecise alignment algorithm can not be separated from index structure. Thus, building index is an important step of gene sequence alignment algorithm, and the common index construction algorithm can be divided into two categories. One is the algorithm based on hash table. The other is the algorithm based on suffix tree or suffix array. And BWT(Burrows-Wheeler transform). Index is based on the suffix array of the more important index structure. Construction of large genome sequences (for example. Human genome sequence. This paper presents a parallel computing method based on Hadoop to construct suffix array and BWT index. The algorithm uses the data processing of MapReduce. Function. And changed the original hash Partitioner. In this paper, the direct assignment task is used to build the index. In this paper, one base of the first gene chain is rotated to the tail of the gene chain and a Key and the corresponding Map task are formed with the 17 characters of the tail. Assign these Map tasks to Reducebased on the new overwrite Partitioner, resulting in a fully ordered suffix array and BWT index. The experimental data show that the proposed method can save the time of index construction, achieve the expected purpose, and verify the effectiveness of the algorithm.
【學(xué)位授予單位】:內(nèi)蒙古農(nóng)業(yè)大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2016
【分類號(hào)】:TP391.3
,
本文編號(hào):1490179
本文鏈接:http://sikaile.net/kejilunwen/jiyingongcheng/1490179.html
最近更新
教材專著