Approximate Deduplication of Next-Generation Sequencing Data for Cloud Platforms
Published: 2018-04-07 16:50
Topic: high-throughput sequencing. Focus: data deduplication. Source: Computer Engineering and Applications (《計算機工程與應用》), 2017, Issue 23
[Abstract]: Next-generation sequencing produces large volumes of data, involves complex processing, and demands substantial computing resources, so it needs to be processed with cloud computing. Cloud processing, however, requires the sequencing data to be uploaded to the cloud platform first. Because the sequencing process is random, the files produced by sequencing the same sample twice, or by sequencing two similar samples separately, differ considerably at the binary level. Existing deduplication methods cannot effectively identify such "duplicate" sequencing files or the "duplicate" content within sequencing results. Repeatedly uploading and storing this duplicate data consumes network bandwidth and wastes storage space. To address the problem that existing deduplication methods rely only on the binary features of files and do not exploit the similarity of sequencing result data, the paper proposes NPD (Near Probability Deduplication), an approximate deduplication method for massive high-throughput sequencing data on cloud platforms. The method computes per-chunk fingerprints with SimHash over the sequence and quality information in FASTQ files, uses two cuckoo filters, one on the client and one on the cloud platform, to test quickly whether a fingerprint already exists, and finally has the cloud platform apply an approximate algorithm to deduplicate near-identical fingerprints. Experimental results show that NPD remains efficient while substantially increasing the deduplication ratio, which in turn reduces network traffic and shortens data upload time. The method can support massive data processing and has good practical value.
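The fingerprinting and approximate-matching steps named in the abstract can be illustrated with a minimal Python sketch. It is not the NPD implementation: the abstract does not give the chunking scheme, hash function, fingerprint width, or similarity threshold, so the 64-bit width, the k-mer size `k=8`, the BLAKE2b hash, the `max_distance=3` threshold, and all function names below are assumptions made for illustration. The cuckoo-filter stage is omitted.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def shingles(text, k=8):
    """Yield the overlapping k-character substrings (k-mers) of text."""
    if len(text) <= k:
        yield text
        return
    for i in range(len(text) - k + 1):
        yield text[i:i + k]


def simhash(text, k=8):
    """Return a 64-bit SimHash fingerprint built from the k-mers of text."""
    counts = [0] * FINGERPRINT_BITS
    for gram in shingles(text, k):
        # Hash each k-mer to 64 bits (BLAKE2b is an arbitrary choice here).
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each k-mer votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set where the positive votes are in the majority.
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fingerprint, stored, max_distance=3):
    """Report whether any stored fingerprint is within max_distance bits."""
    return any(hamming(fingerprint, s) <= max_distance for s in stored)


if __name__ == "__main__":
    # Hypothetical chunk: a read sequence followed by its quality string.
    chunk = "ACGTTGCAAGGCTTAACCGGTTACGATCGA" + "IIIIHHHHGGGGFFFFEEEEDDDDCCCCBB"
    stored = [simhash(chunk)]
    print(is_near_duplicate(simhash(chunk), stored))
```

Chunks whose fingerprints differ in only a few bits are treated as approximate duplicates, which is how similar but not byte-identical sequencing content can be matched. A production system would need an index over fingerprints rather than the linear scan used here.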
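[Author affiliations]: School of Information Management, Beijing Information Science and Technology University; Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University
[Funding]: National Natural Science Foundation of China (No. 61572079); Beijing Municipal Education Commission Science and Technology Plan, General Project (No. KM201711232018)
[Classification number]: TP301.6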
Article ID: 1720026
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1720026.html