
Research on DNA Data Compression Algorithms Based on k-mer Short Sequences

Published: 2019-04-01 15:56
[Abstract]: DNA sequence data is enormous in volume, and compression is an indispensable technology in bioinformatics. It underpins the efficient storage, retrieval, and transmission of DNA sequence data and is a prerequisite for sequence assembly, sequence alignment, gene prediction, and related tasks. Research on compressing DNA sequence data therefore has considerable theoretical significance and practical value. In recent years, as information processing technology has advanced and the characteristics of DNA sequence data have become better understood, many compression algorithms designed specifically for DNA sequence data have emerged.

Starting from the highly repetitive nature of DNA sequence data, this thesis statistically analyzes the repetitiveness of very short k-mer subsequences and summarizes the repetition patterns in the distribution of bases and short k-mers.

Because the k-mer distribution differs greatly between regions of a DNA sequence, a DNA data compression algorithm based on segmented coding is proposed. In the preprocessing stage, the DNA sequence is split into short segments of 64 bases each, and each segment is treated independently. The 3-mer with the highest repetition rate in each segment is identified, and information such as its number of occurrences and its positions in the segment is used for substitution coding, thereby compressing the sequence. The segmented coding algorithm is simple and achieves fairly good compression on commonly used benchmark sequences.

Because some k-mers are highly repetitive when k is very small, a DNA data compression algorithm based on hybrid GA-PSO (genetic algorithm and particle swarm optimization) optimization is also proposed. Different combinations of equal-length k-mers are abstracted as particles, and the GA-PSO hybrid algorithm searches for the optimal combination of highly repetitive k-mers that achieves the maximum compression ratio. Occurrences of the optimal k-mers in the sequence are then encoded to compress it. Before each optimization iteration, a support vector machine model divides the particle swarm into two groups, which are then optimized by the GA algorithm and the PSO algorithm respectively. Experimental results show that the algorithm achieves a fairly high compression ratio and is more robust than traditional algorithms.
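The preprocessing step of the segmented coding algorithm (64-base segments, most frequent 3-mer per segment) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not say whether 3-mers are counted with overlap, how ties are resolved, or how the substitution code is laid out, so the sketch assumes overlapping counts and first-occurrence tie-breaking, and it stops before any encoding. The function names `split_segments` and `most_frequent_kmer` are hypothetical.

```python
from collections import Counter

SEGMENT_LENGTH = 64  # segment size stated in the abstract
K = 3                # k-mer length stated in the abstract


def split_segments(sequence, length=SEGMENT_LENGTH):
    """Split a DNA string into consecutive segments of at most `length` bases."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def most_frequent_kmer(segment, k=K):
    """Return (kmer, count, positions) for the most frequent k-mer in a segment.

    Assumptions (not from the source): occurrences are counted with overlap,
    and ties go to the k-mer that appears first. Returns None when the
    segment is shorter than k.
    """
    if len(segment) < k:
        return None
    kmers = [segment[i:i + k] for i in range(len(segment) - k + 1)]
    best, count = Counter(kmers).most_common(1)[0]
    positions = [i for i, kmer in enumerate(kmers) if kmer == best]
    return best, count, positions


if __name__ == "__main__":
    dna = "ACGACGACGT" * 13  # toy input, 130 bases
    for index, segment in enumerate(split_segments(dna)):
        print(index, len(segment), most_frequent_kmer(segment))
```

The per-segment count and position list are the pieces of information the abstract says are used for substitution coding. How they are packed into the compressed stream is left open here.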
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TN911.7
