Research on Gene Identification Algorithms Based on Sequence Statistical Features
Topic: gene identification + multi-feature fusion; Source: Harbin Institute of Technology, master's thesis, 2017
[Abstract]: Given the vast amount of whole-genome data now available for model organisms, identifying protein-coding gene sequences efficiently and accurately is of great practical value. For this reason gene identification, a foundation of bioinformatics research, has long attracted attention. Traditional approaches rely mainly on laborious biological experiments, which are slow and labor-intensive. This thesis instead draws on the theory and methods of signal processing, such as the Fourier transform, filter algorithms, intelligent computing, and statistical learning, and studies the problem from the perspective of sequence statistical features. The period-3 property, an important statistical feature, has been widely used in gene identification. To improve identification performance, researchers have contributed substantially to signal filtering of gene sequences and to enhancing the period-3 feature, yet considerable shortcomings remain. To address the problems of the fixed-step-size LMS adaptive filter in gene prediction, this thesis combines the system's feedback output with feature information on the base composition of gene sequences and proposes an improved variable-step-size LMS adaptive filter algorithm that filters better and strengthens the period-3 feature; its performance is verified through simulation experiments. The results show that the proposed algorithm has a clear accuracy advantage over existing algorithms. In addition, because short gene sequences carry weak feature information, which hinders identification, this thesis also proposes an improved algorithm that fuses multiple features with weights set according to the representational power of each individual feature. Its identification performance is analyzed mainly on a short-gene data set with sequence lengths below 200 bp; compared with traditional multi-feature fusion algorithms, the proposed algorithm is effective and robust.
Building on these two lines of work, this thesis implements a gene identification system dedicated to the human genome that combines the strengths of digital signal processing and multi-feature fusion. Because the system does not depend on traditional machine learning methods such as conditional random fields, hidden Markov models, and support vector machines, it is simple to implement, requires no training or storage of large numbers of model parameters, is not unduly influenced by the knowledge structure of existing training data sets, and can identify genes in real time. System performance is validated on the benchmark data sets ALLSEQ and HMR195.
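The period-3 property mentioned in the abstract is commonly measured by mapping a DNA sequence to four binary indicator sequences (one per base) and evaluating their discrete Fourier transform at frequency 1/3. The following is a minimal sketch of that standard measure only; it is not the thesis's variable-step-size LMS filter or its fusion scheme, which the abstract does not specify. The function names `period3_power` and `sliding_period3`, and the default window of 351 bases with a step of 3, are illustrative assumptions.

```python
import cmath

BASES = "ACGT"


def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over the four base indicators.

    For each base b, the indicator sequence is 1 where seq[n] == b and 0
    elsewhere. Only the single DFT bin at angular frequency 2*pi/3 is needed.
    """
    seq = seq.upper()
    # exp(-j*2*pi*n/3) repeats with period 3, so three factors suffice.
    twiddle = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    total = 0.0
    for base in BASES:
        # DFT bin of the indicator sequence for this base.
        coeff = sum(twiddle[n % 3] for n, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq: str, window: int = 351, step: int = 3):
    """Return (start, power) pairs for each window along the sequence.

    The window length and step are assumptions chosen for illustration;
    a window that is a multiple of 3 keeps the 1/3 bin exact.
    """
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A sequence with strict codon-like periodicity, such as `"ATG" * 50`, gives a large value from `period3_power`, while a homopolymer such as `"A" * 150` gives a value near zero. In practice, peaks in the `sliding_period3` profile are read as candidate coding regions, and filtering approaches such as the one the abstract describes aim to sharpen those peaks.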
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: Q811.4