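Research on Protein Remote Homology Detection and DNA-Binding Protein Identification
Published: 2018-03-09 07:29
Topic: protein remote homology detection. Focus: DNA-binding proteins. Source: Harbin Institute of Technology, 2017 master's thesis. Document type: degree thesis.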
[Abstract]: Proteins are the material basis of life and the principal carriers of life activities. In the post-genomic era, advances in protein sequencing technology have caused protein sequence databases to grow explosively, so identifying proteins is of great importance in biology. This thesis studies both protein structure and protein function. On the structural side, we study protein remote homology: proteins with the same or similar functions in different species show clear sequence homology, and this homology is used to assign a protein sequence of unknown class to a superfamily. On the functional side, we study DNA-binding proteins, which play important roles in living organisms, including gene transcription, recombination, repair, and replication. Working from protein primary sequences and using machine learning methods, the thesis addresses these two problems as follows.
Protein remote homology detection underpins the study of protein structure. The thesis proposes the concept of Pseudo Dimer Composition (PDC) to address the limited information carried by the original pseudo amino acid composition. First, a frequency profile containing evolutionary information is used to convert the original sequence into a protein sequence that carries evolutionary information. The PDC feature extraction method then converts the protein primary sequence into a fixed-length vector. Support vector machines combined with an ensemble learning strategy predict the superfamily of each protein; the ensemble is a linear combination in which each family's ROC value serves as its weight. The method achieves an AUC of 0.927 and an AUC50 of 0.749, and the experiments indicate that it outperforms other methods in the field.
DNA-binding protein identification is an important direction in the study of protein function. The thesis applies frequency profiles containing evolutionary information together with pseudo amino acid composition to this problem for the first time. The sequence profile and pseudo amino acid composition convert each protein sequence into a fixed-length feature vector, and a support vector machine classifier identifies DNA-binding proteins. The ensemble used in this chapter is a heterogeneous ensemble method that expands the samples to obtain more trained models to combine. On the independent test set, the accuracy is 76.56% and the AUC is 0.8392. In addition, analyzing the weights that the support vector machine assigns to different features shows how important the corresponding amino acids are to identification, which in turn allows their biological characteristics to be analyzed. To address the limited information extracted by pseudo amino acid composition, we also propose a method that fuses k-mer amino acid composition with auto-cross covariance. The k-mer amino acid composition captures amino acid distance-pair information, and the auto-cross covariance captures global physicochemical information about the amino acids. Optimizing the combination of feature parameters further improves the accuracy of DNA-binding protein identification. On the independent test set, the prediction accuracy of this method is 75.16%, which the thesis reports as a further improvement over other methods.
Finally, the thesis proposes a selective ensemble method for DNA-binding protein identification based on affinity propagation clustering. To improve prediction accuracy and to study ensemble methods in more depth, we adopt a feature extraction strategy based on distance pairs over a reduced alphabet. The affinity propagation clustering strategy is used to cluster and combine 656 base classifiers. The method reaches an accuracy of 83.87% on the independent test set, a further improvement over other methods.
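The abstract describes the remote homology ensemble only as a linear combination in which each family's ROC value is used as a weight. A minimal sketch of that idea follows. It is not the thesis's implementation: the function name `weighted_linear_ensemble`, the normalization of the weights to sum to 1, and the input layout (one list of decision scores per base classifier) are assumptions made for illustration.

```python
def weighted_linear_ensemble(scores, roc_weights):
    """Combine per-classifier decision scores with ROC-derived weights.

    scores      -- list of lists; scores[k][i] is classifier k's score
                   for sample i (hypothetical layout)
    roc_weights -- one ROC value per classifier, used as its weight
    Returns one combined score per sample.
    """
    if not scores:
        raise ValueError("at least one classifier is required")
    if len(scores) != len(roc_weights):
        raise ValueError("need exactly one weight per classifier")

    total = sum(roc_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_samples = len(scores[0])
    combined = [0.0] * n_samples
    for clf_scores, weight in zip(scores, roc_weights):
        if len(clf_scores) != n_samples:
            raise ValueError("all classifiers must score the same samples")
        # Assumption: weights are normalized so they sum to 1.
        share = weight / total
        for i, score in enumerate(clf_scores):
            combined[i] += share * score
    return combined
```

With this layout, a classifier with a higher ROC value contributes proportionally more to each combined score; ranking the samples by combined score gives the ensemble prediction.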
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q811.4; TP311.13
Article ID: 1587548
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1587548.html