Analysis and Prediction of Protein Residue Solvent Accessibility and Function Based on Intelligent Computing

Published: 2018-03-09 11:09

Topic: intelligent computing　Focus: machine learning　Source: Northeast Normal University, 2017 doctoral dissertation　Document type: degree thesis

[Abstract]: Protein structure determines protein function, and the study of protein structure is the foundation of proteomics. Residue solvent accessibility is a basic piece of structural information. It underpins the analysis of a protein's three-dimensional conformation, the construction of three-dimensional structures, the prediction of interactions between proteins and other molecules, and the study of protein metabolism and evolution. Proteins carry out their functions by interacting with other molecules (nucleic acids, proteins, and small-molecule ligands), so analyzing and identifying functional residues is of practical importance for understanding how protein function is expressed.

Traditional biophysical and biochemical methods for obtaining structural and functional information require precise and expensive instruments, tedious experimental procedures, and intensive labor. These methods benefit from the development of bioinformatics, which uses intelligent computing to provide accurate tools for predicting structural information and functional residues. In fact, only about 2‰ of proteins have reasonably accurate structural data. Faced with a rapidly growing number of proteins of unknown structure and function, intelligent-computing methods exploit the efficiency, convenience, and accuracy of computers and supply valuable clues for further experimental work. This thesis analyzes and predicts the solvent accessibility and function of protein residues. The main results are as follows.

(1) A regression method is proposed for predicting the exposure level (solvent accessibility) of protein residues, based on a weighted sliding-window strategy and particle swarm optimization. First, five types of sequence-based features are extracted to encode each residue and its neighboring residues. To quantify precisely how the solvent accessibility of neighboring residues influences the central residue, a weighted sliding-window strategy assigns a different weight to each position in the window. Finally, particle swarm optimization is used to tune the parameters of the support vector regression algorithm. On two benchmark datasets, the method predicts considerably better than previous methods. The study also examines how different regression algorithms affect the model, compares the effect of different parameter-optimization methods on prediction performance, and analyzes the sources of regression error and the average error level for each of the 20 amino acids. To verify generalization, the method is compared with several well-known prediction tools on an independent test set, and the results show that it generalizes well.

(2) A method is proposed for predicting the antigenic determinant residues involved in antigen-antibody interaction, and the potential epitopes they form, based on cost-sensitive ensemble learning and a spatial clustering algorithm. First, antigen residues are encoded with five kinds of sequence-based features: conservation, secondary structure, disordered region, dipeptide composition, and physicochemical property features. To speed up computation and remove redundant features, the Fisher-Markov Selector ranks features by their relevance to the sample labels, and incremental feature selection then yields the optimal feature subset. Epitope prediction is a typical imbalanced-data classification problem, so a cost-sensitive ensemble learning algorithm is introduced to overcome the shortcomings of traditional machine learning on such problems. Because the vast majority of antigenic determinant residues are either contiguous in sequence or close in space, a spatial clustering algorithm is applied to the predicted residues to predict the potential epitopes they may form. The method is compared with previous methods on a benchmark test set and an independent test set, and the results demonstrate its effectiveness and good generalization.

(3) A method is proposed for predicting heme-binding residues based on fast adaptive ensemble learning and a ligand-specific strategy. Guided by the properties of heme-binding residues, the method combines amino acid distribution features, motif sequence template features, surface propensity features, and secondary structure features. Feature analysis shows that heme-binding residues are enriched in cysteine and histidine, tend to lie in concave regions of the protein surface, and are concentrated at the junctions between secondary structure elements. Heme-binding residue prediction is also a typical imbalanced-data classification problem. A new fast adaptive ensemble learning algorithm is therefore proposed, which optimizes the sub-classifiers by dynamically monitoring and adjusting the ratio of positive to negative samples in each sub-dataset. The algorithm is fast and adapts well. In addition, a ligand-specific strategy is introduced for the two main types of heme ligand, and it significantly improves prediction accuracy over the traditional general-purpose model. Experiments on the benchmark test set and the independent test set demonstrate, respectively, the method's advantage over other algorithms and its good generalization. The thesis also discusses the potential effect of the positive-to-negative sample ratio in the test set on the algorithm. Finally, an online prediction tool has been released to help biologists analyze heme proteins efficiently.
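
The weighted sliding-window encoding described in result (1) can be illustrated with a minimal sketch. The abstract does not say how the position weights are derived or how windows are handled at the ends of a sequence, so the linear-decay weights and zero padding below are assumptions, and the function names `linear_decay_weights` and `weighted_window_features` are hypothetical rather than taken from the thesis.

```python
from typing import List, Sequence


def linear_decay_weights(half_width: int) -> List[float]:
    """Assumed weighting: 1.0 at the centre, falling linearly with distance."""
    return [1.0 - abs(offset) / (half_width + 1)
            for offset in range(-half_width, half_width + 1)]


def weighted_window_features(
    residue_features: Sequence[Sequence[float]],
    half_width: int,
    weights: Sequence[float],
) -> List[List[float]]:
    """Encode each residue as the weighted features of its sequence window."""
    if len(weights) != 2 * half_width + 1:
        raise ValueError("need exactly one weight per window position")
    n_features = len(residue_features[0])
    zero_pad = [0.0] * n_features  # assumption: pad beyond the termini with zeros
    encoded = []
    for centre in range(len(residue_features)):
        row: List[float] = []
        for slot, offset in enumerate(range(-half_width, half_width + 1)):
            pos = centre + offset
            inside = 0 <= pos < len(residue_features)
            source = residue_features[pos] if inside else zero_pad
            # Scale every feature of this window position by its weight.
            row.extend(weights[slot] * value for value in source)
        encoded.append(row)
    return encoded


# Toy input: 4 residues with 2 features each (values are made up).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = linear_decay_weights(1)
encoded = weighted_window_features(toy, 1, weights)
```

Each row of `encoded` could then serve as the input vector for a regressor such as support vector regression. In the thesis the regressor's parameters are tuned by particle swarm optimization, which this sketch does not cover.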

[Degree-granting institution]: Northeast Normal University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: Q51

