Research on Protein Subcellular Localization Prediction Based on KNN Improved by Similarity Alignment
Published: 2019-05-07 09:03
[Abstract]: A protein's function is closely tied to the subcellular compartment in which it resides, so predicting a protein's subcellular location helps reveal its functional information and is of considerable value to biological research. Determining subcellular location experimentally is slow and costly, and it does not scale to large numbers of protein sequences, so an efficient prediction method is needed. This thesis introduces feature extraction algorithms for protein sequences, improves the traditional k-nearest neighbor (KNN) classifier, and proposes a protein subcellular location prediction algorithm based on KNN improved by similarity alignment. Ensemble prediction with AdaBoost and Bagging gives good experimental results. The main work is as follows. Three feature extraction algorithms are described: amino acid composition, dipeptide composition, and pseudo amino acid composition. In addition to the public datasets ZD98 and CH317, a new dataset, Gram1253, was constructed. The traditional KNN classifier is improved by using BLAST alignment to find the most similar sequences, which then complete the KNN decision. The resulting algorithm, similarity-alignment KNN, reaches success rates of 93.9%, 91.5%, and 92.5% on the three datasets under the Jackknife test. The Hadoop distributed computing framework is then introduced to optimize the algorithm.
To study the prediction algorithm further, AdaBoost and Bagging are used to combine multiple similarity-alignment KNN classifiers into ensembles that predict the subcellular location of protein sequences. Under the Jackknife test on the three datasets, AdaBoost reaches success rates of 94.9%, 92.4%, and 93.1%. Because the classes in ZD98 and CH317 are unevenly distributed, the Bagging ensemble is less accurate on them than similarity-alignment KNN, at 89.8% and 87.7% respectively. On Gram1253, however, Bagging performs well, with a prediction accuracy of 92.9%. The results indicate that AdaBoost and Bagging ensemble classification is a reasonably effective approach to protein subcellular location prediction.
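The feature extraction and Jackknife evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, and plain Euclidean distance over composition vectors stands in for the BLAST similarity search, which the abstract names but does not specify. Pseudo amino acid composition, Hadoop, and the ensemble steps are left out.

```python
from collections import Counter
from itertools import product
import math

# The 20 standard amino acids, in a fixed order that defines vector positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return a 20-dim vector: the fraction of each standard residue in seq."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dim vector: the fraction of each ordered residue pair."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = sum(pairs[k] for k in keys)
    return [pairs[k] / total if total else 0.0 for k in keys]


def jackknife_accuracy(sequences, labels, featurize=amino_acid_composition):
    """Leave-one-out test: predict each sequence from all the others.

    Assumption: the nearest neighbor is found by Euclidean distance between
    feature vectors, as a stand-in for a BLAST similarity search.
    """
    feats = [featurize(s) for s in sequences]
    correct = 0
    for i, query in enumerate(feats):
        # Exclude the query itself, then take the single closest sequence.
        nearest = min(
            (j for j in range(len(feats)) if j != i),
            key=lambda j: math.dist(query, feats[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(feats)
```

Calling `jackknife_accuracy(sequences, labels)` on a list of sequences and their compartment labels returns the fraction predicted correctly when each sequence is held out in turn, which is the kind of success rate the abstract reports.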
[Degree-granting institution]: Nanjing Agricultural University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: Q51; TP301.6
Article ID: 2470952
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2470952.html