Research on Protein Subcellular Localization Prediction Based on Similarity-Alignment-Improved KNN

Published: 2019-05-07 09:03

[Abstract]: A protein's function is closely tied to the subcellular compartment in which it resides, so predicting a protein's subcellular localization helps reveal its function and is of considerable value to biological research. Determining localization experimentally is slow and costly and does not scale to large numbers of protein sequences, so an efficient computational prediction method is needed. This thesis introduces feature extraction algorithms for protein sequences, improves the traditional K-nearest-neighbor (KNN) classifier, and proposes a subcellular localization prediction algorithm based on similarity-alignment-improved KNN; ensemble prediction with AdaBoost and Bagging gives good experimental results. The main work is as follows. Three feature extraction algorithms are described: amino acid composition, dipeptide composition, and pseudo amino acid composition. In addition to the public datasets ZD98 and CH317, a new dataset, Gram1253, was constructed. The traditional KNN classifier is improved by using BLAST alignment to find the most similar sequences, which then drive the KNN decision; the resulting algorithm is called similarity-alignment KNN. Under the jackknife test on the three datasets, its success rates are 93.9%, 91.5%, and 92.5%, respectively. The Hadoop distributed computing framework is then introduced to optimize the algorithm.
To study the prediction algorithm further, multiple similarity-alignment KNN classifiers are combined with AdaBoost and with Bagging to predict the subcellular compartment of protein sequences. Under the jackknife test on the three datasets, the AdaBoost ensemble reaches success rates of 94.9%, 92.4%, and 93.1%, respectively. Because the compartment distributions of ZD98 and CH317 are imbalanced, the Bagging ensemble is less accurate on them than similarity-alignment KNN alone, at 89.8% and 87.7%. On Gram1253, however, Bagging performs well, reaching 92.9%. The results indicate that AdaBoost and Bagging ensemble classification is a fairly effective approach to protein subcellular localization prediction.
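The abstract names the building blocks but gives no implementation. The following is a minimal Python sketch, under stated assumptions, of three of them: amino acid composition, dipeptide composition, and a nearest-neighbor decision evaluated with a jackknife (leave-one-out) test. It is not the thesis's code. In particular, `difflib.SequenceMatcher` is swapped in as a simple stand-in for BLAST alignment scores, the decision uses only the single most similar sequence, and the sequences and labels in `TOY_DATA` are invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue frequencies."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-residue-pair frequencies."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def similarity(a, b):
    """Assumed stand-in for a BLAST score: a string-matching ratio in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def predict(query, train):
    """Return the label of the training sequence most similar to the query."""
    _, best_label = max(train, key=lambda item: similarity(query, item[0]))
    return best_label


def jackknife_accuracy(data):
    """Leave each sequence out in turn and predict it from the rest."""
    hits = 0
    for i, (seq, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(seq, rest) == label
    return hits / len(data)


# Hypothetical toy data: invented sequences and compartment labels.
TOY_DATA = [
    ("MKKLLLAAAGLL", "membrane"),
    ("MKKLLLAAAGLV", "membrane"),
    ("DEEDKRDEEDKR", "nuclear"),
    ("DEEDKRDEEDKK", "nuclear"),
]

if __name__ == "__main__":
    print(jackknife_accuracy(TOY_DATA))
    print(len(amino_acid_composition(TOY_DATA[0][0])))
    print(len(dipeptide_composition(TOY_DATA[0][0])))
```

In a real pipeline the `similarity` function would be replaced by scores from an actual BLAST search, and several such classifiers could be wrapped in an AdaBoost or Bagging ensemble as the abstract describes; neither step is shown here.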
[Degree-granting institution]: Nanjing Agricultural University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: Q51;TP301.6

