A Study of the Relationship Between the Accuracy of Machine-Learning-Based Protein-Protein Interaction Prediction and the Data Set

Published: 2019-03-05 15:32

[Abstract]: Machine learning studies how computers can simulate or carry out human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to keep improving their own performance; it is a fundamental route to making computers intelligent. It is widely applied in data mining, computer vision, biometric recognition, search engines, medical diagnosis, and other fields. Proteins play an important role in the life activities of cells and are the final executors of cellular activity and function. Proteins carry out their functions through interactions with one another, and protein-protein interactions underlie the normal physiological functioning of all organisms. Because experimental methods for determining protein-protein interactions have limitations, researchers in recent years have combined machine learning with biological information such as protein structure to predict these interactions, and have proposed many prediction methods with differing reported accuracy. We found that the accuracy of most of these methods is biased. Using human and yeast protein-protein interaction data sets together with several encoding methods, this thesis studies how the accuracy of machine-learning-based interaction prediction relates to the sample repetition of the data set.

The main work is as follows. Constructing positive and negative data sets is the basis of predicting protein-protein interactions with machine learning. First, adjacency matrices from graph theory and the maximum matching method are used to construct two classes of positive and negative data sets for human and for yeast, from which the data sets used for machine learning are built. Each data set in the two classes has a different sample repetition rate, which makes it possible to analyze the relationship between prediction accuracy and sample repetition. Next, four encoding methods (auto covariance, local descriptor, pseudo-amino acid composition, and triad) are used to encode the constructed data, and two machine learning methods, k-nearest neighbors and random forest, are trained on the encoded data and used for prediction. Finally, the prediction results are analyzed in detail.

The experimental results show that, for each machine learning method and each of the four encoding methods, prediction accuracy differs when the repetition rate of protein samples in the positive and negative data sets differs: as the repetition rate goes from high to low, the prediction accuracy changes correspondingly. We therefore conclude that sample repetition in the positive and negative data sets directly affects the prediction accuracy of machine learning methods, and that it should be taken into account when analyzing their prediction results.
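To make the idea of sample repetition concrete, the following is a minimal Python sketch, not the thesis's actual procedure. The abstract does not define the repetition rate, so the formula here (one minus the ratio of distinct proteins to protein slots) is an assumption. The function names and the toy protein identifiers are hypothetical. The selection step is a simple greedy maximal matching, used as a stand-in for the maximum matching method named in the abstract; it guarantees that no protein is reused, but it does not guarantee the largest possible set of pairs.

```python
def protein_repetition_rate(pairs):
    """Share of protein slots that repeat a protein already in the data set.

    Assumed definition (not taken from the thesis):
    1 - (number of distinct proteins / number of protein slots).
    """
    slots = [protein for pair in pairs for protein in pair]
    if not slots:
        return 0.0
    return 1.0 - len(set(slots)) / len(slots)


def greedy_matching(pairs):
    """Keep pairs in input order so that every protein is used at most once.

    This is a greedy maximal matching, a simplification of the maximum
    matching named in the abstract.
    """
    used = set()
    matched = []
    for a, b in pairs:
        if a != b and a not in used and b not in used:
            matched.append((a, b))
            used.update((a, b))
    return matched


# Toy interaction pairs with hypothetical protein identifiers.
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P4", "P5"), ("P1", "P4")]
matched_pairs = greedy_matching(pairs)

print(protein_repetition_rate(pairs))          # 0.5
print(matched_pairs)                           # [('P1', 'P2'), ('P4', 'P5')]
print(protein_repetition_rate(matched_pairs))  # 0.0
```

Under this assumed definition, a data set built from a matching has a repetition rate of zero, while the full pair list has a higher one. Training the same classifier on data sets at both ends of this range is one way to observe the dependence of accuracy on repetition that the abstract reports.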
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: Q51;TP181

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2435058.html
