Research on the Relationship Between Prediction Accuracy and Data Sets in Machine-Learning-Based Protein-Protein Interaction Prediction
Published: 2019-03-05 15:32
[Abstract]: Machine learning studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so that performance keeps improving. It is a fundamental route to making computers intelligent, and it is widely used in data mining, computer vision, biometric recognition, search engines, medical diagnosis, and other fields. Proteins play an important role in the life activities of cells and are the final executors of cell activity and function. Proteins carry out their functions by interacting with one another, and protein-protein interactions are the basis on which all organisms maintain normal physiological function. Because experimental methods for measuring protein interactions have limitations, researchers in recent years have used machine learning together with biological information such as protein structure to predict protein-protein interactions, and have proposed many prediction methods with differing prediction accuracy. We find that the reported accuracy of most of these methods is biased.

Using human and yeast protein-protein interaction data sets together with several encoding methods, this thesis studies how the accuracy of machine-learning prediction of protein-protein interactions relates to the sample repetition of the data set. The main contents are as follows.

Constructing positive and negative data sets is the foundation of predicting protein interactions with machine learning. First, the adjacency matrix and maximum matching methods from graph theory are used to construct two classes of positive and negative data sets for human and for yeast, from which the data sets used for machine learning are built. Each data set in the two classes has a different sample repetition rate, which makes it possible to analyze the relationship between prediction accuracy and sample repetition. The constructed data are then encoded with four encoding methods: auto covariance, local descriptor, pseudo-amino acid composition, and triad (triplet). Two machine learning methods, k-nearest neighbors and random forest, are trained on the encoded data and used for prediction. Finally, the prediction results are analyzed in detail.

The experimental results show that, for each machine learning method and each of the four encoding methods, prediction accuracy differs when the repetition rate of protein samples in the positive and negative data sets differs: as the repetition rate goes from high to low, the corresponding prediction accuracy changes with it. We therefore conclude that sample repetition in the positive and negative data sets directly affects the prediction accuracy of machine learning methods, and that it should be taken into account when analyzing their prediction results.
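The abstract does not define the sample repetition rate or spell out the matching procedure, so the following Python sketch is only an illustration under stated assumptions. It assumes the repetition rate is the fraction of protein slots in a pair list that are occupied by a protein already seen elsewhere in the list, and it swaps in a simple greedy maximal matching for the maximum matching named in the abstract. The function names and the protein identifiers are hypothetical.

```python
def repetition_rate(pairs):
    """Assumed definition: share of protein slots filled by a repeated protein.

    0.0 means every protein appears exactly once; values near 1.0 mean
    a few proteins dominate the pair list.
    """
    slots = [protein for pair in pairs for protein in pair]
    if not slots:
        return 0.0
    return 1.0 - len(set(slots)) / len(slots)


def greedy_matching(pairs):
    """Keep a pair only if neither of its proteins has been used yet.

    The result is a maximal matching (no further pair can be added),
    which is not necessarily a maximum matching. Self-interactions are
    skipped because a matching edge needs two distinct endpoints.
    """
    used = set()
    kept = []
    for a, b in pairs:
        if a != b and a not in used and b not in used:
            kept.append((a, b))
            used.update((a, b))
    return kept


# Hypothetical positive interaction pairs; names are placeholders.
positive_pairs = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P2", "P3"),
    ("P4", "P5"),
    ("P1", "P4"),
]

# The full list reuses proteins heavily; the matched subset reuses none.
low_repetition = greedy_matching(positive_pairs)

print("all pairs:    ", repetition_rate(positive_pairs))
print("matched pairs:", low_repetition, repetition_rate(low_repetition))
```

Under these assumptions, the matched subset has each protein appearing in at most one pair, which gives the lowest possible repetition rate, while the unfiltered list represents the high-repetition end. Comparing classifier accuracy on data sets built both ways is the kind of experiment the abstract describes.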
[Degree-granting institution]: South China University of Technology (华南理工大学)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification codes]: Q51; TP181
Article ID: 2435058
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2435058.html