
A Study of the Relationship Between the Accuracy of Machine-Learning-Based Protein-Protein Interaction Prediction and the Data Set

Published: 2019-03-05 15:32
[Abstract]: Machine learning studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to keep improving their own performance; it is the fundamental route to making computers intelligent. Machine learning is widely applied in data mining, computer vision, biometric recognition, search engines, medical diagnosis, and other fields. Proteins play an important role in the life activities of cells and are the final executors of cell activity and function. Proteins carry out their functions through interactions with one another, and protein-protein interactions are the basis on which all organisms maintain normal physiological function. Given the limitations of determining protein interactions experimentally, researchers in recent years have used machine learning methods, combined with biological information such as protein structure, to predict protein-protein interactions, and have proposed many prediction methods with differing prediction accuracy. We find that the accuracy of most of these prediction methods is biased.

This thesis uses human and yeast protein-protein interaction data sets, combined with several encoding methods, to study how the accuracy of machine learning algorithms in predicting protein-protein interactions relates to the sample repetition of the data set. The main contents are as follows.

Constructing positive and negative data sets is the basis of predicting protein interactions with machine learning. First, the adjacency matrix from graph theory and the maximum matching method are used to construct two classes of positive and negative data sets for human and for yeast, from which the data sets used for machine learning are built. Each data set in the two classes has a different sample repetition rate, which makes it possible to analyze the relationship between prediction accuracy and the sample repetition of the data set. Next, four encoding methods (auto covariance, local descriptor, pseudo-amino acid composition, and triad) are used to encode the constructed data, and two machine learning methods, k-nearest neighbors and random forest, are trained on the encoded data and used for prediction. Finally, the prediction results are analyzed in detail.

The experimental results show that, for each machine learning method and each of the four encoding methods, the prediction accuracy differs when the repetition rate of protein samples in the positive and negative data sets differs: as the repetition rate of protein samples in the data set goes from high to low, the corresponding prediction accuracy changes with it. We therefore conclude that the repetition of samples in the positive and negative data sets directly affects the prediction accuracy of machine learning methods, and that this repetition must be taken into account when analyzing the prediction results of such methods.
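The data-set construction step can be illustrated with a small Python sketch. It is not the thesis's procedure: the abstract does not define the sample repetition rate, so the definition below (one minus the number of distinct proteins divided by the number of protein slots) is an assumption, and a simple greedy maximal matching stands in for the maximum matching method. The protein names and the helper names `repetition_rate`, `greedy_matching`, and `non_interacting_pairs` are hypothetical.

```python
from itertools import combinations


def repetition_rate(pairs):
    """Assumed definition: 1 - (distinct proteins) / (2 * number of pairs).

    0.0 means every protein appears in exactly one pair.
    """
    if not pairs:
        return 0.0
    distinct = {protein for pair in pairs for protein in pair}
    return 1.0 - len(distinct) / (2 * len(pairs))


def greedy_matching(pairs):
    """Greedy maximal matching (a stand-in, not a true maximum matching).

    A pair is kept only if neither of its proteins has been used already,
    so the selected pairs share no protein.
    """
    used, kept = set(), []
    for a, b in pairs:
        if a != b and a not in used and b not in used:
            kept.append((a, b))
            used.update((a, b))
    return kept


def non_interacting_pairs(pairs):
    """Candidate negatives: protein pairs with no edge in the interaction graph.

    The set of frozensets plays the role of the adjacency matrix.
    """
    adjacency = {frozenset(pair) for pair in pairs}
    proteins = sorted({protein for pair in pairs for protein in pair})
    return [
        (a, b)
        for a, b in combinations(proteins, 2)
        if frozenset((a, b)) not in adjacency
    ]


# Toy interaction list with hypothetical protein names.
positives = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P3", "P4"), ("P5", "P6")]
low_repeat = greedy_matching(positives)
negatives = non_interacting_pairs(positives)

print(repetition_rate(positives), repetition_rate(low_repeat))
```

In this toy example the full positive list reuses P1 three times, while the matched subset uses each protein once. Comparing classifiers trained on the two kinds of set is the sort of experiment the abstract describes.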
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: Q51;TP181




Article ID: 2435058


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2435058.html

