Bayesian-Based Methods for Mass Spectrometry Data Analysis
Keywords: mass spectrometry; proteomics; Bayesian theory; machine learning. Source: East China Normal University, master's thesis, 2012. Document type: degree thesis.
[Abstract]: Genomics, which developed alongside the Human Genome Project, has played an epoch-making role in humanity's exploration of the principles of life. As it developed, however, researchers gradually recognized that studying life at the gene level alone is far from sufficient, and that the phenomena of life must be investigated at a more fundamental level. Proteomics emerged in response. Mass spectrometry is an effective tool that has greatly helped scientists study proteins.

This thesis first introduces the current mainstream workflows and techniques for mass-spectrometry-based protein analysis, together with several commonly used mass-spectrometry-based protein identification algorithms, including those in SEQUEST, MASCOT, and X! Tandem. It summarizes the two strategies for protein quantification, isotope labeling and label-free quantification, and analyzes their differences and respective advantages. It also introduces the algorithms commonly used to discover and identify protein post-translational modifications from mass spectrometry data.

Existing mass-spectrometry-based protein identification algorithms each have their own strengths. We try to integrate them by combining machine learning with naive Bayes theory. The machine learning methods considered include SVM, LDA, logistic regression, KNN, Bayesian belief networks, and artificial neural networks. The classification features include several of the parameters reported by the SEQUEST algorithm. The training data come from 18 groups of mass spectrometry data acquired from known protein mixtures. Machine learning yields the decision boundary of each classifier, and the conditional distributions of the positive and negative samples under the classifier's decision function are then computed. Given these conditional distributions and a new sample's score under the classifier, the posterior probability of a protein identification result can be calculated with the naive Bayes method under a uniform prior. Cross-validation shows that the accuracy of our algorithm is between 80% and 90% while recall reaches 40% to 50%, which gives it fairly good practical value.

Identifying protein post-translational modifications has long been an important area of proteomics research. Mass-spectrometry-based identification of post-translational modifications usually relies on either machine learning or direct comparison against known databases. Database comparison has high time complexity and, because of the large number of comparisons involved, a high false positive rate. We instead first cluster the mass spectrometry data with a clustering algorithm based on projection distance and then identify post-translational modifications on that basis, which both reduces the time complexity of the algorithm and improves its accuracy. The projection direction is computed from known samples using LDA and SVM so that, along this direction, within-class distances are as small as possible and between-class distances are as large as possible. Once the projection direction is obtained, the projection distance between every pair of unknown samples is computed to form a distance matrix, and a standard clustering algorithm is applied directly to that matrix. Each resulting cluster may consist of instances of the same peptide carrying different post-translational modifications, so comparing the results within a cluster reveals possible modifications quickly and efficiently. Under cross-validation on known data, the accuracy and recall of the algorithm are both around 70%.
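The projection-distance clustering described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes two-dimensional feature vectors, uses only the Fisher LDA criterion (the SVM variant is omitted), and replaces the unspecified "standard clustering algorithm" with a simple distance-threshold single-linkage grouping. The function names and the threshold are hypothetical.

```python
import math
from itertools import combinations

def fisher_direction_2d(class_a, class_b):
    """Unit-length Fisher LDA direction w ~ Sw^-1 (mean_a - mean_b) for 2-D points."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        # Entries (sxx, sxy, syy) of the within-class scatter matrix.
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Apply the closed-form inverse of the 2x2 scatter matrix to the mean difference.
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

def projection_distance_matrix(samples, w):
    """Pairwise |w.x_i - w.x_j| between samples projected onto direction w."""
    proj = [w[0] * x + w[1] * y for x, y in samples]
    n = len(proj)
    return [[abs(proj[i] - proj[j]) for j in range(n)] for i in range(n)]

def threshold_clusters(dist, threshold):
    """Single-linkage grouping: samples closer than `threshold` share a cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] < threshold:
            parent[find(i)] = find(j)
    labels = {}
    # Relabel union-find roots as 0, 1, 2, ... in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

In this sketch the known (labeled) samples are used only to learn the direction; the unknown samples are then compared purely through their one-dimensional projections, which is what keeps the pairwise distance computation cheap.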
Since Google put forward the concept of cloud computing, cloud-based applications of all kinds have emerged. Protein mass spectrometry data analysis is high-throughput and parallelizable, so it can be deployed conveniently on a cloud computing platform. We propose two deployment strategies and compare their advantages and disadvantages.
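The posterior-probability step described in the abstract (class-conditional score distributions combined under a uniform prior) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's code: the abstract does not say how the conditional distributions are estimated, so a Gaussian fit is assumed here, and the function names are hypothetical. When several classifier scores are supplied, they are treated as conditionally independent given the class, which is the naive Bayes assumption.

```python
import math
from statistics import mean, stdev

def fit_gaussian(scores):
    """Estimate (mean, sample standard deviation) of classifier scores for one class."""
    return mean(scores), stdev(scores)

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a normal distribution, computed directly to avoid underflow."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def posterior_correct(scores, pos_models, neg_models, prior_pos=0.5):
    """P(identification is correct | scores), with one score per classifier.

    pos_models / neg_models hold one (mean, std) pair per classifier, fitted on
    the scores of known positive / negative training samples.
    prior_pos=0.5 corresponds to the uniform prior mentioned in the abstract.
    """
    log_pos = math.log(prior_pos)
    log_neg = math.log(1.0 - prior_pos)
    for s, (mu_p, sd_p), (mu_n, sd_n) in zip(scores, pos_models, neg_models):
        log_pos += gaussian_log_pdf(s, mu_p, sd_p)
        log_neg += gaussian_log_pdf(s, mu_n, sd_n)
    # Normalise in log space so that very small likelihoods do not underflow.
    m = max(log_pos, log_neg)
    pos = math.exp(log_pos - m)
    neg = math.exp(log_neg - m)
    return pos / (pos + neg)
```

With a uniform prior the prior terms cancel, so the posterior reduces to the positive-class likelihood divided by the sum of the two class likelihoods; a score that is equally likely under both distributions therefore gets a posterior of 0.5.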
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: R346
Article ID: 1452431
Article link: http://sikaile.net/xiyixuelunwen/1452431.html