Selective Bagging Ensemble Algorithm Based on Particle Swarm Optimization for the Analysis of 1H NMR Metabolomics Data from Lung Cancer Serum
Published: 2018-09-14 07:38
[Abstract]: As one of the omics disciplines that study the full set of biomolecules, metabolomics takes as its object a large number of metabolites, including small molecules such as certain amino acids, lipids, and organic acids, and analyzes their changes and metabolic pathways as a whole, which produces variable-rich data. Mining the potentially key information from such multidimensional, complex data is the central task of metabolomics data analysis. Selective ensemble learning algorithms select a subset of learners from a large pool to take part in the ensemble, with the aim of better generalization performance and higher prediction efficiency; they offer a new route to improving ensemble learning and have attracted growing attention in recent years. In this thesis, in view of the inherent characteristics of metabolomics data, the strengths and weaknesses of the bootstrap aggregating algorithm (Bagging), and the strong optimization capability of particle swarm optimization (PSO), we introduce PSO to improve the performance of Bagging and develop a selective Bagging algorithm. The algorithm is used to improve the stability and generalization ability of two base learners, the classification tree (CT) and partial least squares-discriminant analysis (PLS-DA), which yields two new methods for metabolomics data analysis. Two studies were carried out.

(1) According to the generalization error/bias decomposition theory of ensemble learning, increasing the diversity among sub-models while maintaining their accuracy can further improve the performance of an ensemble. In this chapter, a PSO-based selective Bagging algorithm is therefore proposed and used to improve the stability and generalization performance of CT, an unstable pattern recognition technique, giving a new metabolomics data analysis technique, PSOBAGCT. The algorithm first uses Bagging to generate a series of highly diverse CT models (bagged classification trees, BAGCT). The PSO objective function is then designed to account for both the error of the ensemble model and the diversity among sub-models, and PSO selects a subset of accurate and mutually diverse sub-models for the final ensemble. Finally, plurality voting produces the output of the ensemble. PSOBAGCT was applied to 1H NMR metabolomics data from three groups of serum samples: healthy volunteers, newly diagnosed lung cancer patients, and lung cancer patients who relapsed after treatment. BAGCT and CT were applied to the same data to verify the performance of the new algorithm. The results show that Bagging significantly improves the recognition performance and stability of a single classification tree, and that PSOBAGCT, through the introduction of PSO, generalizes markedly better than BAGCT. The algorithm also identified significant metabolites that distinguish lung cancer patients from healthy subjects, such as lipids, lactate, glycoproteins, alanine, threonine, inositol, 3-hydroxybutyrate, dimethylamine, glutamine, proline, and trimethylamine.

(2) Considering the strengths and weaknesses of PLS-DA in metabolomics data analysis, this chapter takes PLS-DA as the base learner and applies the selective Bagging (PSOBAG) algorithm developed in Chapter 2 to improve its recognition performance, forming another new metabolomics data analysis method: PSO-based selective Bagging partial least squares-discriminant analysis (PSOBAGPLS-DA). By introducing PSO, the method selects among all PLS-DA models trained by Bagging (bagged PLS-DA, BAGPLS-DA). PSOBAGPLS-DA, together with BAGPLS-DA and PLS-DA, was likewise applied to the 1H NMR metabolomics data from lung cancer serum. The study shows that BAGPLS-DA significantly improves the recognition performance of PLS-DA, and that the selective Bagging algorithm developed by introducing PSO improves modeling performance further. PSOBAGPLS-DA also identified key lung cancer serum metabolic markers with significant differences, including lipids, lactate, glycoproteins, alanine, threonine, inositol, glutamine, proline, trimethylamine, and choline.
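The pipeline described in part (1), Bagging to build a pool of sub-models, PSO to pick a subset, then plurality voting, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the objective function or the PSO variant, so the fitness used here (validation error of the voted ensemble minus `lam` times the mean pairwise disagreement), the binary PSO with a sigmoid transfer function, and all parameter values are assumptions. The base learner `fit_stump` is a one-split stump standing in for a full classification tree, and the data are synthetic points, not NMR spectra. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split tree, a simplified stand-in for a full classification tree."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or err < best[0]:
                best = (err, j, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        maj = Counter(y).most_common(1)[0][0]
        return (0, -math.inf, maj, maj)
    return best[1:]

def predict(stump, row):
    j, t, l_lab, r_lab = stump
    return l_lab if row[j] <= t else r_lab

def bagging(X, y, n_models, rng):
    """Train each sub-model on a bootstrap resample (drawn with replacement)."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def vote(models, row):
    """Plurality vote over the sub-models' predictions."""
    return Counter(predict(m, row) for m in models).most_common(1)[0][0]

def fitness(mask, preds, y_val, lam=0.2):
    """Assumed objective: ensemble error minus lam * mean pairwise disagreement."""
    chosen = [p for p, bit in zip(preds, mask) if bit]
    if not chosen:
        return math.inf
    n = len(y_val)
    votes = [Counter(p[i] for p in chosen).most_common(1)[0][0] for i in range(n)]
    error = sum(v != t for v, t in zip(votes, y_val)) / n
    pairs = [(a, b) for k, a in enumerate(chosen) for b in chosen[k + 1:]]
    if not pairs:
        return error
    diversity = sum(sum(u != v for u, v in zip(a, b)) / n for a, b in pairs) / len(pairs)
    return error - lam * diversity

def pso_select(preds, y_val, rng, n_particles=12, n_iter=30, w=0.7, c1=1.5, c2=1.5):
    """Binary PSO over 0/1 masks; particle 0 starts as the full Bagging ensemble."""
    d = len(preds)
    pos = [[1] * d] + [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_particles - 1)]
    vel = [[0.0] * d for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(p, preds, y_val) for p in pos]
    g = min(range(n_particles), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for k in range(d):
                v = (w * vel[i][k] + c1 * rng.random() * (pbest[i][k] - pos[i][k])
                     + c2 * rng.random() * (gbest[k] - pos[i][k]))
                vel[i][k] = max(-6.0, min(6.0, v))
                # Sigmoid of the velocity gives the probability that the bit is 1.
                pos[i][k] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][k])) else 0
            f = fitness(pos[i], preds, y_val)
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest

# Synthetic three-class toy data (NOT the thesis's NMR spectra).
rng = random.Random(0)
centers = {"healthy": (0.0, 0.0), "new": (3.0, 0.0), "relapsed": (0.0, 3.0)}
X, y = [], []
for label, (cx, cy) in centers.items():
    for _ in range(30):
        X.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        y.append(label)
order = list(range(len(X)))
rng.shuffle(order)
X_tr, y_tr = [X[i] for i in order[:60]], [y[i] for i in order[:60]]
X_val, y_val = [X[i] for i in order[60:]], [y[i] for i in order[60:]]
models = bagging(X_tr, y_tr, n_models=15, rng=rng)
preds = [[predict(m, row) for row in X_val] for m in models]
mask = pso_select(preds, y_val, rng)
selected = [m for m, bit in zip(models, mask) if bit]
print(sum(mask), "of", len(models), "sub-models kept:", vote(selected, (3.0, 0.2)))
```

Seeding one particle with the all-ones mask is a design choice of this sketch: it means the selected subset can never score worse on the assumed objective than the unselected Bagging ensemble. A held-out test set, separate from the validation data used for selection, would still be needed to compare generalization fairly.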
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: R734.2
Article ID: 2242018
Article link: http://sikaile.net/yixuelunwen/zlx/2242018.html