Statistical Learning Models for Analyzing the Effect of Protein Expression on Breast Cancer Cell Proliferation
Published: 2018-09-04 09:40
[Abstract]: As everyday exposure to harmful substances becomes more frequent, the incidence of cancer is gradually rising. In the era of big data, selecting the informative part of large and complex datasets has become very important, and statistical learning methods, which are well suited to extracting useful information, have become an important research topic. This thesis studies reverse-phase protein array (RPPA) data and cell proliferation data measured on a set of MDA-MB-231 breast cancer cells at MD Anderson. These data are used to train linear regression, support vector machine (SVM), and random forest (RF) models in order to find the key proteins that control breast cancer cell proliferation; these key proteins are then treated as potential targets for cancer drugs. The data fluctuate considerably, so to reduce their effect on statistical power the RPPA data are preprocessed first. The preprocessed RPPA data then serve as the input and cell proliferation as the output for training the linear regression, SVM, and RF models. For the linear regression model, a method that combines principal component analysis (PCA) with linear regression is proposed and used. Finally, the results of the three models are compared to identify a model that is both accurate and able to select the protein combinations with the greatest influence. The results show that the linear regression model is highly accurate, that the SVM model can select the protein combinations that play a key role in breast cancer cell proliferation, and that RF performs very well in both respects. RF is then used to analyze the RPPA data, yielding 28 proteins with a large effect on breast cancer cells; a literature search confirms that 21 of them strongly affect breast cancer cell proliferation.
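The combination of PCA with linear regression mentioned in the abstract is commonly known as principal component regression. The following is a minimal sketch of that general technique, not the thesis's actual implementation: the abstract does not give the preprocessing steps, the number of components, or the data, so the standardization step, the choice of three components, the function names `fit_pcr` and `predict_pcr`, and the synthetic stand-in for RPPA and proliferation values are all assumptions made for illustration.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Principal component regression: PCA on standardized X, then OLS on the scores."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant columns
    Z = (X - mean) / std                     # assumed preprocessing: z-score each protein
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    scores = Z @ components.T                # samples projected onto the components
    # Ordinary least squares on the component scores plus an intercept.
    design = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Map component coefficients back to one coefficient per standardized protein.
    protein_coef = components.T @ beta[1:]
    return {"mean": mean, "std": std, "intercept": beta[0], "coef": protein_coef}

def predict_pcr(model, X):
    """Predict the response for new samples using a fitted PCR model."""
    Z = (X - model["mean"]) / model["std"]
    return model["intercept"] + Z @ model["coef"]

# Synthetic stand-in data (hypothetical): 60 samples, 20 correlated "proteins"
# driven by 3 latent factors, with "proliferation" depending on those factors.
rng = np.random.default_rng(0)
n_samples, n_proteins, n_latent = 60, 20, 3
factors = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_latent, n_proteins))
X = factors @ loadings + 0.05 * rng.normal(size=(n_samples, n_proteins))
y = factors @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n_samples)

model = fit_pcr(X, y, n_components=n_latent)
pred = predict_pcr(model, X)
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"in-sample R^2: {r2:.3f}")
```

Because protein measurements tend to be strongly correlated, regressing on a few principal components rather than on every protein can stabilize the fit. In practice the number of components would be chosen by cross-validation on held-out samples rather than fixed in advance as it is here.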
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: R737.9; Q811.4
Article ID: 2221705
Article link: http://sikaile.net/yixuelunwen/swyx/2221705.html