Research on a Colorectal Cancer Prediction Model Based on Feature Selection and Ensemble Learning
Keywords: colorectal cancer; feature selection; ensemble learning; HELM algorithm. Source: Southwest University, 2017 master's thesis. Document type: degree thesis
[Abstract]: Colorectal cancer (CRC) is one of the most common and most dangerous malignant tumors worldwide. Its high-incidence regions are concentrated in economically developed Western countries, mainly in Europe and North America, New Zealand, and Australia. Although China has traditionally been a low-incidence region, the incidence of colorectal cancer in China is rising year by year as lifestyles and dietary habits become increasingly Westernized, seriously endangering people's health and affecting their quality of life. Despite being one of the most harmful tumors globally, its etiology and pathogenesis are still not fully understood. A large body of epidemiological research indicates that the development of colorectal cancer is a complex process, influenced not only by environmental, genetic, and other factors individually but possibly also by interactions among them. However, there is still no consensus on exactly which environmental factors, genetic factors, or interactions affect the occurrence and progression of the disease. Building a colorectal cancer prediction model to study the effects of multiple factors such as environment, diet, and genetic susceptibility is therefore of considerable importance. Using colorectal cancer case-control sample data provided by the Third Military Medical University, this thesis applies machine learning methods to build a colorectal cancer prediction model, providing a reliable basis for early diagnosis and prevention. The main work is as follows.

1. A feature selection method that works from several angles is proposed. Because the data are high-dimensional, the thesis reduces dimensionality in two ways to lower the computational complexity of the model: the Relief feature selection algorithm and a correlation test. Relief computes a weight for each feature; features with small weights are removed and features with large weights are retained to form a feature subset. A correlation analysis is then run on this subset: for each highly correlated pair of features, only the feature with the larger weight is kept and the one with the smaller weight is removed. The result is a subset of high-weight, mutually uncorrelated features, called the optimal feature subset.

2. A hybrid ensemble learning model (HELM) is proposed. HELM builds on the classic ensemble learning algorithm AdaBoost. To improve AdaBoost's generalization ability, the thesis studies how to increase the diversity among AdaBoost's base classifiers and proposes the HELM method. HELM combines homogeneous and heterogeneous ensemble methods: base classifiers of different types are each used to train an AdaBoost homogeneous ensemble classifier, and these AdaBoost ensembles are then combined, as base classifiers, into the final HELM model. The results show that HELM performs well.

3. A CRC prediction model is built. The model has four parts. (1) Data collection and preprocessing, done in two steps: first the data are cleaned (noise removal, handling of missing values, and so on); then, under the guidance of professors and experts in colorectal cancer research at the Third Military Medical University, the data are categorized from a biological perspective, dividing the more than one hundred sample attributes into four categories: genetic loci (SNPs), demographic characteristics, lifestyle, and food. (2) Feature selection: features are selected from two perspectives, their contribution to classification (Relief feature selection) and the redundancy among features (correlation test), to obtain the optimal features. (3) Classification and prediction: the proposed HELM algorithm is used to classify the data. (4) Comparative analysis: HELM is compared with related algorithms.

In summary, this thesis combines Relief-based feature selection with correlation-test-based feature selection and, using the proposed HELM algorithm, builds a CRC prediction model that can effectively predict colorectal cancer. Comparison with related algorithms shows that the model has good stability and generalization ability. In future work, the model could be applied to etiological studies of other complex diseases.
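The two-stage feature selection described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which Relief variant, which correlation statistic, or which thresholds the thesis uses, so the sketch assumes classic two-class Relief, Pearson correlation, and hypothetical parameters `weight_min` and `corr_max`. The toy data is invented for the example.

```python
import math
import random


def relief_weights(X, y, n_iter=100, seed=0):
    """Two-class Relief: reward features that differ on the nearest miss,
    penalise features that differ on the nearest hit."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Scale every feature difference by the feature's range.
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n_iter
    return w


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = math.sqrt(sum((p - ma) ** 2 for p in a))
    sb = math.sqrt(sum((q - mb) ** 2 for q in b))
    return cov / (sa * sb) if sa and sb else 0.0


def select_features(X, y, weight_min=0.0, corr_max=0.8):
    """Assumed thresholds: keep weight > weight_min, then drop the
    lower-weight feature of any pair with |r| > corr_max."""
    w = relief_weights(X, y)
    cols = [[row[j] for row in X] for j in range(len(X[0]))]
    # Stage 1: Relief filter, ranked by descending weight.
    ranked = sorted((j for j in range(len(w)) if w[j] > weight_min),
                    key=lambda j: -w[j])
    # Stage 2: greedy correlation filter; higher-weight features win.
    kept = []
    for j in ranked:
        if all(abs(pearson(cols[j], cols[k])) <= corr_max for k in kept):
            kept.append(j)
    return kept, w


# Toy data: feature 0 is informative, feature 1 is a scaled copy of it,
# feature 2 is an alternating pattern unrelated to the class.
f0 = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]
noise = [0, 1, 0, 1, 0, 1, 0, 1]
X = [[a, 2 * a, b] for a, b in zip(f0, noise)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

kept, w = select_features(X, y)
print("kept feature indices:", kept)
```

On this toy data only one of the two redundant informative features survives, and the uninformative feature is filtered out by its Relief weight. Visiting features in descending weight order is one simple way to guarantee that the higher-weight member of each correlated pair is the one retained.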
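The HELM idea from the abstract (train one AdaBoost ensemble per base-classifier type, then combine those ensembles) can be sketched in the same spirit. The abstract does not name the base classifier types or the rule used to combine the AdaBoost ensembles, so everything specific here is an assumption: decision stumps and a weighted nearest-centroid classifier stand in for the "different types" of base classifier, and the final combination simply sums the normalised AdaBoost scores. Labels are assumed to be +1 and -1.

```python
import math


def fit_stump(X, y, w):
    """Weighted decision stump: best (feature, threshold, sign) by weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for s in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (s if row[j] > t else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    _, j, t, s = best
    return lambda row: s if row[j] > t else -s


def fit_centroid(X, y, w):
    """Weighted nearest-centroid classifier (a second, different learner type)."""
    def centroid(label):
        tot = sum(wi for yi, wi in zip(y, w) if yi == label) or 1e-12
        return [sum(wi * row[j] for row, yi, wi in zip(X, y, w) if yi == label) / tot
                for j in range(len(X[0]))]

    pos, neg = centroid(1), centroid(-1)

    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return lambda row: 1 if d2(row, pos) <= d2(row, neg) else -1


def adaboost(X, y, fit_weak, rounds=10):
    """Discrete AdaBoost over one learner type; returns a score in [-1, 1]."""
    n = len(X)
    w = [1.0 / n] * n
    members = []
    for _ in range(rounds):
        h = fit_weak(X, y, w)
        err = sum(wi for row, yi, wi in zip(X, y, w) if h(row) != yi)
        if err >= 0.5:  # no better than chance: stop boosting
            break
        alpha = 0.5 * math.log((1 - err + 1e-10) / (err + 1e-10))
        members.append((alpha, h))
        if err == 0:  # perfect weak learner: nothing left to reweight
            break
        w = [wi * math.exp(-alpha * yi * h(row)) for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    norm = sum(a for a, _ in members) or 1.0
    return lambda row: sum(a * h(row) for a, h in members) / norm


def helm(X, y, weak_learners, rounds=10):
    """Hybrid ensemble: one AdaBoost per learner type, combined by summed score
    (the combination rule is an assumption, not taken from the thesis)."""
    ensembles = [adaboost(X, y, fit, rounds) for fit in weak_learners]
    return lambda row: 1 if sum(e(row) for e in ensembles) >= 0 else -1


# Invented toy data with labels in {-1, +1}.
X = [[0.1, 0.9], [0.2, 0.4], [0.3, 0.7], [0.4, 0.2],
     [0.6, 0.8], [0.7, 0.3], [0.8, 0.6], [0.9, 0.1]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]

model = helm(X, y, [fit_stump, fit_centroid])
print([model(row) for row in X])
```

Each AdaBoost ensemble is homogeneous (all its members come from one learner type), while the outer combination is heterogeneous, which mirrors the structure the abstract describes. A real implementation would evaluate on held-out data rather than the training set used in this toy run.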
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: R735.34; TP181
Article No.: 1491939
Article link: http://sikaile.net/yixuelunwen/zlx/1491939.html