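Research on Associative Classification Algorithms and Their Application to Mining Massive Chronic-Disease Medical Data
Topic: associative classification + Hadoop; Source: Beijing University of Posts and Telecommunications, master's thesis, 2016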
[Abstract]: Associative classification combines association rule mining with classification. It first uses association rule mining to generate class association rules, then builds a classifier from those rules. Compared with traditional classifiers such as decision trees and neural networks, it offers high classification accuracy and an easily understood model, which makes it particularly suitable for settings such as medical data mining, where the classification model must be easy to interpret and apply. Chronic diseases such as hypertension and cardiovascular and cerebrovascular disease do great harm to human health, so it is worthwhile to use data mining to build classification and decision models for chronic disease that support disease prediction and assisted diagnosis. However, chronic-disease data contain many numerical attributes, and those attributes differ widely in importance; both characteristics make existing associative classification techniques perform poorly. To address them, this thesis proposes a fuzzy weighted associative classification algorithm based on information gain ratio to improve classification accuracy. It also parallelizes and optimizes the single-node associative classification algorithm to improve its scalability and meet the need for efficient processing of massive data.
The work covers the design of the fuzzy weighted associative classification algorithm, the design of a chronic-disease data mining scheme, the parallelization of the algorithm, and performance evaluation. First, fuzzy sets and information gain ratio are combined in the GRWFAC algorithm, which improves classifier performance. Next, for the scenario of cardiovascular disease risk prediction, a mining scheme for massive chronic-disease data is designed together with the model's input and output parameters. Finally, the parallel associative classification algorithm MRWFAC is redesigned and implemented on the Hadoop distributed platform, and mining experiments on massive chronic-disease data are carried out to verify the performance gains.
The results confirm the feasibility of the chronic-disease data mining scheme and the improvement in algorithm performance. Compared with the C4.5 and CBA algorithms, GRWFAC achieves higher accuracy and stability, and the speedup and scalability evaluations of the parallel MRWFAC algorithm show that it is suited to massive chronic-disease data. These results are of positive significance for chronic disease prevention, treatment, and assisted diagnosis.
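The abstract does not give the formulas behind GRWFAC, so the following is only a minimal sketch of the standard information gain ratio (as used in C4.5) and of one way such ratios could be turned into attribute weights. The function names, the toy data, and the normalization that makes the weights sum to 1 are assumptions made for illustration, not the thesis's method.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain ratio of one discrete attribute with respect to the class."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy after splitting on the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

def attribute_weights(columns, labels):
    """Assumed normalization: scale gain ratios so the weights sum to 1."""
    ratios = {name: gain_ratio(col, labels) for name, col in columns.items()}
    total = sum(ratios.values())
    return {name: (r / total if total > 0 else 0.0) for name, r in ratios.items()}

# Toy, made-up data: 'bp' separates the classes perfectly, 'sex' carries no information.
labels = ["sick", "sick", "healthy", "healthy"]
columns = {
    "bp": ["high", "high", "normal", "normal"],
    "sex": ["m", "f", "m", "f"],
}
weights = attribute_weights(columns, labels)
print(weights)  # {'bp': 1.0, 'sex': 0.0}
```

In a weighted associative classifier, weights of this kind would typically scale each attribute's contribution to a rule's support, so that rules built on more informative attributes are favored. Numerical attributes would first need to be discretized or mapped to fuzzy sets, which this sketch does not cover.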
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: R-05;TP311.13