Research and Application of Cloud-Computing-Based Bayesian Algorithms in Disease Prediction
Topic: disease prediction. Focus: Bayesian classification. Source: University of Science and Technology of China, master's thesis, 2016
[Abstract]: Disease diagnosis is an important topic in medicine. Medical institutions have accumulated ever larger volumes of patient visit sample data. Manual disease classification and prediction on these samples is constrained by subjective factors such as experience and decision-making ability, so errors are hard to avoid, and both accuracy and efficiency leave considerable room for improvement. Traditional Chinese medicine (TCM) disease prediction theory emphasizes that health is closely tied to the internal and external environment. However, the joint class-attribute probability of a Bayesian classifier grounded in probability and statistics is hard to estimate accurately, and classification algorithms that rely on single-machine memory cannot process large-scale sample sets within the expected time. An ideal classification model should fully express the association between sample features and disease categories while improving classification performance and scalability. To address these shortcomings, this thesis makes the following improvements.
First, from the perspective of local learning, it proposes an improved naive Bayes classification algorithm that weights instances by cosine similarity (IWIMNB). The algorithm builds a high-quality classifier in a local region of the training set and uses the local training samples to weaken the attribute conditional independence assumption. It measures the distance between the sample under test and each training sample by cosine similarity, then uses that similarity as the instance weight when training the parameters of the modified naive Bayes model. Comparative experiments show that IWIMNB is practical to apply and achieves better classification performance.
Second, from the perspective of structure extension, it applies association rules to a weighted averaged one-dependence Bayesian model (AR-WAODE), which accounts for dependencies among attributes other than the common parent node and for the differing contributions of the individual AODE models to classification. To generate association rules more efficiently, it proposes a distributed frequent itemset mining algorithm based on matrix pruning (DFIMA). DFIMA aims to reduce both the useless candidate itemsets produced by the Apriori algorithm and the file system I/O load: a 2-candidate-itemset matrix prunes the computation that generates (k+1)-frequent itemsets, and the improved algorithm is implemented on Spark, an in-memory iterative computing framework. Comparative experiments show that DFIMA reduces the useless candidate itemsets generated during iteration and performs well in speedup and scalability.
Third, it implements the AR-WAODE classification algorithm on the Hadoop framework (Hadoop-AR-WAODE) to speed up training of the model parameters. The algorithm consists of a preprocessing job, a classifier training job, and a prediction job. Comparative experiments show that Hadoop-AR-WAODE improves prediction performance by considering dependencies among attributes other than the common parent node and the differing contributions of the individual AODE models to the classification result, and that classification efficiency on large-scale sample sets improves substantially.
Finally, Hadoop-AR-WAODE is applied to the practical problem of disease classification and prediction. Guided by the conclusions of a preliminary data analysis of the original sample set, a disease classification model is designed and implemented. The model takes meridian values, measurements of facial, tongue, and pulse signs, and meteorological data as input and outputs the disease category. Comparative experiments show that, because disease prediction theory is still immature, the model's classification performance is limited. Nevertheless, the model offers good processing efficiency and scalability and has some reference value for the field of disease prediction.
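The instance-weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's IWIMNB algorithm, whose details the abstract does not give. It assumes categorical attributes coded as integers, computes cosine similarity on their one-hot encodings, and uses Laplace smoothing; the function names (`one_hot`, `cosine_weights`, `weighted_nb_predict`) are hypothetical.

```python
import numpy as np

def one_hot(X, n_values):
    """Concatenate one-hot encodings of each integer-coded categorical column."""
    return np.hstack([np.eye(k)[X[:, j]] for j, k in enumerate(n_values)])

def cosine_weights(X_train, x_query, n_values):
    """Cosine similarity between the query and every training instance."""
    A = one_hot(X_train, n_values)
    q = one_hot(x_query[None, :], n_values)[0]
    return A @ q / (np.linalg.norm(A, axis=1) * np.linalg.norm(q))

def weighted_nb_predict(X_train, y_train, x_query, n_values, n_classes):
    """Naive Bayes whose counts are replaced by similarity-weighted counts.

    Assumption: training instances closer to the query contribute more to
    the prior and conditional estimates; Laplace smoothing avoids zeros.
    """
    w = cosine_weights(X_train, x_query, n_values)
    log_post = np.zeros(n_classes)
    for c in range(n_classes):
        in_c = y_train == c
        wc = w[in_c].sum()
        log_post[c] = np.log((wc + 1.0) / (w.sum() + n_classes))
        for j, k in enumerate(n_values):
            match = in_c & (X_train[:, j] == x_query[j])
            log_post[c] += np.log((w[match].sum() + 1.0) / (wc + k))
    return int(np.argmax(log_post))

# Toy data: the class label follows attribute 0.
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y_train = np.array([0, 0, 1, 1, 0, 1])
print(weighted_nb_predict(X_train, y_train, np.array([0, 1]), [2, 2], 2))
```

Because the weights depend on the query, the model is re-estimated for every sample to be classified, which is the usual cost of local learning.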
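The matrix-pruning step mentioned for DFIMA can be sketched in a similar spirit. The abstract does not specify the exact scheme, so this sketch assumes the matrix records which item pairs are frequent and that a (k+1)-candidate is counted only if the new item forms a frequent pair with every item already in the itemset. It runs in a single process rather than on Spark, and the name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(items):
        s = frozenset(items)
        return sum(1 for t in transactions if s <= t)

    all_items = sorted({i for t in transactions for i in t})
    freq_items = [i for i in all_items if support([i]) >= min_support]
    result = {frozenset([i]): support([i]) for i in freq_items}

    # Pair matrix: pair_ok[a][b] is True when items a and b form a frequent pair.
    n = len(freq_items)
    pair_ok = [[False] * n for _ in range(n)]
    level = []
    for a, b in combinations(range(n), 2):
        count = support([freq_items[a], freq_items[b]])
        if count >= min_support:
            pair_ok[a][b] = pair_ok[b][a] = True
            result[frozenset([freq_items[a], freq_items[b]])] = count
            level.append((a, b))

    # Extend each frequent k-itemset (a sorted index tuple) by one larger index.
    while level:
        next_level = []
        for itemset in level:
            for j in range(itemset[-1] + 1, n):
                # Pruning: skip the support scan unless every pair is frequent.
                if not all(pair_ok[i][j] for i in itemset):
                    continue
                candidate = itemset + (j,)
                count = support([freq_items[i] for i in candidate])
                if count >= min_support:
                    result[frozenset(freq_items[i] for i in candidate)] = count
                    next_level.append(candidate)
        level = next_level
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(len(frequent_itemsets(baskets, min_support=3)))
```

The pair lookup is a constant-time check, so a candidate that fails it never triggers a pass over the transactions. That avoided pass is the saving the pruning is meant to deliver.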
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: O212.8
[Author]: Fu Huanhuan