
Improvement and Application of a Naive Bayes Algorithm Based on Attribute Selection and Weighting

Published: 2018-04-26 18:18

Topic: data mining; naive Bayes. Source: Master's thesis, Xi'an University of Technology, 2017


[Abstract]: With the spread of information technology and the arrival of the big-data era, the demand for deep data analysis keeps growing, and data mining is an effective tool for turning information into knowledge. The naive Bayes algorithm is one of the ten classic data-mining algorithms selected by an authoritative international data-mining conference. The naive Bayes model originates in classical probability theory and offers a solid mathematical foundation and stable classification efficiency; it requires few parameters to estimate, is not very sensitive to missing data, and is relatively simple. In theory it attains the minimum error rate among classifiers, but this relies on the assumption that attributes are conditionally independent, which often fails in practice: when there are many attributes, or strong correlations among them, model performance degrades. This thesis addresses that weakness by improving naive Bayes in two respects, attribute selection and attribute weighting. For attribute selection, an information-value index is first introduced to obtain a first attribute subset highly correlated with the class; redundant attributes are then filtered from that subset to obtain a second subset, and a naive Bayes classifier is built on each. Analysis shows that the classifier built after this two-stage selection over the initial attribute set both reduces dimensionality and improves classification accuracy. For attribute weighting, the analytic hierarchy process (AHP) is used to quantify expert knowledge and adjust the weights learned from the training samples, yielding more comprehensive weights; the posterior probabilities in the naive Bayes classification formula are then weighted according to the importance of each attribute value, improving accuracy. Combining the strengths of both ideas, the proposed selectively weighted naive Bayes first performs the two-stage information-value selection on the initial attribute set, then computes weights with AHP and builds a weighted naive Bayes classifier on the optimal attribute subset; the approach is validated experimentally on public benchmark data sets. Finally, the improved algorithm is applied to a spam-SMS sender identification model for the telecom industry, and experiments on the Spark platform demonstrate its effectiveness, further improving the results and the technical means of spam governance.
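The core idea above, keeping only class-relevant attributes and raising each conditional probability to an attribute-specific weight, can be sketched as follows. This is a minimal illustration, not the thesis's method: the thesis uses an "information value" index and AHP-derived weights, whereas here mutual information stands in for both the relevance score and the weights, and the tiny data set, attribute names, and class labels are invented for the demo.

```python
from collections import Counter, defaultdict
from math import log

def mutual_information(xs, ys):
    """I(X;Y) in nats for two discrete sequences, used here as a relevance score."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

class WeightedNB:
    """Weighted naive Bayes over discrete attributes:
    score(c) = log P(c) + sum_j w_j * log P(x_j | c), Laplace-smoothed."""
    def fit(self, X, y, w):
        self.w, self.classes = w, sorted(set(y))
        self.priors = {c: y.count(c) / len(y) for c in self.classes}
        self.n_vals = [len({row[j] for row in X}) for j in range(len(X[0]))]
        self.cond = defaultdict(Counter)  # (attr index, class) -> value counts
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.cond[(j, c)][v] += 1
        return self

    def predict(self, row):
        def score(c):
            s = log(self.priors[c])
            for j, v in enumerate(row):
                cnt = self.cond[(j, c)]
                p = (cnt[v] + 1) / (sum(cnt.values()) + self.n_vals[j])  # Laplace
                s += self.w[j] * log(p)
            return s
        return max(self.classes, key=score)

# Toy data: (sender age, sending frequency) -> spam / ham; illustrative only.
X = [("new", "high"), ("new", "high"), ("old", "low"),
     ("old", "low"), ("new", "low"), ("old", "high")]
y = ["spam", "spam", "ham", "ham", "ham", "spam"]

mi = [mutual_information([row[j] for row in X], y) for j in range(2)]
weights = [m / sum(mi) for m in mi]  # normalised relevance weights
clf = WeightedNB().fit(X, y, weights)
print(clf.predict(("new", "high")))  # -> spam
```

In this toy data the "sending frequency" attribute correlates with the class perfectly, so it receives nearly all of the weight; an irrelevant attribute would be down-weighted toward zero, which is the effect the thesis seeks from its selection-plus-weighting combination.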
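The AHP step that turns expert judgements into numeric attribute weights can also be sketched. This is a standard textbook AHP computation (row geometric means approximating the principal eigenvector, plus a consistency check), not code from the thesis; the pairwise comparison matrix below is an invented example.

```python
from math import prod

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}  # random consistency index

def ahp_weights(M):
    """Weights from a pairwise comparison matrix via row geometric means,
    with the consistency ratio CR = ((lambda_max - n)/(n - 1)) / RI[n]."""
    n = len(M)
    gm = [prod(row) ** (1.0 / n) for row in M]       # row geometric means
    w = [g / sum(gm) for g in gm]                    # normalise to weights
    # estimate lambda_max from M @ w to check judgement consistency
    lmax = sum(sum(M[i][j] * w[j] for j in range(n)) / w[i]
               for i in range(n)) / n
    cr = ((lmax - n) / (n - 1)) / RI[n] if n > 2 else 0.0
    return w, cr

# Example judgement: attribute A is twice as important as B, four times C.
M = [[1,   2,   4],
     [1/2, 1,   2],
     [1/4, 1/2, 1]]
w, cr = ahp_weights(M)
print([round(x, 3) for x in w], round(cr, 3))  # weights sum to 1; CR ~ 0
```

A CR below 0.1 is conventionally taken to mean the expert's pairwise judgements are consistent enough to use; the resulting weights would then scale the conditional probabilities as in the weighted classifier.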
[Degree-granting institution]: Xi'an University of Technology
[Degree level]: Master
[Year conferred]: 2017
[CLC number]: TP18



Article ID: 1807117


Link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1807117.html

