Research on Test-Cost-Sensitive Bayesian Classifiers
Published: 2018-12-24 08:26
[Abstract]: Classification, a central research topic in data mining and machine learning, has attracted wide attention and has been applied in practical domains such as customer churn prediction, intrusion detection, medical diagnosis, and text classification. Traditional classification research and its applications usually assume that instance data is already stored in a database, or that it can be obtained free of charge and used at will, so the goal is to build a classification model that maximizes prediction accuracy; common methods include Bayesian networks, decision trees, artificial neural networks, and support vector machines. In most real applications, however, this assumption does not hold: acquiring each attribute value of an instance incurs a cost (money, time, risk, etc.), referred to as the test cost. For traditional algorithms to be turned into usable practical systems, researchers must therefore try to minimize the test cost a model requires in addition to maximizing its classification accuracy, which makes test-cost-sensitive learning important. Test-cost-sensitive learning must optimize two objectives at once, classification accuracy and test cost, and is thus a typical multi-objective optimization problem. It can be solved either by applying a multi-objective optimization algorithm to obtain the set of Pareto-optimal (frontier) solutions, or by converting the problem into a single-objective one. The latter strategy can be realized in two ways: 1) convert the multi-objective problem into a single-objective constrained optimization problem, treating classification accuracy as the constraint and test cost as the objective function; 2) combine the multiple objectives into one new objective function, merging classification accuracy and test cost into a single objective over which the optimal solution is searched.

In recent years, as data in the information society has grown explosively, data dimensionality has also grown exponentially; an excess of attributes not only increases an algorithm's storage consumption and time complexity, but large numbers of irrelevant or redundant attributes can even reduce its final classification accuracy. Attribute selection, which chooses the best attribute subset from the original attribute space for the learning algorithm, has become one of the main directions for improving the naive Bayes classifier, and a large body of results shows that it significantly improves the classifier's accuracy. However, existing work rarely combines the attribute selection problem of naive Bayes with the test-cost-sensitivity problem to study a test-cost-sensitive naive Bayes classifier in its own right. Taking the naive Bayes classifier as the basic research object, this thesis uses the two conversion methods above to transform the multi-objective optimization problem of test-cost-sensitive learning into a single-objective one, and proposes a constrained-optimization-based test-cost-sensitive naive Bayes classifier (COTCSNB) and an optimization-objective-based test-cost-sensitive naive Bayes classifier (OOTCSNB). Experiments on the WEKA platform show that both new algorithms maintain high classification accuracy while minimizing the model's test cost; finally, several medical diagnosis problems are used to study in detail how the new algorithms transfer to practical problems and how they perform there.

The main innovations and contributions of the thesis are: 1) A new constrained-optimization-based test-cost-sensitive naive Bayes classifier (COTCSNB). In a traditional greedy search strategy, each attribute selection step picks the single attribute that most improves classifier accuracy, aiming ultimately to maximize classification performance. In cost-sensitive learning, COTCSNB instead takes "deleting an attribute must not reduce the model's classification accuracy" as its constraint: at every step of a backward attribute selection search it deletes the attribute with the largest test cost that satisfies the constraint, and stops when deleting any remaining attribute would violate it (see the first sketch below). 2) A test-cost-sensitive wrapper attribute selection learning framework is given; a new test-cost-sensitive attribute selection objective function is then proposed by taking the difference between a classification accuracy measure and a test cost measure; and, based on the new objective function and an optimal search strategy, a new optimization-objective-based test-cost-sensitive naive Bayes classifier (OOTCSNB) is proposed (see the second sketch below). 3) The test-cost problem of obtaining pathology values in medical diagnosis is analyzed, and the application of the new algorithms (COTCSNB, OOTCSNB) to practical problems is examined on real medical diagnosis tasks for heart disease, hepatitis, diabetes, and thyroid disease; the experimental results show that the new algorithms significantly reduce the test cost required in the medical diagnosis process while maintaining classification accuracy.
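The COTCSNB contribution describes a backward greedy elimination under a no-accuracy-loss constraint. The following is a minimal sketch of that search under stated assumptions, not the thesis implementation: scikit-learn's GaussianNB stands in for the WEKA naive Bayes classifier, 5-fold cross-validated accuracy is used as the accuracy estimate, and `costs` is a per-attribute test-cost array; all names are illustrative.

```python
# Minimal sketch of a COTCSNB-style backward search (assumptions: X is a
# numeric NumPy array, GaussianNB replaces WEKA's naive Bayes, and 5-fold
# cross-validated accuracy approximates the model's classification accuracy).
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def cotcsnb_select(X, y, costs, cv=5):
    """Repeatedly delete the costliest attribute whose removal does not
    reduce accuracy; stop when every remaining deletion would violate
    the no-accuracy-loss constraint."""
    selected = list(range(X.shape[1]))
    acc = cross_val_score(GaussianNB(), X[:, selected], y, cv=cv).mean()
    while len(selected) > 1:
        # Consider candidates from the most expensive remaining attribute down.
        candidates = sorted(selected, key=lambda j: costs[j], reverse=True)
        removed = False
        for j in candidates:
            trial = [k for k in selected if k != j]
            trial_acc = cross_val_score(GaussianNB(), X[:, trial], y, cv=cv).mean()
            if trial_acc >= acc:          # constraint: accuracy must not drop
                selected, acc, removed = trial, trial_acc, True
                break
        if not removed:                   # every deletion violates the constraint
            break
    return selected, acc
```

Because the accuracy estimate is the constraint rather than the objective, the search is free to remove expensive attributes aggressively as long as the estimated accuracy never decreases.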
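OOTCSNB, as summarized in contribution 2, scores attribute subsets by the difference between an accuracy measure and a test-cost measure. The sketch below illustrates one such combined objective inside a wrapper-style forward greedy search; the cost normalization and the greedy procedure are assumptions for illustration only, since the abstract does not spell out the exact measures or the optimal search strategy used in the thesis.

```python
# Sketch of an OOTCSNB-style combined objective (accuracy minus a test-cost
# term scaled to [0, 1]) with an assumed forward greedy search.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def ootcsnb_select(X, y, costs, cv=5):
    total_cost = float(sum(costs))

    def objective(subset):
        acc = cross_val_score(GaussianNB(), X[:, subset], y, cv=cv).mean()
        cost = sum(costs[j] for j in subset) / total_cost   # normalized cost
        return acc - cost                                   # accuracy minus cost

    selected, best = [], float("-inf")
    remaining = set(range(X.shape[1]))
    while remaining:
        # Add the attribute that most improves the combined objective.
        j_best, obj_best = max(
            ((j, objective(selected + [j])) for j in remaining),
            key=lambda t: t[1],
        )
        if obj_best <= best:              # no attribute improves the objective
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = obj_best
    return selected, best
```

The single scalar objective makes the accuracy/cost trade-off explicit: an attribute is only kept if the accuracy it adds outweighs its share of the total test cost.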
[Degree-granting institution]: China University of Geosciences
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP18
Article ID: 2390398
Link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2390398.html