Research on a Cost-Sensitive Classification Algorithm Based on ProGEP
Published: 2018-05-10 21:42
Topics: data mining + cost-sensitive; Source: Anhui University of Finance and Economics, master's thesis, 2015
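【Abstract】: In recent years, data mining has been widely applied in marketing, business management, enterprise crisis management, product manufacturing, and the Internet. The volume of unused data stored on computers worldwide is still growing rapidly, and data types and structures are becoming more complex, which poses serious challenges for reducing mining costs and improving algorithm performance. Improving the workflow and running efficiency of mining algorithms is therefore important for obtaining satisfactory mining results efficiently. This thesis studies gene expression programming (GEP), a newer algorithm derived from the genetic algorithms commonly used in data mining, and improves on it: it proposes and designs the ProGEP algorithm, applies it to cost-sensitive classification, and designs and implements the CSC-ProGEP algorithm. The main work covers four aspects:
1. It reviews the research status of GEP and cost-sensitive learning algorithms in China and abroad, outlines the structure and workflow of the GEP algorithm, and briefly describes several commonly used cost-sensitive classification algorithms.
2. It improves GEP and proposes ProGEP. First, basic GEP evaluates a chromosome by repeatedly traversing its expression tree, which is inefficient. After studying the idea behind a popular improvement, the gene reading operator, the thesis proposes reverse Polish expression with stack-based evaluation (RPE_SD): a single post-order traversal of the expression tree yields the reverse Polish expression, which is then stored and computed with a linear stack structure that can be read repeatedly, making chromosome evaluation more efficient. Second, basic GEP specifies no concrete method for generating constant parameters and generates its initial population completely at random. The thesis argues that reasonable constant parameters are necessary and that inserting superior individuals into the population helps the early stage of evolution, and proposes rough multiple linear regression initialization with adaptively corrected constants (RMLR_AC). This method introduces the coefficients of all variables obtained by multiple regression into the gene expression structure of the chromosome as constants, and corrects these coefficient constants during evolution. Third, the thesis observes that in basic GEP some chromosomes in the evolving population share the same genotype, and defines the concepts of duplicate chromosomes and implicitly duplicate chromosomes. It identifies the causes of this phenomenon, its adverse effects on gene-fragment diversity and evolutionary efficiency, and its harmful assimilation of other individuals in the population. It then proposes an algorithm for eliminating (implicitly) duplicate individuals (DSC), and improves the GEP selection process by creating a copy of the population for a second round of selection (CPCSC). Finally, after further observation of the population structure, the thesis identifies and defines same-family chromosomes and lineage faults in GEP. These phenomena obstruct the exchange of gene fragments across the whole population and cause the evolutionary result to converge to a local optimum. To avoid this, the thesis proposes thread-based periodic population diversity differentiation (TM_PDI) to improve the evolutionary process, and gives an algorithm for initializing the sub-thread populations (SHS_RRI) that sorts the main thread's population, clones it in segments, and supplements it with randomized individuals. Combining basic GEP with these four improvements, the thesis proposes and describes the ProGEP algorithm.
3. It applies ProGEP to cost-sensitive classification. The CSC-ProGEP algorithm is obtained by constructing a cost-sensitive matrix and incorporating it into the fitness function of ProGEP. After describing the algorithm's workflow, the thesis gives a method for judging classification performance on the rare class.
4. It builds the experimental environment and verifies and applies the algorithms. Because the gene evaluation algorithm, the selection process, and the evolutionary process of basic GEP were all modified, the thesis implements the basic GEP model structure and the ProGEP improvements in C# with Microsoft Visual Studio 2012, using an object-oriented design, so that algorithm details can be described conveniently and experimental statistics computed flexibly. The experiments verify the performance of ProGEP and the application results of CSC-ProGEP. To observe the gain from each improvement independently, the four improvements are introduced into GEP step by step, and the results before and after each introduction are compared over repeated runs. After the effectiveness of ProGEP is verified, five UCI data sets are selected for CSC experiments with 10-fold cross-validation, and the resulting classifiers are compared with classifiers trained by other classification algorithms. The experiments show that on cost-sensitive classification problems CSC-ProGEP, compared with traditional classification algorithms (C4.5, BN, BP) and a cost-sensitive classification algorithm (AdaCost), achieves higher recall and precision on the rare class while maintaining classification accuracy.
The significance of this research is twofold. On the one hand, it refines and advances the theory of the GEP algorithm: the improvements to chromosome evaluation efficiency, population structure, and the evolutionary process enrich its theoretical study. On the other hand, it extends the practical application of GEP: the CSC-ProGEP mining experiments validate the ProGEP algorithm, which offers some guidance for rare-class mining applications such as predicting whether a patient has a disease and guarding against fraudulent customers.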
[Abstract]: In recent years, data mining has been widely used in marketing, business management, enterprise crisis management, product manufacturing, and the Internet.
This thesis studies and improves gene expression programming (GEP), a newer algorithm derived from the genetic algorithms commonly used in data mining. It proposes and designs the ProGEP algorithm, applies it to cost-sensitive classification, and designs and implements the CSC-ProGEP algorithm. The main work covers four aspects:
1. The research status of GEP and cost-sensitive learning algorithms in China and abroad is reviewed, the structure and workflow of the GEP algorithm are outlined, and several commonly used cost-sensitive classification algorithms are briefly described.
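2. The GEP algorithm is improved and the ProGEP algorithm is proposed. First, basic GEP evaluates a chromosome by repeatedly traversing its expression tree, which is inefficient. After studying the idea behind a popular improvement, the gene reading operator, the thesis proposes reverse Polish expression with stack-based evaluation (RPE_SD): a single post-order traversal of the expression tree yields the reverse Polish expression, which is then stored and computed with a linear stack structure that can be read repeatedly.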
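The idea behind RPE_SD can be illustrated with a minimal Python sketch. It is not the thesis's C# implementation: the `Node` class, the operator set, and the function names are assumptions made for illustration. It shows one post-order traversal producing a postfix (reverse Polish) token list, which is then evaluated with a stack once per data sample without touching the tree again.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Binary operators only, to keep the sketch short (an assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Node:
    """Hypothetical expression-tree node: an operator or a variable name."""
    symbol: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def to_postfix(node):
    """Post-order traversal: left subtree, right subtree, then the node."""
    if node is None:
        return []
    return to_postfix(node.left) + to_postfix(node.right) + [node.symbol]


def eval_postfix(tokens, env):
    """Evaluate a postfix token list for one sample using a stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(env[tok])  # variable lookup for this sample
    return stack.pop()


# Expression tree for (a + b) * (a - c)
tree = Node("*",
            Node("+", Node("a"), Node("b")),
            Node("-", Node("a"), Node("c")))

postfix = to_postfix(tree)  # computed once per chromosome
samples = [{"a": 1.0, "b": 2.0, "c": 3.0},
           {"a": 4.0, "b": 0.5, "c": 1.0}]
results = [eval_postfix(postfix, s) for s in samples]  # reused per sample
print(postfix, results)
```

The tree is traversed once, and the flat token list is reused for every training sample, which is where the saving over repeated tree traversal would come from.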
Second, basic GEP specifies no concrete method for generating constant parameters and generates its initial population completely at random. The thesis argues that reasonable constant parameters are necessary and that inserting superior individuals into the population helps the early stage of evolution, and proposes rough multiple linear regression initialization with adaptively corrected constants (RMLR_AC). This method introduces the coefficients of all variables obtained by multiple regression into the gene expression structure of the chromosome as constants, and corrects these coefficient constants during evolution.
Third, the thesis observes that in basic GEP some chromosomes in the evolving population share the same genotype, and defines the concepts of duplicate chromosomes and implicitly duplicate chromosomes. It identifies the causes of this phenomenon, its adverse effects on gene-fragment diversity and evolutionary efficiency, and its harmful assimilation of other individuals in the population. It then proposes an algorithm for eliminating (implicitly) duplicate individuals (DSC), and improves the GEP selection process by creating a copy of the population for a second round of selection (CPCSC).
Finally, after further observation of the population structure, the thesis identifies and defines same-family chromosomes and lineage faults in GEP. These phenomena obstruct the exchange of gene fragments across the whole population and cause the evolutionary result to converge to a local optimum. To avoid this, the thesis proposes thread-based periodic population diversity differentiation (TM_PDI) to improve the evolutionary process, and gives an algorithm for initializing the sub-thread populations (SHS_RRI) that sorts the main thread's population, clones it in segments, and supplements it with randomized individuals. Combining basic GEP with these four improvements, the thesis proposes and describes the ProGEP algorithm.
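The duplicate-elimination step (DSC) from the third improvement can be sketched as follows. The abstract does not define "implicitly duplicate", so this sketch assumes one plausible reading: two single-gene chromosomes are implicit duplicates when their expressed prefixes (the coding region of the K-expression in standard GEP) are identical, even if the unexpressed tails differ. The function names, the single-character symbol encoding, and the operator set are hypothetical.

```python
# Arity of each function symbol; any other symbol is a terminal (arity 0).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}


def coding_length(gene):
    """Length of the expressed prefix of a K-expression.

    Standard GEP rule: start with one open slot; each symbol fills one
    slot and opens as many new slots as its arity.
    """
    needed = 1
    for i, sym in enumerate(gene):
        needed += ARITY.get(sym, 0) - 1
        if needed == 0:
            return i + 1
    raise ValueError("gene does not encode a complete expression")


def drop_duplicates(population):
    """Keep the first chromosome for each distinct expressed prefix.

    Exact duplicates share the whole string, so they are removed too.
    Assumption: only the expressed prefix decides duplication.
    """
    seen = set()
    unique = []
    for gene in population:
        key = gene[:coding_length(gene)]
        if key not in seen:
            seen.add(key)
            unique.append(gene)
    return unique


population = [
    "+ab*abba",  # expresses a + b
    "+ab-baab",  # different tail, still expresses a + b
    "*a+bcabc",  # expresses a * (b + c)
    "+ab*abba",  # exact copy of the first chromosome
]
unique = drop_duplicates(population)
print(unique)
```

How the thesis refills the slots freed by removed individuals is not stated in the abstract, so the sketch stops at the filtered list.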
3. ProGEP is applied to cost-sensitive classification. The CSC-ProGEP algorithm is obtained by constructing a cost-sensitive matrix and incorporating it into the fitness function of ProGEP. After describing the algorithm's workflow, the thesis gives a method for judging classification performance on the rare class.
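A small Python sketch shows how a cost matrix might enter a fitness function. The abstract does not give the formula used in CSC-ProGEP, so the normalization below (one minus total cost divided by worst-case cost) is an assumption, as are the cost values and the function names.

```python
def total_cost(y_true, y_pred, cost):
    """Sum of cost[actual][predicted] over all samples."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))


def cost_sensitive_fitness(y_true, y_pred, cost):
    """Fitness in [0, 1]; higher is better.

    Assumed form: 1 - total_cost / worst_case_cost, where the worst
    case misclassifies every sample in the most expensive way.
    """
    worst = sum(max(cost[t]) for t in y_true)
    if worst == 0:
        return 1.0
    return 1.0 - total_cost(y_true, y_pred, cost) / worst


# Hypothetical costs: class 1 is the rare class; missing it costs 5,
# a false alarm on class 0 costs 1, correct predictions cost nothing.
cost = [[0, 1],
        [5, 0]]

y_true = [0, 0, 0, 1, 1]
pred_a = [0, 0, 0, 0, 0]  # ignores the rare class, accuracy 3/5
pred_b = [0, 1, 1, 1, 1]  # finds the rare class, accuracy 3/5

fit_a = cost_sensitive_fitness(y_true, pred_a, cost)
fit_b = cost_sensitive_fitness(y_true, pred_b, cost)
print(fit_a, fit_b)
```

Both candidate classifiers have the same accuracy, but the one that recovers the rare class receives the higher fitness, which is the behavior a cost-sensitive fitness function is meant to reward.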
4. The experimental environment is built and the algorithms are verified and applied. So that algorithm details can be described conveniently and experimental statistics computed flexibly, the basic GEP model structure and the ProGEP improvements are implemented in C# with Microsoft Visual Studio 2012, using an object-oriented design. The experiments verify the performance of ProGEP and the application results of CSC-ProGEP. After the effectiveness of ProGEP is verified, five UCI data sets are selected for CSC experiments with 10-fold cross-validation. The results show that, compared with traditional classification algorithms (C4.5, BN, BP) and a cost-sensitive classification algorithm (AdaCost), CSC-ProGEP achieves higher recall and precision on the rare class while maintaining classification accuracy.
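The rare-class measures named above, recall and precision, can be computed as in this minimal Python sketch. The labels are made up for illustration and are not results from the thesis; the function name and the choice of label 1 for the rare class are assumptions.

```python
def rare_class_metrics(y_true, y_pred, rare=1):
    """Recall and precision for the rare class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == rare and p == rare)
    fn = sum(1 for t, p in pairs if t == rare and p != rare)
    fp = sum(1 for t, p in pairs if t != rare and p == rare)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        # share of true rare samples that were found
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # share of rare predictions that were right
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Illustrative labels only: 8 majority samples, 2 rare samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

metrics = rare_class_metrics(y_true, y_pred)
print(metrics)
```

Reporting accuracy alongside the two rare-class measures matters because a classifier can reach high accuracy on imbalanced data while finding few or none of the rare samples.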
The significance of this research is twofold. On the one hand, it refines and advances the theory of the GEP algorithm: the improvements to chromosome evaluation efficiency, population structure, and the evolutionary process enrich its theoretical study.
On the other hand, it extends the practical application of GEP. The CSC-ProGEP mining experiments validate the ProGEP algorithm, which offers some guidance for rare-class mining applications such as predicting whether a patient has a disease and guarding against fraudulent customers.
【Degree-granting institution】: Anhui University of Finance and Economics
【Degree level】: Master's
【Year degree conferred】: 2015
【Classification number】: TP181; TP311.13
Article ID: 1871016
Article link: http://sikaile.net/guanlilunwen/yingxiaoguanlilunwen/1871016.html