Research on Several Preprocessing Methods in Data Mining
Topics: rough sets + discretization; Source: China University of Petroleum (Beijing), master's thesis, 2016
[Abstract]: Real-world data are often incomplete and inconsistent, and data preprocessing techniques were developed to improve the quality of data mining. This thesis introduces rough set theory and, on that basis, makes two main contributions. 1. Building on the traditional reduction method based on attribute dependency degree, a more precise concept, the enhanced positive region, is defined. By partitioning the boundary region more precisely, the enhanced dependency degree of the decision attribute on each condition attribute is determined, and a top-down heuristic search algorithm produces the reduct. Experiments on UCI data sets show that, compared with classical methods, REPR reduces the attributes of a decision table more effectively. 2. The discretization problem is first described formally and defined as an optimization problem. Next, based on information entropy, a modified information gain ratio (IIGR) and a statistical similarity measure (SIS) are each defined as objective functions for discretization, and the discretization constraints are given. Finally, a genetic algorithm is used to discretize the continuous attributes. Comparative experiments on UCI data sets show that, in a statistical sense, the proposed method yields fewer discrete intervals, smaller decision trees built from the discretized data, and higher classification accuracy. This indicates that discretizing multiple continuous attributes in parallel under an optimization objective takes the relationships among attributes into account and makes discretization more effective.
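For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of the classical rough-set positive region and dependency degree, gamma_C(D) = |POS_C(D)| / |U|. It is not the thesis's enhanced positive region or the REPR algorithm, whose definitions the abstract does not give. The function names and the toy decision table are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(rows, cond_attrs, decision):
    """Return indices of rows whose condition class has one decision value."""
    pos = set()
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            pos.update(block)
    return pos

def dependency(rows, cond_attrs, decision):
    """Classical dependency degree: |POS_C(D)| / |U|."""
    return len(positive_region(rows, cond_attrs, decision)) / len(rows)

# Hypothetical toy decision table: condition attributes a, b; decision d.
TOY_TABLE = [
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 1, "d": "no"},   # conflicts with the row above
]

if __name__ == "__main__":
    for attrs in (["a", "b"], ["a"], ["b"]):
        print(attrs, dependency(TOY_TABLE, attrs, "d"))
```

In this toy table the last two rows agree on every condition attribute but disagree on the decision, so they fall in the boundary region rather than the positive region. A dependency-based reduction method looks for a smaller attribute subset whose dependency degree matches that of the full condition set.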
[Degree-granting institution]: China University of Petroleum (Beijing)
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13; TP18