
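Research and Application of Rough Clustering Algorithms for High-Dimensional Categorical Data Sets

Published: 2018-05-07 19:31

  Topic of this thesis: information entropy + weighted overlap distance; Reference: Dalian Maritime University, master's thesis, 2017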


[Abstract]: Cluster analysis is one of the key techniques in data mining, and the data it handles can be numerical, categorical, or mixed. For numerical data, clustering algorithms have achieved remarkable results. For categorical data, however, geometric distances in the traditional sense cannot be computed, so many problems remain open, such as designing a reasonable dissimilarity function and finding an effective mechanism for initializing the clustering. The big-data era has produced massive high-dimensional data sets with tens, hundreds, or even thousands of attributes. Such data are often incomplete, imprecise, and inconsistent, and traditional clustering algorithms struggle to meet their clustering needs, yet the ever-growing volume of data carries more valuable information. How to extract useful information from high-dimensional data has become a frontier research topic in cluster analysis, and designing a "distance" measure for high-dimensional data is a particularly demanding task. The most common approaches to high-dimensional clustering are dimensionality reduction and subspace clustering. Dimensionality reduction is a particularly effective way to handle cluster analysis of high-dimensional data; its methods mainly comprise feature transformation and feature selection, the latter being a common dimensionality reduction technique in data mining.

So far, the initialization problem for categorical data has received little study. If the initial cluster centers are chosen poorly, the algorithm not only fails to find the best clusters but also becomes more expensive to run. The choice of initial centers matters especially for high-dimensional categorical data, yet there is still no widely accepted algorithm for selecting initial cluster centers for categorical data. It is therefore necessary to propose such an algorithm for clustering high-dimensional categorical data.

Extended models of classical rough sets handle incomplete, imprecise, and noisy data sets well, and applying extended rough set methods to high-dimensional incomplete data sets has already yielded some good clustering algorithms. To address the problems above, this thesis uses an extended rough set model, the restricted tolerance relation, to perform feature selection on high-dimensional incomplete categorical data and to design a clustering algorithm. The main work consists of the following two parts:
(1) Feature selection for high-dimensional incomplete categorical data: the rough set model is extended with the restricted tolerance relation, information entropy and conditional information entropy are redefined, and a conditional-entropy-based dimensionality reduction algorithm for high-dimensional incomplete categorical data, CEHDAR, is constructed.
(2) An initial cluster center selection algorithm based on weighted overlap distance and weighted average density: the information entropy of the restricted tolerance relation is used to define attribute significance, from which the weight of each attribute is derived. When the distance between objects and the density of an object are computed, each attribute is given its corresponding weight, so that the different contributions of different attributes to the clustering are reflected. Experiments show that, compared with existing clustering initialization methods, the WDADI algorithm performs best. Running it on data sets from the UCI repository then demonstrates the effectiveness of the improved algorithm.
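The abstract gives no formulas, so the following Python sketch is only an illustration of how the ideas in part (2) could fit together, not the thesis's WDADI algorithm. Everything in it is an assumption: the `None` marker for missing values, the entropy formula, the rule that higher single-attribute entropy means a larger weight, the treatment of missing values in the distance, and the "density times distance to nearest chosen center" selection rule. All function names are hypothetical.

```python
import math

MISSING = None  # assumed marker for an unknown attribute value


def restricted_tolerant(x, y):
    """True if x and y are related under a restricted tolerance relation."""
    known_x = [v is not MISSING for v in x]
    known_y = [v is not MISSING for v in y]
    if not any(known_x) and not any(known_y):
        return True  # both objects are unknown on every attribute
    shared = [i for i in range(len(x)) if known_x[i] and known_y[i]]
    # Need at least one commonly known attribute, and agreement on all of them.
    return bool(shared) and all(x[i] == y[i] for i in shared)


def entropy(data, attrs):
    """Assumed entropy: -(1/|U|) * sum over x of log2(|S(x)| / |U|),
    where S(x) is the restricted tolerance class of x on attrs."""
    n = len(data)
    total = 0.0
    for x in data:
        px = [x[a] for a in attrs]
        size = sum(restricted_tolerant(px, [y[a] for a in attrs]) for y in data)
        total += math.log2(size / n)
    return -total / n


def attribute_weights(data):
    """Assumed weighting: single-attribute entropies normalized to sum to 1."""
    m = len(data[0])
    sig = [entropy(data, [a]) for a in range(m)]
    s = sum(sig)
    return [v / s for v in sig] if s > 0 else [1.0 / m] * m


def weighted_overlap(x, y, w):
    """Weighted overlap distance: sum of weights of mismatching attributes.
    Assumption: a missing value matches only another missing value."""
    return sum(wi for xi, yi, wi in zip(x, y, w) if xi != yi)


def initial_centers(data, k):
    """Pick k initial centers: densest object first, then objects that are
    both dense and far from every center chosen so far (assumed rule)."""
    w = attribute_weights(data)
    n = len(data)
    # Weighted average density: 1 minus the mean weighted distance to all objects.
    dens = [1 - sum(weighted_overlap(x, y, w) for y in data) / n for x in data]
    centers = [max(range(n), key=dens.__getitem__)]
    while len(centers) < k:
        def score(i):
            return dens[i] * min(weighted_overlap(data[i], data[c], w) for c in centers)
        centers.append(max((i for i in range(n) if i not in centers), key=score))
    return [data[i] for i in centers]


if __name__ == "__main__":
    toy = [
        ("a", "x", "p"), ("a", "x", None), ("a", "y", "p"),
        ("b", "z", "q"), ("b", "z", "q"), ("b", None, "q"),
    ]
    print(initial_centers(toy, 2))
```

Because the weights sum to 1, every distance lies in [0, 1], which keeps the density and the selection score on a comparable scale. A different entropy definition or missing-value rule would change the weights and possibly the chosen centers.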
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13





