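Research on a Class of Density-Based Clustering Algorithms

Published: 2018-05-22 15:26

  Topics: clustering algorithms + cluster validity index; Source: Shandong Normal University, master's thesis, 2017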


[Abstract]: Density-based clustering algorithms occupy an important place in cluster analysis. They are widely applied in information filtering, information retrieval, health care, public services, and other fields, and are a key research topic within cluster analysis. This thesis studies the characteristics of hierarchical clustering algorithms and of density clustering algorithms and proposes a hierarchy-based density clustering algorithm; the results show that the new algorithm improves both clustering accuracy and clustering efficiency. Building on CFSFDP (Clustering by Fast Search and Find of Density Peaks), a density clustering algorithm proposed by Alex Rodriguez and Alessandro Laio, the thesis also proposes a parallel model of that algorithm under the MapReduce framework. Like other density clustering algorithms, the parallel algorithm can handle clusters of complex shape, and the number of clusters in the data does not need to be specified in advance; in addition, CFSFDP requires few user-specified parameters. Compared with clustering algorithms that require iteration, its running time is greatly reduced.

The main research work of this thesis includes:

(1) Traditional clustering algorithms need to cluster the data set repeatedly, and their computational efficiency is poor on large-scale data sets. To address this, the thesis proposes an improved algorithm, CODHD, which determines the optimal number of clusters and the initial cluster centers on the basis of hierarchical clustering and does not need to cluster the data set repeatedly. First, the data set is scanned to obtain all statistics of the clustering features. Second, a bottom-up method generates data partitions at different levels. For each partition, the density of the data points is computed and the point of maximum density is taken as the center point; the minimum distance from the center point to any point of higher density is computed and multiplied by the density of the center point; and the average of the sum of these products is taken as the validity index. From the clustering results, a curve over the different levels is built incrementally. Finally, the partition corresponding to the extreme point of the curve is used to estimate the initial cluster centers and the optimal number of clusters. Experimental results show that, compared with the COPS algorithm, the proposed CODHD algorithm improves both clustering accuracy and efficiency.

(2) The traditional CFSFDP algorithm recognizes clusters of arbitrary shape and arbitrary dimension well, but on large-scale data sets computing the distance between every pair of points takes too long. To overcome this shortcoming, the thesis proposes a MapReduce-based CFSFDP algorithm, also called mr CFSFDP. mr CFSFDP needs to read the data set only once, so it runs quickly. When run on multiple nodes, each phase of mr CFSFDP is divided into two steps: a Map stage and a Reduce stage. The algorithm was tested on many data sets, and the experimental results show that the model is feasible and performs well in both accuracy and efficiency.

All data sets used in this thesis are real data sets taken from UCI. Two new clustering models are built on classical clustering models, and comparisons with other algorithms show that the newly proposed algorithms achieve better clustering results.
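The density-peak quantities and the validity index described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes Euclidean distance, the cutoff-kernel density of the original density-peaks formulation (a point's density is the number of other points closer than a cutoff `d_c`), and a partition supplied as an array `labels`. The function names `density_peak_quantities` and `validity_index` are hypothetical, and averaging the density-distance products over the clusters of a partition is only one reading of the abstract's description.

```python
import numpy as np


def density_peak_quantities(points, d_c):
    """Return (rho, delta) for every point.

    rho[i]   : number of other points closer than the cutoff d_c
    delta[i] : distance to the nearest point with strictly higher rho;
               for a point with no higher-density point, the largest
               distance from it to any other point
    """
    points = np.asarray(points, dtype=float)
    # Full pairwise Euclidean distance matrix (O(n^2) memory and time).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # The diagonal (distance 0 to itself) is always counted, so subtract 1.
    rho = (dist < d_c).sum(axis=1) - 1

    delta = np.empty(len(points))
    for i in range(len(points)):
        higher = rho > rho[i]
        if higher.any():
            delta[i] = dist[i, higher].min()
        else:
            delta[i] = dist[i].max()
    return rho, delta


def validity_index(rho, delta, labels):
    """Mean of rho * delta over the highest-density point of each cluster.

    Assumption: one center per cluster, chosen as the member with the
    largest rho (the first such member if there is a tie).
    """
    labels = np.asarray(labels)
    products = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        center = members[np.argmax(rho[members])]
        products.append(rho[center] * delta[center])
    return float(np.mean(products))
```

Calling `density_peak_quantities(points, d_c)` once and then `validity_index(rho, delta, labels)` for the partition at each level of a hierarchy would give one value per level, which is the kind of curve the abstract describes. The full distance matrix used here is also the quadratic cost that motivates the MapReduce variant in item (2).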
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

Article ID: 1922631

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1922631.html

