Research on a Class of Density-Based Clustering Algorithms

Published: 2018-05-22 15:26

Topics: clustering algorithms + cluster validity index; Source: Shandong Normal University, master's thesis, 2017

[Abstract]: Density-based clustering algorithms occupy an important place in cluster analysis. They are widely used in information filtering, information retrieval, health care, public services, and other fields, and they are a major focus of clustering research. This thesis studies the characteristics of hierarchical clustering and of density-based clustering and proposes a hierarchy-based density clustering algorithm; the results show that the new algorithm improves both clustering accuracy and clustering efficiency. Building on CFSFDP (Clustering by Fast Search and Find of Density Peaks), a density clustering algorithm proposed by Alex Rodriguez and Alessandro Laio, the thesis also proposes a parallel model of that algorithm under the MapReduce framework. Like other density-based clustering algorithms, the parallel algorithm can handle clusters of complex shape and does not need the number of clusters to be specified in advance; CFSFDP also requires few user-specified parameters. Compared with clustering algorithms that need iteration, its running time is greatly reduced.

The main work of the thesis is as follows.

(1) Traditional clustering algorithms must cluster the data set repeatedly and are inefficient on large-scale data sets. To address this, the thesis proposes an improved algorithm, CODHD, which uses hierarchical clustering to determine the optimal number of clusters and the initial cluster centers without clustering the data set repeatedly. First, the data set is scanned to obtain all the statistics of the clustering features. Second, a bottom-up method generates data partitions at different hierarchy levels. For each partition, the density of its data points is computed and the point with the highest density is taken as the center point; the minimum distance from the center point to any point of higher density is computed and multiplied by the density of the center point; and the products are summed and averaged to give the validity index. From the clustering results, a curve over the different hierarchy levels is built incrementally. Finally, the partition at the extreme point of the curve is used to estimate the initial cluster centers and the optimal number of clusters. Experiments show that CODHD improves on the COPS algorithm in both clustering accuracy and efficiency.

(2) The traditional CFSFDP algorithm recognizes clusters of arbitrary shape and arbitrary dimension well, but on large-scale data sets computing the distance between every pair of points takes too long. To overcome this, the thesis proposes a MapReduce-based CFSFDP algorithm, called mr CFSFDP. mr CFSFDP reads the data set only once, so its running time is short. It runs on multiple nodes, and each of its phases is divided into two steps: a Map stage and a Reduce stage. The algorithm was tested on many data sets, and the experimental results show that the model is feasible and performs well in both accuracy and efficiency.

All data sets used in the thesis are real data sets from UCI. Two new clustering models are built on classical clustering models, and comparisons with other algorithms show that the proposed algorithms give better clustering results.
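The density and distance quantities mentioned in the abstract can be illustrated with a short sketch. It is not the thesis's implementation. It assumes the cutoff-kernel density from the original density-peaks method (the number of neighbours closer than a cutoff `d_c`), Euclidean distance, and one reading of the validity index in which every cluster of a partition contributes the product of its centre's density and that centre's distance to the nearest denser point. The function names, the value of `d_c`, the tie-breaking rule, and the toy data are all hypothetical.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix (O(n^2), the cost the abstract calls too slow at scale)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def local_density(dist, d_c):
    """rho_i: number of other points closer than the cutoff d_c (assumed cutoff kernel)."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]


def delta_to_denser(dist, rho):
    """delta_i: distance to the nearest point of strictly higher density.

    Points with no denser neighbour (the global peak) get their largest
    distance to any point, following the usual density-peaks convention.
    """
    n = len(dist)
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))
    return delta


def validity_index(labels, rho, delta):
    """Mean of rho * delta over cluster centres (assumed reading of the abstract).

    The centre of each cluster is its densest point; ties keep the first point seen.
    """
    centres = {}
    for i, label in enumerate(labels):
        if label not in centres or rho[i] > rho[centres[label]]:
            centres[label] = i
    return sum(rho[c] * delta[c] for c in centres.values()) / len(centres)


# Toy data: one group of three points near the origin, one group of two far away.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
dist = pairwise_distances(points)
rho = local_density(dist, d_c=1.2)        # d_c is a hypothetical choice
delta = delta_to_denser(dist, rho)

natural = validity_index([0, 0, 0, 1, 1], rho, delta)   # matches the two groups
split = validity_index([0, 0, 1, 1, 1], rho, delta)     # cuts the first group
print(rho, natural > split)
```

On this toy data the partition that matches the two groups scores higher than the one that cuts a group apart, which is the kind of contrast a validity curve over hierarchy levels would expose. The abstract does not say whether the extreme point it selects is a maximum or a minimum, so the comparison above is only illustrative.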
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

