A Density-Based Distributed Clustering Method
Published: 2018-12-12 08:55
[Abstract]: Clustering is an important data analysis method in data mining. It partitions unlabeled data into clusters according to the similarity between data points. CSDP is a density-based clustering algorithm whose efficiency is relatively low when the data volume is large or the dimensionality is high. To improve its efficiency, a density-based distributed clustering method, MRCSDP, is proposed, which clusters the experimental data using the MapReduce framework. The method defines the concepts of independent computing unit and independent computing block. First, the data are split into several data blocks, independent computing units and independent computing blocks are constructed, and the tasks for the independent computing blocks are assigned across the cluster. Then distributed computation yields the local density of each data block; the local densities are merged into the global density, the center value is computed from the global density, and the candidate cluster centers in each data block are obtained from the global density and the center value. Finally, the final cluster centers are elected from the candidate cluster centers. MRCSDP achieves good clustering results while substantially reducing time complexity. Experimental results show that, in a distributed environment, MRCSDP handles large-scale data faster and more effectively than CSDP and balances the load across nodes.
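The workflow in the abstract (split into blocks, compute local densities, merge into a global density, compute a center value, pick candidates per block, elect final centers) can be illustrated with a small single-process sketch. This is not the authors' implementation. It assumes that CSDP follows the density-peaks scheme, that local density is a count of neighbours within a cutoff distance, that the "center value" is the distance to the nearest point of higher density, and that centers are ranked by the product of the two. It also assumes that one independent task corresponds to one pair of data blocks. All function names and parameters (`split_blocks`, `map_local_density`, `find_centers`, `cutoff`, `k`) are hypothetical, and the map and reduce steps are simulated with plain Python lists rather than a real MapReduce cluster.

```python
from collections import Counter
from math import dist


def split_blocks(points, n_blocks):
    """Split the points into n_blocks lists of (index, point) pairs."""
    indexed = list(enumerate(points))
    return [indexed[i::n_blocks] for i in range(n_blocks)]


def map_local_density(block_a, block_b, same_block, cutoff):
    """Map step (assumed): neighbour counts contributed by one pair of blocks."""
    local = Counter()
    for i, p in block_a:
        for j, q in block_b:
            if same_block and j <= i:
                continue  # inside a single block, visit each unordered pair once
            if dist(p, q) < cutoff:
                local[i] += 1
                local[j] += 1
    return local


def reduce_global_density(partials, n_points):
    """Reduce step (assumed): sum the partial counts into a global density."""
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds counts
    return [total[i] for i in range(n_points)]


def center_values(points, rho):
    """Assumed center value: distance to the nearest point of higher density.

    Ties in density are broken by index; the densest point receives its
    largest distance to any other point.
    """
    order = sorted(range(len(points)), key=lambda i: (-rho[i], i))
    delta = [0.0] * len(points)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist(points[i], q) for q in points)
        else:
            delta[i] = min(dist(points[i], points[j]) for j in order[:rank])
    return delta


def find_centers(points, n_blocks, cutoff, k):
    """Return (global density, sorted indices of the k elected centers)."""
    blocks = split_blocks(points, n_blocks)
    # Assumption: each pair of blocks (a <= b) is one independent task.
    tasks = [(a, b) for a in range(n_blocks) for b in range(a, n_blocks)]
    partials = [
        map_local_density(blocks[a], blocks[b], a == b, cutoff)
        for a, b in tasks
    ]
    rho = reduce_global_density(partials, len(points))
    delta = center_values(points, rho)
    gamma = [r * d for r, d in zip(rho, delta)]  # assumed ranking score
    # Candidate centers: the k best-scoring points of every block.
    candidates = []
    for block in blocks:
        ids = sorted((i for i, _ in block), key=lambda i: -gamma[i])
        candidates.extend(ids[:k])
    # Election: the k best-scoring candidates overall.
    centers = sorted(candidates, key=lambda i: -gamma[i])[:k]
    return rho, sorted(centers)
```

Calling `find_centers(points, n_blocks=3, cutoff=1.0, k=2)` on a list of coordinate tuples returns the per-point densities and the indices of the two elected centers. Because every block pair is processed independently, the list comprehension over `tasks` is the part that a MapReduce job would distribute across nodes.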
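[Affiliation]: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University
[Classification number]: TP311.13
Article ID: 2374295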
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2374295.html