Research on Distributed Clustering Algorithms Based on MapReduce
[Abstract]: Cluster analysis is one of the most basic data analysis techniques in data mining and is widely used in economics, social science, and computer science. With the rapid development of Internet technology, however, the data generated by network applications has grown rapidly, which poses serious technical challenges for traditional cluster analysis methods. Extracting valuable information from massive data quickly and effectively has become an urgent problem in many industries. The maturing of cloud computing technology makes such processing feasible. Hadoop is an open-source distributed computing platform whose core components are the Hadoop Distributed File System (HDFS) and MapReduce: HDFS provides storage for massive data, and MapReduce provides a programming model for processing that data in parallel. Compared with traditional parallel programming models, MapReduce encapsulates the details of data partitioning, task scheduling, and parallel processing, so users can develop distributed applications without understanding the low-level details of the distributed system, which greatly simplifies the design of parallel programs. K-means, a classical clustering algorithm, is applied in many industries. As the data scale grows, however, the number of iterations the algorithm needs increases markedly, which reduces its efficiency. To apply K-means to the clustering of large-scale data, this paper first parallelizes the algorithm on the Hadoop platform following the MapReduce programming model. It then addresses two weaknesses of K-means: the blindness of randomly selecting the initial cluster centers, and the tendency of the clustering result to fall into a local optimum.
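As a rough illustration of how a K-means iteration fits the MapReduce model, the sketch below simulates one job per iteration in plain Python. It is not the thesis's implementation and uses no Hadoop API: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans_mapreduce`), the Euclidean distance, and the in-memory dictionary standing in for the shuffle phase are all assumptions made for illustration.

```python
from collections import defaultdict
import math


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans_map(point, centers):
    """Map step: emit (cluster index, (point, 1)) for one input record."""
    return nearest_center(point, centers), (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that were assigned to one cluster."""
    dim = len(values[0][0])
    sums = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            sums[d] += point[d]
        count += n
    return tuple(s / count for s in sums)


def kmeans_iteration(points, centers):
    """One simulated job: map every point, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the shuffle phase (assumption)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # A center that received no points keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else tuple(centers[i])
            for i in range(len(centers))]


def kmeans_mapreduce(points, centers, max_iter=100, tol=1e-9):
    """Repeat the job until no center moves by more than `tol`."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        if all(math.dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            return new_centers
        centers = new_centers
    return centers
```

Because every iteration is a separate job that rereads the data and redistributes the centers, the number of iterations largely determines the total cost, which is why the choice of initial centers matters in this setting.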
The main work of this paper is as follows:
(1) Based on an analysis of the traditional K-means algorithm and the max-min distance idea, a parallel K-means algorithm based on max-min distance is proposed. Cluster centers are selected according to the max-min distance idea and used as the initial centers of K-means, which avoids the situation in which randomly selected initial centers lie too close to one another and thereby improves the quality of the clustering result. To improve efficiency, a parallel version of the algorithm is designed and implemented.
(2) The principle, advantages, and disadvantages of the one-pass clustering algorithm are analyzed and, combined with the characteristics of traditional K-means, the parallel OPKMEANS algorithm is proposed. Exploiting the simplicity and efficiency of one-pass clustering, the algorithm first performs a quick "coarse" clustering of the data set and then uses the resulting centers as the initial centers of K-means, avoiding the blindness of random center selection. This reduces the number of K-means iterations, which lowers the data transmission overhead of the parallel process and improves the efficiency of the algorithm.
(3) To verify the effectiveness of the improved algorithms, this paper studies the principles of Hadoop, builds a Hadoop distributed computing platform on virtual machines, and carries out a series of experiments to verify the advantages of the above algorithms in clustering quality, speedup, and scalability.
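The two initialization ideas in items (1) and (2) can be sketched in serial form as follows. This is a minimal sketch under stated assumptions, not the thesis's algorithms: the abstract gives no details, so the choice of the first point as the starting center, the fixed number of centers `k`, the `radius` threshold for one-pass clustering, and the function names `max_min_centers` and `one_pass_centers` are all hypothetical. Either result would then be passed to K-means as its initial centers.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers by the max-min distance idea (assumes k <= len(points))."""
    centers = [points[0]]  # assumption: start from the first point
    # min_dist[i] = distance from points[i] to its nearest chosen center
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return centers


def one_pass_centers(points, radius):
    """Single scan: join the nearest cluster within `radius`, else open a new one."""
    sums, counts = [], []  # per-cluster coordinate sums and sizes
    for p in points:
        best, best_d = None, float("inf")
        for c in range(len(sums)):
            center = [s / counts[c] for s in sums[c]]
            d = math.dist(p, center)
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= radius:
            # Close enough: absorb the point and update the running sums.
            sums[best] = [s + x for s, x in zip(sums[best], p)]
            counts[best] += 1
        else:
            # Too far from every cluster (or no cluster yet): start a new one.
            sums.append(list(p))
            counts.append(1)
    return [tuple(s / n for s in row) for row, n in zip(sums, counts)]
```

In this sketch the max-min variant needs `k` in advance and spreads the centers apart, while the one-pass variant needs a distance threshold and lets the number of coarse clusters emerge from the data.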
[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13