Research on Optimizing the Parallelized k-means Clustering Algorithm Based on Mahout

Published: 2018-05-13 06:56

Topic: cluster analysis + k-means algorithm; Reference: Huazhong University of Science and Technology, 2016 master's thesis

[Abstract]: Cluster analysis is an important means of extracting useful information from large volumes of data, and the algorithms used for it are called clustering algorithms. The k-means clustering algorithm is simple, fast, and effective, and is one of the most widely used classical clustering algorithms. The rapid growth of the Internet industry has led to a sharp increase in data volume, and traditional k-means can no longer meet the demands of clustering massive data sets. MapReduce parallelization of k-means, and research on optimizing the parallelized algorithm, is therefore particularly important. This thesis examines how parallelized k-means is implemented and, on that basis, adopts an optimization strategy suited to massive data processing, with the goal of reducing the algorithm's time and space complexity while obtaining better clustering results.

Starting from the state of research on k-means optimization and parallelization, the thesis finds that existing optimization methods mainly target serial k-means, while work on k-means parallelization centers on algorithm design. Optimization of parallelized k-means is therefore still a weak area of research both in China and abroad, and the thesis accordingly adopts the approach of optimizing parallelized k-means with an algorithm of low time complexity. As background, it introduces the distributed open-source framework Hadoop, the MapReduce programming model, and Mahout, an algorithm library that provides distributed implementations of large-scale machine learning algorithms such as collaborative filtering, clustering, and classification. It then focuses on the principle of the k-means algorithm, its shortcomings, and its parallelized implementation in Mahout. Finally, it applies an optimization method aimed at parallelized k-means: Canopy, a "coarse clustering" algorithm with very low time complexity, is used to optimize parallelized k-means.

In the performance-testing phase, k-means before and after Canopy optimization is implemented through the algorithm driver and other interfaces provided by the Mahout library, and both versions are run on a Hadoop distributed test platform. Parameters are adjusted using the control-variable method, and clustering performance is tested on Gaussian-distributed data sets. The experimental data show that the optimized algorithm clusters clearly better: without sacrificing efficiency, it converges to more accurate centroids in fewer iterations, and its stability also improves significantly. Overall, the Canopy-based optimization of k-means is clearly effective.
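The Canopy-then-k-means idea described in the abstract can be illustrated with a minimal single-process sketch in Python. This is not the thesis's implementation and does not use Mahout's API: the function names, the thresholds `t1` and `t2`, and the synthetic data are all assumptions made for illustration. A cheap Canopy pass picks the initial centroids (and implicitly the number of clusters), and each k-means iteration is written as a "map" step (assign each point to its nearest centroid) followed by a "reduce" step (average the points per centroid), mirroring how the work would be split in MapReduce.

```python
import math
import random
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def canopy_centers(points, t1, t2):
    """Coarse clustering: return one center per canopy.

    t1 is the loose threshold and t2 the tight one (t1 > t2), both
    hypothetical parameters. Points within t1 of the picked point form
    the canopy; points within t2 are dropped from further consideration.
    """
    assert t1 > t2 > 0
    remaining = list(points)
    centers = []
    while remaining:
        picked = remaining.pop(0)
        members = [picked] + [p for p in remaining if dist(picked, p) < t1]
        centers.append(mean(members))
        remaining = [p for p in remaining if dist(picked, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=50, tol=1e-6):
    """Lloyd's k-means from the given initial centers.

    Returns the final centers and the number of iterations performed.
    """
    centers = list(centers)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # "Map": emit (index of nearest center, point).
        groups = defaultdict(list)
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[idx].append(p)
        # "Reduce": average the points per key; an empty cluster keeps its center.
        new_centers = [
            mean(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))
        ]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iteration


def gaussian_blobs(true_centers, n_per_blob, sigma, seed=0):
    """Synthetic 2-D Gaussian-distributed test data (illustrative only)."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, sigma), rng.gauss(cy, sigma))
        for cx, cy in true_centers
        for _ in range(n_per_blob)
    ]


if __name__ == "__main__":
    truth = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data = gaussian_blobs(truth, 200, 0.5)
    seeds = canopy_centers(data, t1=6.0, t2=4.0)
    centers, n_iter = kmeans(data, seeds)
    print("canopies:", len(seeds), "iterations:", n_iter)
    print("centers:", centers)
```

In this sketch the thresholds have to be chosen relative to the spacing of the data: a `t2` that is too small produces many canopies (and so many initial centroids), while one that is too large merges distinct clusters.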
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13


Article ID: 1882143


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1882143.html
