
Research on Distributed Clustering Algorithms Based on MapReduce

Published: 2018-07-25 18:55
[Abstract]: Cluster analysis is one of the most fundamental data analysis techniques in data mining and is widely used in economics, the social sciences, and computer science. With the rapid development of Internet technology, however, the data generated by network applications has grown sharply, which poses a major technical challenge to traditional cluster analysis methods. Extracting valuable information from massive data quickly and effectively has become an urgent problem in many industries, and the growing maturity of cloud computing makes this possible. Hadoop is an open-source distributed cloud computing platform whose core components are the Hadoop Distributed File System (HDFS) and MapReduce: HDFS stores massive data, and the MapReduce programming model processes that data in parallel. Compared with traditional parallel programming models, MapReduce encapsulates low-level details such as data partitioning, task scheduling, and parallel processing, so users can develop distributed applications without understanding those details, which greatly simplifies the design of parallel programs. K-means, a classic clustering algorithm, is used across many industries, but as data volume grows its number of iterations increases noticeably, which reduces its efficiency. To make K-means suitable for clustering large-scale data, this thesis first parallelizes the algorithm on the Hadoop platform following the MapReduce programming principle, and then addresses two weaknesses of K-means: the blind random selection of initial cluster centers and the tendency of the clustering result to fall into a local optimum.
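The MapReduce formulation of K-means mentioned above can be illustrated with a minimal single-process sketch. This is not the thesis's Hadoop implementation, which the abstract does not show; the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) and the in-memory dictionary standing in for the shuffle phase are assumptions made for illustration only.

```python
from collections import defaultdict
from math import dist  # requires Python 3.8+


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that were assigned to one center."""
    total, count = None, 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centers):
    """One map/shuffle/reduce round; returns the updated centers."""
    groups = defaultdict(list)  # hypothetical stand-in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # A center that received no points is kept unchanged.
    return [
        kmeans_reduce(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat rounds until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        if all(dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            return new_centers
        centers = new_centers
    return centers
```

On a real cluster each round would be a separate MapReduce job, so every extra iteration adds job-startup and data-transfer cost. That is why the quality of the initial centers matters for running time as well as for clustering quality.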
The main contributions of this thesis are as follows. (1) Building on an analysis of the traditional K-means algorithm and borrowing the max-min distance idea, a parallel K-means algorithm based on max-min distance is proposed. Cluster centers are selected by the max-min distance criterion and used as the initial centers of K-means, which avoids the situation, common under random selection, in which initial centers lie too close to one another, and thereby improves the quality of the clustering result. To improve efficiency, a parallel version of the algorithm is designed and implemented. (2) The principle, advantages, and disadvantages of the one-pass clustering algorithm are analyzed and, combined with the characteristics of traditional K-means, the parallel OPKMEANS algorithm is proposed. Exploiting the simplicity and efficiency of one-pass clustering, the algorithm first performs a fast "coarse" clustering of the data set and then uses the resulting centers as the initial centers of K-means. This avoids the blindness of random center selection and reduces the number of K-means iterations, which lowers the data transfer overhead of the parallel process and improves execution efficiency. (3) To verify the effectiveness of the improved algorithms, a Hadoop distributed computing platform was built on virtual machines after studying how Hadoop works, and several groups of experiments were run to evaluate the algorithms in terms of clustering quality, speedup, and scalability.
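The two initialization ideas in contributions (1) and (2) can be sketched as follows. The abstract does not give the thesis's exact procedures, so this is only an illustration under stated assumptions: `maxmin_centers` takes a fixed `k` and seeds from the first point, and `one_pass_centers` uses a hypothetical distance threshold `radius`. Neither name nor parameter comes from the source.

```python
from math import dist  # requires Python 3.8+


def maxmin_centers(points, k):
    """Max-min distance seeding (sketch).

    Start from the first point (an assumption), then repeatedly add the
    point whose distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def one_pass_centers(points, radius):
    """One-pass "coarse" clustering (sketch).

    Scan the data once: a point joins the nearest existing cluster if it
    lies within `radius` of that cluster's current mean, otherwise it
    opens a new cluster. Returns the cluster means.
    """
    sums, counts = [], []  # per-cluster coordinate sums and sizes
    for p in points:
        centers = [tuple(x / n for x in s) for s, n in zip(sums, counts)]
        if centers:
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            if dist(p, centers[j]) <= radius:
                sums[j] = [a + b for a, b in zip(sums[j], p)]
                counts[j] += 1
                continue
        sums.append(list(p))
        counts.append(1)
    return [tuple(x / n for x in s) for s, n in zip(sums, counts)]
```

Either function returns a list of centers that could be handed to a K-means routine in place of randomly chosen ones. In the one-pass variant the number of clusters is determined by `radius` rather than fixed in advance, so a practical implementation would need some rule for reconciling it with the desired `k`.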
[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13

