Research on Clustering Algorithms in Big Data Analysis

Published: 2018-06-24 17:03

Topic: cluster analysis + Hadoop. Source: Anhui University of Science and Technology, master's thesis, 2016

[Abstract]: With the development of information technology, especially mobile communication technology, social networks, the Internet of Things, and cloud computing have become part of daily work and life. People have accumulated large amounts of data, and that data is still growing rapidly. How to extract valuable information from such massive data has become a widely studied problem in many fields. Cluster analysis is a common technique in data mining and machine learning and is widely used in both academia and industry. However, traditional clustering algorithms process data serially; when applied to massive data sets they are inefficient, owing to memory limits and other constraints, and cannot meet current data-processing needs. To address this challenge and improve the efficiency of clustering, parallel clustering has become an active research topic. Hadoop is a widely used data analysis platform; it is an open-source implementation of the MapReduce computing model and of the distributed storage system GFS (Google File System). Because of its ease of use and good scalability, Hadoop has become one of the core platforms for big data analysis. Spark is a currently popular distributed computing platform; it implements an in-memory distributed data structure and provides a simple yet powerful programming interface that can be used to build clustering algorithms for big data analysis. This thesis compares these two big data processing platforms, analyzes their parallelization principles in detail, and discusses how to parallelize clustering algorithms to process massive data. It also examines typical clustering algorithms used in big data analysis, along with their characteristics and application scenarios, and proposes a k-means clustering algorithm for large data sets based on prediction strength, with implementations on both platforms.
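As a rough illustration of what parallelizing a clustering algorithm can look like, the sketch below expresses one iteration of plain k-means as a map step, a shuffle, and a reduce step, in single-process Python. It is not the thesis's prediction-strength algorithm, and it does not use Hadoop or Spark APIs. The function names (`map_assign`, `reduce_mean`, `kmeans_iteration`) and the sample data are assumptions made for this example.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step: sum the points and counts of one cluster, then average."""
    total, count = None, 0
    for point, n in pairs:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        count += n
    return tuple(v / count for v in total)


def kmeans_iteration(points, centroids):
    """One k-means iteration expressed as map -> shuffle -> reduce."""
    groups = defaultdict(list)  # shuffle: group the map output by cluster index
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(points, centroids)
```

Because the map step handles each point independently and the reduce step only needs per-cluster sums and counts, the point set can in principle be split across workers, which is the property that makes k-means a natural fit for MapReduce-style platforms.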
[Degree-granting institution]: Anhui University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13


Article ID: 2062310

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2062310.html
