Design and Implementation of a Network Data Parallel Processing System Based on the Hadoop Platform
Published: 2018-06-04 13:40
Keywords: clustering algorithm; Hadoop. Source: Southeast University, 2017 master's thesis
【Abstract】: The arrival of the mobile Internet era has brought many conveniences to people's lives, but it also means that ever more data is produced, and how to mine value from this massive data is a highly worthwhile research topic. Clustering is one such tool for extracting value from massive data; it has a very wide range of application scenarios, including classifying unknown items and supporting the corresponding applications. As data volumes surge, clustering algorithms increasingly struggle in a single-machine environment and run into bottlenecks, so massive data places new demands on clustering algorithms and the processing systems that run them. This thesis presents the design and implementation of a network data parallel processing system based on the Hadoop platform.

The thesis first studies Spark performance optimization in two parts: optimization during development, and shuffle performance optimization. The former covers avoiding shuffle operators and persisting RDDs that are reused multiple times; the latter examines the respective applicable scenarios of sort shuffle and hash shuffle and the corresponding tuning, verified by experiments. To develop a parallelized clustering algorithm that copes with massive data, the thesis introduces the Hadoop platform and deploys Spark on top of it.

To address the problem that k-means performs too many iterations because its initial centers are chosen at random, the thesis proposes a Spark-based k-means algorithm improved with Kruskal's algorithm for initial-center selection, evaluated by iteration count and iteration time. Spark's k-means++ implementation serves as the baseline; experiments show that the Kruskal-improved k-means on Spark achieves both shorter running time and fewer iterations than Spark's k-means++. To address the fact that k-means does not consider the similarity between vectors, the thesis further proposes a Spark-based k-means improved with both Kruskal's algorithm and the Tanimoto distance, using the sum of squared errors (SSE) as the evaluation metric; compared with Spark's k-means++ and with the Kruskal-improved k-means alone, it attains a smaller SSE and therefore better clustering results.

Finally, the thesis builds a complete Hadoop-based network data parallel processing system whose architecture gives it the capacity for big-data and high-complexity computation. The Hadoop platform lets the system rely on inexpensive hardware to provide high computing and storage capacity, and gives it good horizontal scalability: as the data scale grows, cluster processing power can be increased simply by adding machines. Moreover, the system is generally applicable; it suits not only movie recommendation and network anomaly detection, but any scenario that uses clustering for data processing.
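The thesis does not reproduce its Spark code here, so the following is a minimal single-machine Python sketch of one common reading of "k-means improved with Kruskal's algorithm" (克洛斯卡爾算法): build a minimum spanning tree over the points with Kruskal's algorithm, delete the k−1 longest MST edges to obtain k connected components, and use each component's centroid as an initial center. The function name and the MST-cut strategy are assumptions for illustration; the thesis's distributed implementation may differ.

```python
import math
from itertools import combinations

def kruskal_init_centers(points, k):
    """Pick k initial k-means centers: build an MST with Kruskal's
    algorithm, drop the k-1 longest MST edges, and return the
    centroid of each resulting connected component."""
    n = len(points)

    # Union-Find with path halving, as used by Kruskal's algorithm.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges sorted by Euclidean distance (O(n^2) sketch;
    # a Spark version would compute this over sampled partitions).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
            if len(mst) == n - 1:
                break

    # Keep all but the k-1 longest MST edges -> k components.
    mst.sort()
    kept = mst[: len(mst) - (k - 1)]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)

    # The centroid of each component becomes an initial center.
    comps = {}
    for idx in range(n):
        comps.setdefault(find(idx), []).append(points[idx])
    return [
        tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for pts in comps.values()
    ]

# Two well-separated groups: each center should land near one group,
# so k-means starts close to the answer and needs fewer iterations.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers = kruskal_init_centers(pts, 2)
```

Because the cut MST edges are the longest ones, the surviving components correspond to well-separated groups, which is exactly why this initialization tends to reduce the iteration count compared with random seeding.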
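The abstract's "Goramoto distance" (谷本距離) is the Tanimoto distance. A minimal sketch of how it can replace Euclidean distance in the k-means assignment step, with the thesis's SSE metric computed alongside, is shown below; it assumes the extended Tanimoto coefficient for real-valued vectors, T(a, b) = a·b / (|a|² + |b|² − a·b), with distance 1 − T. The helper names are illustrative, not the thesis's API.

```python
import math

def tanimoto_distance(a, b):
    """Extended Tanimoto (Jaccard) distance for real-valued vectors:
    1 - a.b / (|a|^2 + |b|^2 - a.b). Zero for identical vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a)
    nb = sum(y * y for y in b)
    denom = na + nb - dot
    return 1.0 - (dot / denom if denom else 1.0)

def assign_and_sse(points, centers):
    """Assign each point to its Tanimoto-nearest center and report the
    sum of squared Euclidean errors (SSE), the evaluation metric used
    to compare the clustering variants."""
    labels, sse = [], 0.0
    for p in points:
        _, idx = min((tanimoto_distance(p, c), i)
                     for i, c in enumerate(centers))
        labels.append(idx)
        sse += math.dist(p, centers[idx]) ** 2
    return labels, sse

# Points near two directions; Tanimoto assignment groups them by
# orientation as well as magnitude.
pts = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
centers = [(1.0, 0.0), (0.0, 1.0)]
labels, sse = assign_and_sse(pts, centers)
# labels -> [0, 0, 1, 1]
```

Unlike Euclidean distance, the Tanimoto distance is sensitive to the angle between vectors, which is the "similarity between vectors" the abstract says plain k-means ignores; a lower SSE under this assignment is the thesis's criterion for a better clustering.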
【Degree-granting institution】: Southeast University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP311.13
Article ID: 1977563
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1977563.html