Analysis and Research of Machine Learning Models Based on Spark
Published: 2018-08-26 18:16
[Abstract]: Distributed computing is now mainstream, but applications built on the MapReduce framework perform frequent I/O operations, which limits their efficiency and performance. The RDD-based Spark distributed computing framework can load data into memory, which suits the needs of iterative machine learning models well. To address the problems of machine learning models designed and implemented on MapReduce (problems that stem mainly from the nature of MapReduce itself), this thesis studies machine learning models based on Spark, chiefly KMeans clustering and ALS collaborative filtering, and also studies an online machine learning model based on Spark Streaming. The main analysis and research contents are as follows. (1) The thesis designs and implements a parallel KMeans clustering model on the Spark distributed computing framework and runs comparative training experiments with it on MovieLens datasets of different sizes. The results show that the parallel KMeans clustering model is suited to a distributed cluster environment and achieves good parallel computing efficiency. In addition, using the repartition operator to load the data in partitions optimizes the parallel scheme and effectively reduces the model's training time. (2) To address the poor real-time responsiveness of MapReduce when processing massive data, the thesis designs and implements an online computing model based on Spark Streaming for large-scale KMeans clustering analysis. The model divides the whole process into modules such as data ingestion and online training; the modules are connected by data streams to form a task entity, which is submitted to a Spark distributed cluster to run. Comparative experiments and performance testing verify that the online KMeans clustering model offers high throughput and low latency, and that the cluster runs well.
(3) The ALS (alternating least squares) collaborative filtering recommendation algorithm makes recommendations through matrix factorization: it computes over a large volume of user rating data and stores the many feature matrices produced during the computation. Hadoop HA (high availability) addresses the NameNode single point of failure in the HDFS distributed file system. Spark, a newer in-memory distributed big data computing framework, offers excellent computing performance. The thesis builds an HA Hadoop big data platform based on QJM (Quorum Journal Manager) and, on top of the Spark computing framework, studies the ALS collaborative filtering algorithm and implements its parallel execution on Spark. Comparative experiments on the Netflix dataset against an ALS collaborative filtering algorithm based on Hadoop MapReduce show that the Spark-based ALS algorithm has markedly better parallel computing efficiency and is better suited to processing massive data.
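To make the online KMeans idea in item (2) concrete, here is a minimal single-machine sketch of how one mini-batch of points can be folded into running cluster centers. It is not the thesis's implementation and does not use Spark; the function names (assign, update_centers) and the decay parameter are assumptions introduced only for illustration. In a streaming job, an update of this kind would be applied once per incoming batch.

```python
from math import dist


def assign(point, centers):
    """Return the index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def update_centers(centers, weights, batch, decay=1.0):
    """Fold one mini-batch into the running centers.

    centers: list of coordinate lists, one per cluster
    weights: how many (possibly discounted) points each center represents
    batch:   list of new points
    decay:   hypothetical forgetting factor; 1.0 keeps all history,
             values below 1.0 favor recent batches
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)

    # Assign every new point to its nearest current center.
    for point in batch:
        k = assign(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]

    new_centers, new_weights = [], []
    for center, weight, total_sum, count in zip(centers, weights, sums, counts):
        old = weight * decay          # discounted weight of the history
        total = old + count           # weight after absorbing the batch
        if total == 0:
            # No history and no new points: leave the center where it is.
            new_centers.append(list(center))
            new_weights.append(0.0)
            continue
        # Weighted average of the old center and the new points.
        new_centers.append(
            [(center[d] * old + total_sum[d]) / total for d in range(dim)]
        )
        new_weights.append(total)
    return new_centers, new_weights
```

Because each batch only needs per-cluster sums and counts, the assignment step can run in parallel over partitions and the results can be merged afterwards, which is the property that makes this style of update a natural fit for a distributed streaming setting.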
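The alternating structure of ALS in item (3) can be sketched on a single machine. The example below is a simplified illustration, not the thesis's Spark implementation: it uses rank-1 factors (one number per user and per item) so that each least-squares step has a closed form, and the names als_rank1, rmse, and reg are assumptions chosen for this sketch. A practical recommender would use higher-rank factors and solve a small linear system per user and per item.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """Rank-1 alternating least squares on (user, item, rating) triples.

    The predicted rating is x[user] * y[item]. reg is an L2 penalty
    and must be positive so that users or items without ratings do
    not cause a division by zero.
    """
    x = [1.0] * n_users
    y = [1.0] * n_items
    for _ in range(iters):
        # Step 1: hold item factors fixed, solve each user factor.
        num = [0.0] * n_users
        den = [reg] * n_users
        for u, i, r in ratings:
            num[u] += r * y[i]
            den[u] += y[i] * y[i]
        x = [n / d for n, d in zip(num, den)]

        # Step 2: hold user factors fixed, solve each item factor.
        num = [0.0] * n_items
        den = [reg] * n_items
        for u, i, r in ratings:
            num[i] += r * x[u]
            den[i] += x[u] * x[u]
        y = [n / d for n, d in zip(num, den)]
    return x, y


def rmse(ratings, x, y):
    """Root-mean-square error of the factor model on the given ratings."""
    total = sum((r - x[u] * y[i]) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5
```

Within each step, every user factor (or item factor) depends only on the fixed factors of the other side, so the per-user and per-item solves are independent of one another. That independence is what allows the algorithm to be parallelized across a cluster.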
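[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13;TP181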
Article ID: 2205752
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2205752.html