Research on Spark-Based Distributed Collaborative Filtering and Tools

Published: 2018-07-02 21:29

Topic: Spark + collaborative filtering. Source: Nanjing University, master's thesis, 2017

[Abstract]: With the rapid development of mobile Internet and Internet of Things technologies, the volume of data being collected is growing exponentially. Distributed computing has become an indispensable technology for big data processing and analysis. By decomposing a computing task into subproblems that can execute concurrently and running them simultaneously on multiple interconnected compute nodes, distributed computing overcomes the single-machine performance bottleneck and poor scalability that traditional algorithms face. Distributed machine learning algorithms have accordingly become a research hotspot in industry.

Among the many distributed computing frameworks, Spark is widely used because of its fault tolerance, scalability, and ease of use. However, there is still no unified framework for analyzing and comparing the complexity of the distributed algorithms implemented on it. As a result, the scalability and performance of a given algorithm on Spark cannot be analyzed or compared theoretically, only empirically. Based on a study of the Spark platform, this thesis proposes a complexity analysis framework for distributed algorithms on Spark and uses Spark-based collaborative filtering algorithms as the application scenario. The results show that the framework can effectively guide both algorithm development and runtime environment configuration.

Specifically, the thesis makes the following contributions. First, it introduces distributed computing and collaborative filtering. The distributed computing part analyzes the computation models, execution models, and design philosophies of the popular Hadoop and Spark platforms and explains the principles behind them. The collaborative filtering part analyzes memory-based and matrix-factorization-based collaborative filtering and introduces several classical algorithms. Second, it proposes a complexity analysis framework for distributed algorithms on Spark and, on that basis, carries out complexity analysis and experimental analysis of several Spark-based distributed collaborative filtering algorithms, covering three parallelization methods for memory-based collaborative filtering and three for matrix-factorization-based collaborative filtering. Finally, it designs a Spark-based data mining toolbox. By packaging data mining algorithms as components and providing a configuration-based development model for data analysis applications, the toolbox addresses the difficulty analysts have in using Spark. With the toolbox, users can apply a variety of distributed data mining algorithms to massive data sets without programming skills. The thesis describes the toolbox's functionality and its design and development process in detail.
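To make the term "memory-based collaborative filtering" concrete, the following is a minimal single-machine sketch of one common variant (user-based nearest neighbours with cosine similarity). It is an illustration only, not the thesis's algorithm or any of its three parallelization methods; the toy `RATINGS` data, the function names, and the choice of `k` are all assumptions made for this example.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not data from the thesis.
RATINGS = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob": {"a": 4.0, "b": 3.0, "c": 5.0, "d": 4.0},
    "carol": {"a": 1.0, "b": 5.0, "d": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    # Only co-rated items contribute to the dot product.
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, k=2):
    """Predict a rating as the similarity-weighted mean over the k most
    similar users who have rated the item. Returns None when no such
    neighbour exists."""
    neighbours = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = [(sim, r) for sim, r in neighbours[:k] if sim > 0]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None
    return sum(sim * r for sim, r in top) / total


if __name__ == "__main__":
    print(predict(RATINGS, "alice", "d"))
```

The pairwise similarity step is the expensive part of this approach, since every user may be compared with every other user, which is why the way that work is partitioned across nodes matters in a distributed setting.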
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3


Article ID: 2090982


Article link: http://sikaile.net/shoufeilunwen/xixikjs/2090982.html
