Research on an Iterative MapReduce Computing Model for Cluster Analysis

Published: 2018-05-08 00:29

Topic: clustering algorithms + MapReduce; source: Tianjin University, master's thesis, 2012

[Abstract]: The MapReduce computing model is an efficient approach to large-scale data processing and is widely used in search engines, e-commerce, social networks, and other fields. However, it cannot handle iterative computation efficiently, because the runtime environment is re-initialized on every iteration, static data is reloaded repeatedly, and intermediate results place a heavy load on the network. To address this, the thesis divides data into two classes: medium-scale data, which can be cached in a distributed fashion across the memory of the cluster nodes, and large-scale data, which cannot. It then designs two schemes for improving the efficiency of iterative MapReduce, one for each data scale.

First, the MapCombine scheme improves the efficiency of the MapReduce computing model when it iteratively processes medium-scale data. MapCombine adds data caching to the Combine task, which avoids reloading static data; introduces a new component named Controller to schedule iterations, which avoids re-initializing the distributed environment; and provides an HBase-based interaction layer that persists intermediate data to keep the design robust.

Second, the CycleMap scheme improves the efficiency of the MapReduce computing model when it iteratively processes large-scale data. CycleMap adds a new component named Collector that takes over the work of the Reduce task, which avoids the cost of the sort and shuffle phases. It also runs tasks in a pipelined fashion, so that the entire iterative job needs only a single initialization, which again avoids re-initializing the distributed environment.

Finally, based on these two schemes, the thesis implements three clustering algorithms: K-Means, Fuzzy K-Means, and Dirichlet Process. In performance comparisons against the same clustering algorithms in the MapReduce-based Mahout library, MapCombine and CycleMap achieve speedups of 1.10 and 1.05, respectively.
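To make the iterative pattern described in the abstract concrete, the following is a minimal single-process sketch of K-Means written as map, combine, and reduce steps. It is not the thesis's MapCombine or CycleMap implementation and uses no Hadoop, HBase, or Mahout API. All function names (`map_assign`, `combine`, `reduce_centroids`, `run_kmeans`) are hypothetical, and the in-memory `partitions` list merely stands in for static data cached on cluster nodes. The point it illustrates is that the input partitions are loaded once and reused, while a controller-style loop drives the iterations and only the small centroid list changes between them.

```python
from collections import defaultdict


def map_assign(point, centroids):
    """Map step: return (index of the nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point


def combine(assignments, dim):
    """Combine step: fold (cluster, point) pairs from one partition
    into per-cluster partial sums and counts."""
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for cluster, point in assignments:
        for i, value in enumerate(point):
            sums[cluster][i] += value
        counts[cluster] += 1
    return sums, counts


def reduce_centroids(partials, centroids):
    """Reduce step: merge the partial sums of all partitions
    and compute the new centroids."""
    dim = len(centroids[0])
    total_sums = defaultdict(lambda: [0.0] * dim)
    total_counts = defaultdict(int)
    for sums, counts in partials:
        for cluster, vector in sums.items():
            for i, value in enumerate(vector):
                total_sums[cluster][i] += value
            total_counts[cluster] += counts[cluster]
    updated = []
    for k, old in enumerate(centroids):
        if total_counts[k] == 0:
            updated.append(list(old))  # empty cluster: keep the old centroid
        else:
            updated.append([s / total_counts[k] for s in total_sums[k]])
    return updated


def run_kmeans(partitions, centroids, max_iters=20, tol=1e-9):
    """Controller-style loop (an assumption for illustration): the
    partitions stay in memory across iterations; only centroids change."""
    dim = len(centroids[0])
    iteration = 0
    for iteration in range(1, max_iters + 1):
        partials = [
            combine((map_assign(p, centroids) for p in part), dim)
            for part in partitions
        ]
        updated = reduce_centroids(partials, centroids)
        # Largest squared movement of any centroid in this iteration.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, updated)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids, iteration


if __name__ == "__main__":
    data = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(run_kmeans(data, [[0.0, 0.0], [10.0, 10.0]]))
```

In a plain MapReduce deployment each pass of the loop would typically be a separate job that re-reads its input, which is the overhead the abstract attributes to repeated initialization and static data reloading.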
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP311.13

[Citing literature]

Related master's theses (1)

1 Zhao Xin; Design and Implementation of a Parallel SVM Algorithm for Large-Scale Text Data [D]; Tianjin University; 2013

Article ID: 1859157

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1859157.html
