
Research on Iterative Distributed Data Processing Based on MapReduce

Published: 2018-05-04 18:47

Topic: MapReduce + distributed; Source: Shandong University, 2013 master's thesis

[Abstract]: The information age is the age of data. As data volumes grow rapidly, data processing in many fields has far outgrown the capacity of a personal computer and increasingly demands massive, parallel computation. Traditional parallel programming technologies such as MPI and grid computing are complex to develop for and scale poorly, so they cannot meet the growing demands of large-scale data processing; a new and better programming model for large-scale data processing is urgently needed. MapReduce emerged in response to this challenge.
MapReduce is a distributed programming framework, first proposed by Google, for parallel computation over large data sets. It is simple to program, fault tolerant, and easy to scale, and it greatly simplifies the parallel processing of massive data on clusters. Since its introduction, MapReduce has drawn considerable attention, prompted a large body of related research, and been widely applied in a growing number of practical settings. However, existing conventional MapReduce implementations such as Hadoop and Sphere do not support iterative data processing efficiently, even though iterative computation is a very important class of applications in practice: many algorithms in scientific computing, data mining, information retrieval, and machine learning are implemented through repeated iterations. Improving the efficiency of iterative data processing in MapReduce is therefore a pressing research problem of considerable practical value. To address it, this thesis analyzes the problem in depth and, by extending and modifying Hadoop, proposes an improved MapReduce framework, myHadoop.
myHadoop improves the programming model and the task scheduler, adopts a new task-parallelism strategy, and adds a loop control module and a data cache module. These changes not only extend MapReduce's programming support for iterative programs but also greatly improve execution efficiency. The thesis first analyzes how MapReduce handles iterative programs and the problems that arise, then describes the design and implementation of myHadoop in detail, and finally runs experiments on several typical applications to compare the iterative distributed data processing efficiency of myHadoop and Hadoop. It also discusses how to set the number of Map task splits in myHadoop applications and how myHadoop handles non-iterative data processing.
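
The abstract names a loop control module and a data cache module but does not give myHadoop's API, so the following is only an illustrative, single-process sketch of the general idea and not the thesis's implementation. It assumes the unchanging input can be held in memory and reused across iterations while only a small state is passed from one round to the next. Every name here (`map_reduce`, `iterate`, the k-means helpers) is hypothetical, and 1-D k-means is chosen purely as a familiar iterative algorithm.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce pass: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def iterate(static_data, state, make_mapper, reducer, converged, max_iters=50):
    """Hypothetical loop-control driver: rerun the job until convergence.

    static_data is loaded once and reused in every iteration (standing in for
    a data cache); only the small, changing state moves between iterations.
    Returns the final state and the number of iterations executed.
    """
    for iteration in range(1, max_iters + 1):
        result = map_reduce(static_data, make_mapper(state), reducer)
        # Keys that received no records keep their previous value.
        new_state = {**state, **result}
        if converged(state, new_state):
            return new_state, iteration
        state = new_state
    return state, max_iters


def make_kmeans_mapper(centroids):
    """Build a mapper that assigns each point to its nearest centroid."""
    def mapper(point):
        nearest = min(centroids, key=lambda idx: abs(point - centroids[idx]))
        yield nearest, point
    return mapper


def mean_reducer(_key, values):
    """Recompute a centroid as the mean of the points assigned to it."""
    return sum(values) / len(values)


def centroids_converged(old, new, tol=1e-9):
    """Stop when no centroid moved by more than tol."""
    return all(abs(old[k] - new[k]) <= tol for k in old)


# Usage: cluster six 1-D points into two groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, iterations = iterate(
    points, {0: 1.0, 1: 2.0}, make_kmeans_mapper, mean_reducer, centroids_converged
)
```

In a stock Hadoop setup, each pass of such a loop would typically be submitted as a separate job that rereads its input, which is the kind of overhead that caching invariant data and controlling the loop inside the framework aim to avoid.
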
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.8
