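Research on Data Allocation Strategies for the MapReduce Model
Published: 2018-05-22 13:28
Topics: cloud computing + Hadoop. Source: Huazhong University of Science and Technology, master's thesis, 2013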
[Abstract]: Since cloud computing emerged in 2007, it has become a widely discussed concept in the IT industry in China and abroad. As the Internet has developed and data volumes have grown sharply, storing and processing massive data sets quickly and effectively has become a pressing problem, and this problem is the original driving force behind cloud computing. Cloud computing itself, however, is only a way of thinking: hardware provides the necessary environment, but a programming model that can support the cloud computing idea matters more. The MapReduce parallel programming model proposed by Google provides that software support for processing massive data in the cloud.

Hadoop works in a reliable, efficient, and scalable way, and within a few years it became the mainstream open-source cloud computing platform. It is still a relatively young platform with shortcomings in many areas, so improving it is necessary. Based on an in-depth study of the MapReduce parallel programming model on Hadoop, this thesis proposes a solution to the unbalanced distribution of the intermediate data output on the Map side. The design handles the problem with a two-stage MapReduce job. The first stage samples the source data set in parallel and estimates data information from the sample; on that basis, an allocation strategy called LAB is proposed that distributes the intermediate data evenly. The second stage runs the MapReduce job according to this allocation strategy.

Experiments show that the scheme reduces job running time and balances the load of the input data on the Reduce side, which demonstrates its feasibility and advantages. The scheme makes full use of computing resources, avoids wasting them, and improves program execution efficiency.
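The two-stage idea can be illustrated with a small, self-contained Python sketch. The abstract does not specify how LAB works, so the code below is not the thesis's algorithm: it swaps in a simple greedy heuristic (assign the heaviest estimated key to the currently lightest reducer). All function names, the sampling rate, and the CRC32 fallback for unseen keys are assumptions made for illustration, and the sketch runs in plain Python instead of on Hadoop.

```python
import random
import zlib
from collections import Counter


def sample_key_counts(records, rate, seed=0):
    """Stage 1 (illustrative): sample records and estimate per-key frequency."""
    rng = random.Random(seed)
    sampled = Counter(key for key, _ in records if rng.random() < rate)
    # Scale the sampled counts up to an estimate for the full data set.
    return {key: count / rate for key, count in sampled.items()}


def build_partition_plan(estimated_counts, num_reducers):
    """Greedy stand-in for an allocation strategy (NOT the thesis's LAB):
    take keys from heaviest to lightest and give each one to the reducer
    with the smallest estimated load so far."""
    loads = [0.0] * num_reducers
    plan = {}
    ordered = sorted(estimated_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = min(range(num_reducers), key=loads.__getitem__)
        plan[key] = target
        loads[target] += count
    return plan


def partition(key, plan, num_reducers):
    """Stage 2 (illustrative): route a key using the plan.
    Keys missing from the sample fall back to a deterministic hash."""
    if key in plan:
        return plan[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def reducer_loads(records, plan, num_reducers):
    """Count how many intermediate records each reducer would receive."""
    loads = [0] * num_reducers
    for key, _ in records:
        loads[partition(key, plan, num_reducers)] += 1
    return loads
```

A typical use would be to call `sample_key_counts` on the map output keys with a small `rate`, pass the estimate to `build_partition_plan`, and then compare `reducer_loads` against the loads produced by plain hash partitioning. In a real Hadoop job the role of `partition` would be played by a custom partitioner class; that wiring is omitted here.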
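[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333; TP311.1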
Article ID: 1922251
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1922251.html