Research on the Skew Problem in MapReduce Based on Sampling Techniques
Published: 2018-10-08 11:41
[Abstract]: With the rapid growth of the Internet, information is increasing explosively and massive amounts of data are generated every day; storing and analyzing these data is a major challenge. Cloud computing, a new computing model, has attracted wide attention since its inception. Major IT companies have made cloud computing a primary development strategy, launched their own cloud platforms and services, and achieved notable results.
MapReduce, a parallel processing model for large-scale data, is widely used in cloud computing environments and has been applied in many fields thanks to its ease of use, high scalability, and fault tolerance. However, it cannot handle skewed data effectively. When the data processed by MapReduce are unevenly distributed, some tasks run slower than others. Because the execution time of a job is determined by its slowest task, the completion time of the whole job increases and system performance degrades. This thesis studies the skew problem in MapReduce and proposes a method for handling it.
The starting point is the question of how, when skewed data are present, the intermediate results produced by the Map phase can be partitioned efficiently among the Reduce tasks so that all reducers are load-balanced. The main work is as follows: (1) The frequency distribution of all keys in the input file is computed. Because counting over all the data is expensive, the thesis uses sampling to estimate how often each key occurs, and performs this estimation as a separate MapReduce job. A theoretical analysis of the sampling is given, proving that the extracted sample can stand in for the original input file when estimating key frequencies. (2) Based on the resulting key frequency distribution, two partitioning methods are proposed: Cluster combination and Cluster splitting. The former is more effective when the degree of skew is small, and the latter when the degree of skew is large. (3) Experiments show that sampling a small portion of the data estimates the key frequency distribution quickly, and that both partitioning methods achieve shorter execution times and good load balance across the reducers.
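The abstract does not give the thesis's actual algorithms, so the following is only a minimal sketch of the general idea, under stated assumptions: keys are Bernoulli-sampled and the sample counts scaled up, "combination" is approximated by a greedy largest-first assignment of whole clusters to the least-loaded reducer, and "splitting" by filling reducers sequentially up to an even target load. All function names and parameters (`estimate_key_counts`, `combine_clusters`, `split_clusters`, `sample_rate`) are hypothetical and not taken from the thesis or from Hadoop.

```python
import heapq
import random
from collections import Counter


def estimate_key_counts(keys, sample_rate, seed=0):
    """Bernoulli-sample the keys and scale sample counts by 1 / sample_rate."""
    rng = random.Random(seed)
    sample = Counter(k for k in keys if rng.random() < sample_rate)
    return {k: c / sample_rate for k, c in sample.items()}


def combine_clusters(est_counts, num_reducers):
    """Assign whole clusters (keys), largest first, to the least-loaded reducer.

    Assumed stand-in for "Cluster combination"; returns key -> reducer index.
    """
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(est_counts.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[key] = r
        heapq.heappush(heap, (load + count, r))
    return assignment


def split_clusters(est_counts, num_reducers):
    """Fill reducers one after another up to total / num_reducers records,
    cutting a cluster whenever it does not fit in the current reducer.

    Assumed stand-in for "Cluster splitting"; returns
    key -> list of (reducer index, estimated records) pieces.
    """
    target = sum(est_counts.values()) / num_reducers
    pieces = {}
    r, room = 0, target
    for key, count in sorted(est_counts.items(), key=lambda kv: -kv[1]):
        remaining = count
        pieces[key] = []
        while remaining > 1e-9:
            last = r == num_reducers - 1
            # The last reducer absorbs whatever is left over.
            take = remaining if last else min(remaining, room)
            pieces[key].append((r, take))
            remaining -= take
            room -= take
            if room <= 1e-9 and not last:
                r, room = r + 1, target
    return pieces


if __name__ == "__main__":
    # Skewed toy input: one hot key and many rare keys.
    data = ["hot"] * 50_000 + [f"k{i % 500}" for i in range(50_000)]
    est = estimate_key_counts(data, sample_rate=0.05)
    print("estimated count of 'hot':", est.get("hot"))
    print("pieces of 'hot':", split_clusters(est, num_reducers=4)["hot"])
```

Splitting one key across several reducers only gives correct results when the partial outputs for that key can be merged afterwards, for example when the reduce function is an associative aggregation such as a sum or count. Keeping each key on a single reducer, as in the combination sketch, avoids that requirement but cannot balance the load when one key alone exceeds the even share.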
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP338.6
Document number: 2256606
Document link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2256606.html