Research and Design of a MapReduce Cluster Simulator for Performance Tuning
Published: 2018-05-24 04:43
Topic: cloud computing + MapReduce. Source: Hangzhou Dianzi University, master's thesis, 2013
[Abstract]: Internet applications of all kinds now face the problem of storing and processing massive data, and rapidly growing data volumes pose a major challenge to the scalability of data processing systems. The rise of cloud technologies typified by MapReduce offers a feasible solution for processing data at this scale. As an open-source implementation of the MapReduce framework, Hadoop is increasingly favored by enterprises. On the one hand, it provides HDFS, a reliable and highly scalable storage platform for massive data. On the other hand, it implements the MapReduce framework, which makes parallel applications easier to design and offers a simple, easy-to-use programming framework for large-scale parallel data processing.
However, as Hadoop clusters keep growing, many benchmark tests built on the Hadoop platform fail to reflect the real workload characteristics of production clusters, and building a test cluster of the same scale is expensive. Meanwhile, scheduler performance, an important aspect of Hadoop performance tuning, has always been a focus of attention. As the number of cluster users and jobs keeps increasing, users have differing requirements for job response performance, and job scheduling in shared clusters is becoming an increasingly prominent problem. Many existing schedulers, such as the Fair Scheduler, the Capacity Scheduler, and HOD, fall short when faced with these problems, especially the diversification of job types. Based on an analysis of the principles and techniques of the Hadoop platform, this thesis carries out research in the following two areas:
(1) A workload generation method is proposed. It simulates the real workload of a cluster by analyzing the job types in the real workload and reconstructing its job submission model. This thesis also designs a MapReduce simulator that can simulate a large-scale cluster with a small number of nodes and accurately simulates the job execution process, thereby providing a complete Hadoop cluster performance testing platform that helps solve the problem of testing large-scale clusters. Experiments verify that the workload generation method accurately generates a simulated workload that reflects the real workload, and that the simulator can simulate a large-scale cluster with a small number of nodes while providing fairly accurate job execution simulation.
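The abstract does not describe how the workload generator is built, so the following is only a minimal sketch of the general idea: estimate the job-type mix and the submission rate from a recorded trace, then sample a synthetic workload with the same characteristics. The names `Job`, `fit_workload_model`, and `generate_workload` are hypothetical, and the Poisson arrival model (exponential inter-arrival gaps) is an assumption made for illustration, not the thesis's submission model.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Job:
    submit_time: float  # seconds from the start of the trace
    job_type: str       # label of the job class, e.g. "small" or "large"


def fit_workload_model(trace):
    """Estimate job-type counts and the mean inter-arrival gap from a trace."""
    if len(trace) < 2:
        raise ValueError("need at least two jobs to estimate an arrival rate")
    ordered = sorted(trace, key=lambda j: j.submit_time)
    type_counts = Counter(j.job_type for j in ordered)
    gaps = [b.submit_time - a.submit_time for a, b in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return type_counts, mean_gap


def generate_workload(type_counts, mean_gap, n_jobs, seed=0):
    """Draw a synthetic workload with the same type mix and arrival rate."""
    rng = random.Random(seed)
    types = list(type_counts)
    weights = [type_counts[t] for t in types]
    clock, jobs = 0.0, []
    for _ in range(n_jobs):
        # Assumption: arrivals follow a Poisson process, so gaps are exponential.
        clock += rng.expovariate(1.0 / mean_gap)
        # Sample the job type in proportion to its frequency in the trace.
        jobs.append(Job(clock, rng.choices(types, weights=weights)[0]))
    return jobs
```

A real trace would likely need a richer model, for example per-type input sizes or time-of-day arrival patterns; the fixed seed here only keeps the synthetic workload reproducible between simulator runs.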
(2) To address job diversification, a Static Priority based Preemptive Scheduling Algorithm (SPPSA) is proposed. The algorithm decomposes the scheduling problem into three subproblems: job pool scheduling, job priority scheduling, and task scheduling. It thereby provides fairness and resource control at the job pool level, job responsiveness guarantees, and data locality guarantees. Experiments verify that SPPSA can meet users' differing responsiveness requirements for jobs in large-scale shared clusters, and that the impact of preemption stays within an acceptable range.
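The three-level decomposition can be illustrated with a small sketch. The abstract gives no algorithmic details, so everything below is an assumption made for illustration rather than the actual SPPSA: pools are ordered by how far they are below their slot share, jobs within a pool by static priority, and tasks by whether their input is local to the free slot's host. The preemption rule and all names (`Task`, `Job`, `Pool`, `pick_task`, `pick_victim`) are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    input_host: str          # host that stores this task's input split


@dataclass
class Job:
    name: str
    priority: int            # static priority; a larger value is more urgent
    pending: list = field(default_factory=list)
    running: int = 0         # number of this job's tasks currently running


@dataclass
class Pool:
    name: str
    share: int               # slots this pool is entitled to (must be > 0)
    jobs: list = field(default_factory=list)

    def running(self):
        return sum(j.running for j in self.jobs)


def pick_task(pools, host):
    """Choose the next (job, task) for a free slot on `host`, or None."""
    # Level 1 (pool): the pool furthest below its share goes first.
    candidates = [p for p in pools if any(j.pending for j in p.jobs)]
    if not candidates:
        return None
    pool = min(candidates, key=lambda p: p.running() / p.share)
    # Level 2 (job): highest static priority among jobs with pending tasks.
    job = max((j for j in pool.jobs if j.pending), key=lambda j: j.priority)
    # Level 3 (task): prefer a task whose input is local to this host,
    # otherwise fall back to the first pending task.
    idx = next((i for i, t in enumerate(job.pending) if t.input_host == host), 0)
    task = job.pending.pop(idx)
    job.running += 1
    return job, task


def pick_victim(pools, waiting_priority):
    """Pick a job to preempt: the lowest-priority running job in the pool
    that exceeds its share the most, if its priority is below the waiter's."""
    over = [p for p in pools if p.running() > p.share]
    if not over:
        return None
    pool = max(over, key=lambda p: p.running() / p.share)
    running = [j for j in pool.jobs if j.running > 0 and j.priority < waiting_priority]
    return min(running, key=lambda j: j.priority) if running else None
```

In this sketch preemption only ever targets pools that hold more slots than their share, which is one simple way to bound the cost of preemption; the thesis may use a different criterion.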
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338
Article number: 1927725
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1927725.html