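Research on and Improvement of a Massive Log Data Processing Model on the Hadoop Platform
Topic: Hadoop + hierarchical job scheduling; Source: Zhejiang Sci-Tech University, master's thesis, 2013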
[Abstract]: As computer technology and the Internet spread rapidly into every aspect of production and daily life, data volumes are growing explosively. To meet the processing demands of massive-data applications, parallel computing on large-scale computer clusters has become the main approach. MapReduce is a framework originally designed by Google for running parallel computations on large clusters. It reduces the complexity of concurrent programming and lets developers write distributed programs without understanding the low-level details of the distributed system. Hadoop is an open-source cluster platform that implements MapReduce. It is already in use at many Internet companies and is arguably the most widely used open-source cloud computing platform. However, Hadoop is still a relatively young platform and needs improvement in many areas. The main work and contributions of this thesis are as follows:

1) The architecture and core technologies of the Hadoop platform are studied in depth. The design ideas, implementation, strengths, and weaknesses of Hadoop's existing scheduling algorithms (FIFO, the capacity scheduling algorithm, and the fair scheduling algorithm) are described. To address the problems of FIFO, namely a single scheduling policy, large jobs that tend to wait a long time, and low cluster CPU utilization, a hierarchical scheduling algorithm based on red-black trees (HSBRB) is proposed and introduced into the Hadoop platform.

2) HSBRB uses a red-black tree as the data structure for storing job information. A red-black tree is a highly efficient binary tree that is only approximately balanced, so insertion and deletion remain fast as the number of nodes grows, which in turn improves CPU utilization across the cluster. HSBRB also adopts a hierarchical scheduling model. When multiple users share the cluster, each user is assigned a pool, and each pool holds multiple jobs. This addresses the low cluster resource utilization that results from FIFO being designed only for jobs submitted by a single user.

3) Processing of massive log data. All of the massive log data in this thesis comes from the NBER patent data set. To obtain the number of patents at each citation frequency, a small Hadoop cluster was built, a distributed parallel program was developed on it, and the results were saved to files in a specified directory.

4) To evaluate the performance of HSBRB, two different experimental scenarios were designed to compare Hadoop's existing FIFO scheduler and Fair Scheduler with the HSBRB algorithm proposed here. The experimental results confirm the soundness and effectiveness of HSBRB. Compared with the existing scheduling algorithms, HSBRB reduces job running time and improves CPU utilization more effectively, making it a fairly good task scheduling algorithm.

Finally, the work of the thesis is summarized and directions for future work are discussed.
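The log-processing task in item 3 (counting how many patents fall at each citation frequency) can be illustrated with a minimal sketch. This is not the thesis's program: it is a plain in-memory Python simulation of two map/reduce passes, and it assumes each input line has the form `citing,cited`. The helper names (`map_reduce`, `cited_mapper`, `count_mapper`, `citation_histogram`) and the sample patent numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/shuffle/reduce pass in memory (stand-in for a Hadoop job)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


def cited_mapper(line):
    """Pass 1 map: emit (cited patent, 1) for each 'citing,cited' line (assumed format)."""
    _citing, cited = line.split(",")
    yield cited.strip(), 1


def count_mapper(pair):
    """Pass 2 map: emit (citation count, 1) for each (patent, citation count) pair."""
    _patent, n_citations = pair
    yield n_citations, 1


def sum_reducer(key, values):
    """Reduce for both passes: sum the 1s emitted for a key."""
    return key, sum(values)


def citation_histogram(lines):
    """Return {citation count: number of patents cited that many times}."""
    per_patent = map_reduce(lines, cited_mapper, sum_reducer)
    return dict(map_reduce(per_patent, count_mapper, sum_reducer))


# Made-up sample records, not taken from the NBER data set.
sample = [
    "4000001,3000001",
    "4000002,3000001",
    "4000003,3000001",
    "4000002,3000002",
    "4000003,3000003",
]
histogram = citation_histogram(sample)
print(histogram)  # {1: 2, 3: 1}
```

In the sample, one patent is cited three times and two patents are cited once, so the histogram maps 3 to 1 and 1 to 2. On a real cluster each pass would run as a separate MapReduce job, with the output of the first job written to a directory that the second job reads.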
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP301.6;TP338.6