Research and Improvement of Job Scheduling Algorithms on Hadoop Clusters
Topic: Hadoop cluster  Entry point: job scheduling  Source: Shenyang University of Technology, 2017 master's thesis  Type: degree thesis
[Abstract]: With the arrival of the big data era, cloud computing has drawn close attention from industry and researchers alike. Hadoop is an open-source cloud computing platform developed by Apache. It consists of two main parts: the HDFS distributed file system, which is its foundation, and the MapReduce computing framework, which is its core. MapReduce is mainly responsible for data processing, and job scheduling within the MapReduce framework plays the key role of allocating system resources. The scheduling algorithms that ship with Hadoop each have shortcomings, so it is worthwhile to study those shortcomings and make targeted improvements.
Scheduler performance is an important factor in overall system performance. In a Hadoop cluster, the main performance indicators are data locality and average job completion time. The purpose of a locality scheduling algorithm is to improve data locality in the cluster, reducing network transfer overhead and avoiding congestion. To improve data locality, this thesis proposes a locality scheduling algorithm that defines separate node selection conditions for Map tasks and Reduce tasks. The scheduler works on the data splits stored in HDFS and, as far as possible, runs each task on the node that holds its data.
Under the locality scheduling algorithm, Map tasks finish at different times, so once the Early Shuffle mechanism starts, Reduce tasks sit idle waiting, which increases the average job completion time. To address this, the thesis proposes a new scheduling strategy that preserves data locality and integrates preemption: a Reduce task that is waiting is suspended and its resources are released to other Map tasks, and the Reduce task is rescheduled once the Map tasks have progressed to a certain extent. This keeps the algorithm's data locality while reducing the average job completion time.
Finally, the thesis describes the implementation of the new scheduling algorithm on a Hadoop cluster and compares the locality scheduling strategy with and without integrated preemption. Cluster experiments show that the proposed algorithm raises the average completion degree of local data on each node by 17%, and that integrating the preemptive scheduling strategy lowers the average completion time by 14.12%. The approach improves data locality, reduces network transfer, and shortens the average job completion time.
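The two ideas in the abstract, locality-first placement of Map tasks and suspension of idle Reduce tasks, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not state the actual node selection conditions or the resume point, so the "local node first, otherwise any free node" rule, the names `Node`, `pick_map_node`, and `reduce_action`, and the `resume_threshold` value of 0.8 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node with free task slots and locally stored HDFS blocks."""
    name: str
    free_slots: int
    local_blocks: set = field(default_factory=set)


def pick_map_node(block_id, nodes):
    """Choose a node for a Map task that reads `block_id`.

    Assumed rule: prefer a node with a free slot that already stores the
    block (data-local); otherwise fall back to the first node with a free
    slot. Returns None when no slot is free anywhere.
    """
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None
    for node in candidates:
        if block_id in node.local_blocks:
            return node
    return candidates[0]


def reduce_action(map_progress, reduce_state, reduce_idle, pending_maps,
                  resume_threshold=0.8):
    """Decide what to do with a Reduce task: 'suspend', 'resume', or 'keep'.

    map_progress     -- fraction of the job's Map tasks completed (0.0 to 1.0)
    reduce_state     -- 'running' or 'suspended'
    reduce_idle      -- True if the Reduce task is waiting for Map output
    pending_maps     -- number of Map tasks still waiting for a slot
    resume_threshold -- assumed value; the abstract only says "a certain extent"
    """
    if (reduce_state == "running" and reduce_idle
            and pending_maps > 0 and map_progress < resume_threshold):
        # An idle Reduce task gives its slot back to waiting Map tasks.
        return "suspend"
    if reduce_state == "suspended" and map_progress >= resume_threshold:
        # Enough Map output exists, so the Reduce task is rescheduled.
        return "resume"
    return "keep"


if __name__ == "__main__":
    cluster = [Node("n1", 1, {"b2"}), Node("n2", 1, {"b1"})]
    print(pick_map_node("b1", cluster).name)
    print(reduce_action(0.3, "running", True, 4))
    print(reduce_action(0.9, "suspended", False, 1))
```

In this sketch a Reduce task is suspended only when it is idle and Map tasks are actually waiting for slots, so a Reduce task that is doing useful work is never preempted. A real scheduler would also need to save and restore the Reduce task's shuffle state, which is omitted here.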
[Degree-granting institution]: Shenyang University of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP301.6