Research on Key Techniques for Job Performance Optimization in Distributed Data Processing Systems
Published: 2019-05-29 22:24
【Abstract】: With the growth of data volumes across industries, distributed data processing technology is widely used in industrial data analysis. MapReduce, with its ease of use, simple programming model, strong fault tolerance, and high cost-effectiveness, has become the mainstream distributed processing model and is widely applied to large-scale data analysis. However, as data processing demands keep growing, some of MapReduce's own shortcomings have gradually surfaced; the most common include its large number of configuration parameters, an incomplete task scheduling strategy, low data-locality effectiveness, and unreasonable resource-slot allocation. These deficiencies make MapReduce jobs execute inefficiently. MapReduce job performance tuning improves job performance by remedying these weaknesses so that job execution time is greatly reduced; research on MapReduce job performance optimization therefore has important scientific significance and application value. This thesis studies several key problems in MapReduce job performance optimization. Building on a survey of existing work on job performance optimization, it establishes I/O cost functions to explain the importance of configuration parameters to job execution time, proposes a feature-selection method to pick out the parameters that most affect execution time, and improves execution time by optimizing data locality, replica placement, and task scheduling. The contributions of this thesis are as follows: (1) By constructing functions for the number of I/O bytes read and written and for the number of I/O requests, it is proved that certain configuration parameters directly affect MapReduce job execution time, and it is verified that different configuration parameters affect execution time to different degrees. (2) A clustering feature selection algorithm with a kernel-function penalty (IK-means) is proposed, addressing the difficulty platform administrators face because MapReduce has too many configuration parameters. In IK-means, to judge the influence of each feature parameter accurately, an anisotropic Gaussian kernel replaces the traditional Gaussian kernel; the kernel parameters in different directions (the kernel widths) reflect each feature's importance. A gradient-descent algorithm optimizes the kernel-width vector of the anisotropic Gaussian kernel so that clustering on the selected features comes as close as possible to clustering on the original features, thereby achieving feature selection. To address the sensitivity of clustering feature selection to the choice of initial centers, a globally aware local-density initialization algorithm is proposed. Theoretical proofs and experimental results show that the proposed feature selection algorithm performs well in selecting configuration parameters. (3) A data-locality algorithm based on minimum-weight bipartite matching is proposed, solving the problem of satisfying data locality for multiple tasks simultaneously in MapReduce; a dynamic adaptive replication algorithm is also proposed, which identifies hot data to decide which blocks to replicate in dynamic replica placement. Theoretical arguments and experimental results show that the dynamic adaptive replication algorithm effectively supports the minimum-weight bipartite matching algorithm and improves the effectiveness of multi-task data locality. (4) A task scheduling algorithm that meets user time requirements while optimizing resources is proposed; it uses the time and resource-consumption information in historical job profiles to estimate the execution time and slot consumption of new jobs, not only meeting user time requirements but also solving the problem of excessive resource consumption during MapReduce job execution. The algorithm's effectiveness is verified by theoretical analysis of the job execution process, and the experimental results also confirm its advantages in job execution time and slot resource consumption.
[Abstract]: As data volumes grow across industries, distributed data processing technology is widely used for data analysis. MapReduce offers ease of use, simple programming, strong fault tolerance, and high cost-effectiveness; it has become the mainstream distributed processing model and is widely applied to large-scale data analysis. However, as data processing demands increase, several shortcomings of MapReduce have become apparent. The most common are: a large number of configuration parameters, an incomplete task scheduling strategy, poor data-locality effectiveness, and unreasonable allocation of resource slots. These deficiencies cause MapReduce jobs to execute inefficiently. MapReduce job performance tuning improves job performance by remedying these weaknesses so that job execution time is greatly reduced; research on MapReduce job performance optimization therefore has important scientific significance and application value. This thesis studies several key problems in MapReduce job performance optimization. Building on prior work, it establishes I/O cost functions to show the importance of configuration parameters to job execution time, proposes a feature-selection method to identify the parameters that most affect execution time, and improves execution time through optimized data locality, replica placement, and task scheduling. The contributions are as follows: (1) By constructing functions for the number of I/O bytes read and written and for the number of I/O requests, it is shown that certain configuration parameters directly affect MapReduce job execution time.
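The thesis's actual I/O cost functions are not reproduced on this page. Purely as an illustration of how configuration parameters feed into I/O cost, the sketch below models map-side spill I/O as a function of a sort-buffer size and spill threshold (the parameter names and merge behavior are simplified assumptions, not the thesis's model):

```python
import math

def map_side_io_cost(map_output_mb, io_sort_mb, spill_percent=0.8, merge_factor=10):
    """Rough estimate of map-side spill I/O (MB written) for one map task.

    Hypothetical simplification: map output exceeding the in-memory sort
    buffer is spilled to disk, and spill files are then merged in rounds
    of at most `merge_factor` files each.
    """
    spill_threshold = io_sort_mb * spill_percent
    # Number of spill files produced while the map task runs.
    num_spills = max(1, math.ceil(map_output_mb / spill_threshold))
    # Every spill writes its share of the map output once.
    bytes_written = map_output_mb
    # Each extra merge round re-writes all of the data once more.
    remaining = num_spills
    while remaining > 1:
        remaining = math.ceil(remaining / merge_factor)
        bytes_written += map_output_mb
    return num_spills, bytes_written
```

Under this toy model, enlarging the sort buffer from 100 MB to 1000 MB for a 1000 MB map output cuts the spill count from 13 to 2 and the bytes written from 3000 MB to 2000 MB, which is the kind of direct parameter-to-runtime dependence the cost functions formalize.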
It is also verified that the configuration parameters differ in how strongly they affect job execution time. (2) A clustering-based feature selection algorithm with a kernel-function penalty (IK-means) is proposed, addressing the difficulty administrators face in configuring MapReduce because of its many parameters. In IK-means, an anisotropic Gaussian kernel replaces the conventional Gaussian kernel, so that the kernel parameters in different directions (the kernel widths) reflect each feature's importance. A gradient-descent procedure optimizes the kernel-width vector so that clustering on the selected features comes as close as possible to clustering on the original features, thereby achieving feature selection. To address the sensitivity of clustering-based feature selection to initial centers, a globally aware local-density initialization algorithm is proposed. Theoretical proofs and experimental results show that the proposed feature selection algorithm performs well in selecting configuration parameters. (3) A data-locality algorithm based on minimum-weight bipartite matching is proposed, solving the problem of satisfying data locality for multiple tasks simultaneously in MapReduce; a dynamic adaptive replication algorithm is also proposed, which identifies hot data to decide which blocks to replicate. Theoretical arguments and experimental results show that the dynamic adaptive replication algorithm effectively supports the bipartite minimum-weight matching algorithm and improves the effectiveness of multi-task data locality.
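The idea behind matching-based locality in (3) can be sketched as follows. Tasks and nodes form the two sides of a bipartite graph, and edge weights encode locality cost (the 0/1/2 node-local/rack-local/remote costs here are illustrative, not the thesis's weights). The thesis uses a proper minimum-weight bipartite matching algorithm; the brute-force search below is only to keep the sketch dependency-free and is viable for a handful of tasks:

```python
from itertools import permutations

def assign_tasks(cost):
    """Minimum-weight assignment of tasks to nodes by exhaustive search.

    cost[t][n] is a hypothetical locality cost of running task t on node n
    (e.g. 0 = node-local data, 1 = rack-local, 2 = remote). Returns the
    node chosen for each task and the total cost.
    """
    n_tasks = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(cost[0])), n_tasks):
        total = sum(cost[t][node] for t, node in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# Assigning tasks one at a time greedily can strand later tasks on remote
# nodes; solving all tasks jointly as a matching avoids that.
cost = [
    [0, 1, 2],  # task 0: data local on node 0
    [0, 2, 2],  # task 1: data local on node 0 only
    [1, 0, 2],  # task 2: data local on node 1
]
print(assign_tasks(cost))  # → ([0, 2, 1], 2)
```

In this example a greedy pass that gives node 0 to task 0 forces task 1 (whose data lives only on node 0) to run remotely; the joint matching instead sends task 1's competitor elsewhere and achieves total cost 2.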
(4) A task scheduling algorithm that meets user time requirements while optimizing resources is proposed. It uses the time and resource-consumption information in historical job profiles to estimate the execution time and slot consumption of new jobs, thereby meeting user time requirements while reducing the excessive resource consumption of running MapReduce jobs. The algorithm's effectiveness is verified both by theoretical analysis of the job execution process and by experimental results showing advantages in job execution time and slot resource consumption.
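The profile-driven estimation in (4) can be sketched as a simple admission check: derive a per-slot throughput from historical job profiles, then find the smallest slot allocation whose projected runtime meets the user's deadline. The profile field names (`input_gb`, `slots`, `runtime_s`) and the linear-scaling assumption are illustrative only, not the thesis's actual model:

```python
def admit_job(history, input_gb, deadline_s, free_slots):
    """Pick the smallest slot count that lets a new job meet its deadline.

    Hypothetical sketch: `history` is a list of past job profiles, each a
    dict with 'input_gb', 'slots', 'runtime_s'. Throughput is assumed to
    scale linearly with slots. Returns (slots, projected_runtime_s), or
    None if no allocation within `free_slots` meets the deadline.
    """
    # Average GB processed per slot-second across past jobs.
    rate = sum(h["input_gb"] / (h["slots"] * h["runtime_s"]) for h in history) / len(history)
    # Smallest number of slots whose projected runtime meets the deadline,
    # so the job takes no more resources than its deadline requires.
    for slots in range(1, free_slots + 1):
        projected = input_gb / (rate * slots)
        if projected <= deadline_s:
            return slots, projected
    return None
```

Scanning slot counts upward rather than grabbing all free slots is what ties the two goals of contribution (4) together: the deadline is met, but slot consumption stays minimal so other jobs keep their resources.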
【Degree-granting institution】: Chongqing University
【Degree level】: Doctoral
【Year conferred】: 2016
【CLC number】: TP311.13
Document ID: 2488270
Link: http://sikaile.net/shoufeilunwen/xxkjbs/2488270.html