Research on a Spark Cache Strategy Based on Task Structure Optimization
Published: 2018-08-30 16:58
[Abstract]: The big data computing framework Spark greatly improves task execution efficiency by exploiting memory, but because memory space is limited, Spark tasks often run slowly or even fail due to memory bottlenecks. This is closely tied to shortcomings of the framework itself and to the cache policy used for RDDs (Resilient Distributed Datasets). Since its inception, Spark has used LRU (Least Recently Used) as its cache replacement algorithm, but because Spark's cache scheduler cannot accurately predict how data will be used across the whole job, LRU performs poorly in some cases. To reduce task execution time and improve memory utilization, this thesis parses and optimizes the task structure of Spark, collects data and memory usage over the whole job, and uses the analysis results to improve the existing cache strategy. The thesis first analyzes Spark's existing caching mechanism and compares the impact of different caching methods on task performance, showing through concrete examples that the existing cache policy leaves considerable room for optimization. It then proposes methods for task structure analysis and task structure optimization. For task structure analysis, dynamic analysis extracts the key information of a Spark job, derives the dependency graph of the whole job from the dependencies between RDDs, and records data and memory usage while the job runs. For task structure optimization, once the job information has been obtained, stages are reordered so that uses of the same RDD are clustered more closely during computation, which reduces the probability of memory eviction and improves overall execution efficiency. Building on this analysis and optimization, the thesis introduces the concept of RDD weight and establishes a weight model that combines several factors affecting how an RDD is used, including use count, size, span, the ratio of partitions to cores, and computation cost. Based on this weight model, the thesis proposes a new cache replacement strategy, RWR (RDD Weight Replace), which ensures that relatively more valuable data stays cached in memory during eviction; this improves the cache hit rate and memory utilization, reduces computation errors caused by memory bottlenecks, and to some extent improves the fault tolerance of the Spark framework. Finally, comparative experiments over a variety of workloads, running single jobs, varying cluster configurations, and mixing multiple jobs, compare unmodified Spark against the optimized version. The results show that the proposed task structure optimization and cache replacement strategies effectively improve task execution efficiency.
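For background on the caching methods the thesis compares: Spark exposes caching through RDD.cache() and RDD.persist() with configurable storage levels. A minimal sketch of such a comparison follows, assuming a hypothetical input path; none of this code is taken from the thesis itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-comparison"))

    // Hypothetical input path, used only for illustration.
    val lines = sc.textFile("hdfs:///data/sample.txt")
    val pairs = lines.flatMap(_.split("\\s+")).map((_, 1))

    // MEMORY_ONLY (the default behind cache()): fastest while the RDD fits
    // in memory; partitions that do not fit are recomputed from lineage.
    pairs.persist(StorageLevel.MEMORY_ONLY)
    pairs.reduceByKey(_ + _).count()

    // MEMORY_AND_DISK: evicted partitions spill to disk instead of being
    // recomputed, trading disk I/O against recomputation cost.
    pairs.unpersist(blocking = true)
    pairs.persist(StorageLevel.MEMORY_AND_DISK)
    pairs.reduceByKey(_ + _).count()

    sc.stop()
  }
}
```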
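The task structure analysis described above derives a dependency graph from the dependencies between RDDs. The thesis obtains this by dynamically instrumenting Spark; as a rough approximation only, the same lineage information is reachable through the public RDD API, as in this sketch:

```scala
import org.apache.spark.rdd.RDD
import scala.collection.mutable

object LineageWalk {
  /** Collect parent edges (childId -> parentId) by walking RDD lineage. */
  def dependencyEdges(root: RDD[_]): Seq[(Int, Int)] = {
    val edges = mutable.ArrayBuffer.empty[(Int, Int)]
    val seen = mutable.Set.empty[Int]
    def walk(rdd: RDD[_]): Unit = {
      if (seen.add(rdd.id)) {
        for (dep <- rdd.dependencies) {
          edges += ((rdd.id, dep.rdd.id))
          walk(dep.rdd)
        }
      }
    }
    walk(root)
    edges.toSeq
  }
}
```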
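The abstract names the factors that enter the RDD weight model (use count, size, span, partition-to-core ratio, computation cost) but not the formula itself, so the combination below is a purely illustrative assumption, as is the weight-ordered eviction helper; neither reproduces the thesis's actual model nor Spark's MemoryStore internals.

```scala
// Illustrative stand-ins for the per-RDD statistics the thesis collects.
final case class RddStats(
  useCount: Int,      // remaining references to the RDD in later stages
  sizeBytes: Long,    // size of the cached RDD
  span: Int,          // stage distance over which the RDD is reused
  partitions: Int,    // number of partitions
  totalCores: Int,    // total executor cores in the cluster
  recomputeMs: Long   // estimated cost to rebuild the RDD from lineage
)

object RddWeight {
  // Hypothetical combination: an RDD is worth keeping when it is reused
  // often and is expensive to recompute, and worth less when it is large
  // or its uses are spread far apart. The thesis defines its own model;
  // real coefficients would be tuned empirically.
  def weight(s: RddStats): Double = {
    val coreRatio = s.partitions.toDouble / math.max(s.totalCores, 1)
    (s.useCount * s.recomputeMs * coreRatio) /
      (math.max(s.sizeBytes.toDouble, 1.0) * math.max(s.span, 1))
  }
}

object RwrEviction {
  // Evict cached RDDs in ascending weight order until enough space is
  // freed, instead of evicting the least recently used blocks.
  def selectVictims(cached: Seq[(Int, RddStats, Long)], // (rddId, stats, bytes)
                    bytesNeeded: Long): Seq[Int] = {
    val byWeight = cached.sortBy { case (_, stats, _) => RddWeight.weight(stats) }
    var freed = 0L
    val victims = scala.collection.mutable.ArrayBuffer.empty[Int]
    for ((id, _, bytes) <- byWeight if freed < bytesNeeded) {
      victims += id
      freed += bytes
    }
    victims.toList
  }
}
```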
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP333