Research on Efficiency Optimization Mechanisms for In-Memory MapReduce Systems
Published: 2018-05-30 07:50
Topic: MapReduce + in-memory computing. Source: Huazhong University of Science and Technology, master's thesis, 2016
[Abstract]: In the big data era, data processing and analysis have become critically important. To meet the demand for timely data processing, big data systems based on in-memory computing have become a new research focus. In existing high-performance computing clusters, memory is markedly under-provisioned relative to CPU. When a MapReduce system running on such a cluster handles data-intensive applications, it tends to spill data to disk unnecessarily, incurring avoidable I/O, so memory efficiency urgently needs optimization. When processing large data sets with too many partitions, the hash-based Shuffle mechanism causes excessive file operations and poor memory usage. When partition blocks are too large, however, each task consumes more memory, which easily leads to a performance bottleneck in which CPU and memory usage are poorly coordinated. In addition, the intermediate data assigned to each worker node is unevenly distributed, which easily causes load imbalance and degrades system performance.

To address these problems of MapReduce systems on high-performance computing clusters, the memory-efficiency optimization system for big data processing draws on the characteristics of in-memory computing to propose and implement an optimization scheme for using memory resources efficiently, with the aim of building a fast and efficient big data processing platform. First, the system designs an object-reuse Shuffle mechanism: by reusing file write handles and their associated objects, it solves the problem of memory being requested too quickly when there are too many partitions, and keeps memory usage steady. Second, the system establishes a task distribution mechanism based on feedback, sampling, and decision, which coordinates CPU and memory usage when partition blocks are too large and greatly reduces the I/O overhead of spilling intermediate data to disk. Finally, the system implements a task scheduling mechanism with an embedded load balancer, which ensures that each worker node processes nearly the same amount of intermediate data and minimizes the amount of data transferred over the network.

The proposed scheme is integrated into Spark, is transparent to users, and is fully compatible with existing Spark applications. Tests on typical cases show that, compared with native Spark, the improved system uses memory more efficiently and performs far less disk I/O when processing large data sets, improving total execution time by a factor of 1.25 to 3.18.
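The object-reuse idea in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: `PartitionWriter`, `WriterPool`, and `run_map_tasks` are hypothetical names, the "writers" are in-memory stand-ins for file write handles and their buffers, and tasks are assumed to run one after another on a single core. The sketch only counts how many writer objects are allocated with and without reuse.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionWriter:
    """Hypothetical stand-in for a file write handle plus its buffer."""
    partition: int
    records: list = field(default_factory=list)

    def write(self, record):
        self.records.append(record)

    def reset(self):
        # Clear state so the same object can serve the next map task.
        # A real writer would flush its data to disk before this point.
        self.records.clear()


class WriterPool:
    """Hands out one set of writers per running task and takes it back afterwards."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.free_sets = []  # writer sets returned by finished tasks
        self.allocated = 0   # total writer objects ever created

    def acquire(self):
        if self.free_sets:
            return self.free_sets.pop()
        self.allocated += self.num_partitions
        return [PartitionWriter(p) for p in range(self.num_partitions)]

    def release(self, writers):
        for writer in writers:
            writer.reset()
        self.free_sets.append(writers)


def run_map_tasks(num_tasks, num_partitions, reuse):
    """Run map tasks sequentially; return the number of writer objects allocated."""
    pool = WriterPool(num_partitions)
    for task_id in range(num_tasks):
        writers = pool.acquire()
        for key in range(10):  # each task emits a few dummy records
            writers[key % num_partitions].write((task_id, key))
        if reuse:
            pool.release(writers)  # make the set available to the next task
    return pool.allocated


if __name__ == "__main__":
    print(run_map_tasks(100, 50, reuse=False))
    print(run_map_tasks(100, 50, reuse=True))
```

Under these assumptions, 100 tasks over 50 partitions allocate 5000 writer objects without reuse and 50 with reuse. With several tasks running concurrently, the reused count would instead scale with the number of concurrent tasks times the number of partitions, which is still independent of the total task count.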
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13
Article ID: 1954516
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1954516.html