Research on Efficiency Optimization Mechanisms for In-Memory MapReduce Systems

Published: 2018-05-30 07:50

Topic: MapReduce + in-memory computing; Source: Huazhong University of Science and Technology, master's thesis, 2016

[Abstract]: In the big data era, data processing and analysis have become critically important. To meet the demand for timely data processing, big data systems based on in-memory computing have become a new research focus. Existing high-performance computing clusters are provisioned with markedly less memory relative to their CPU capacity, so when a MapReduce system running on them handles data-intensive applications, it tends to spill data to disk unnecessarily, and memory efficiency urgently needs optimization. When a large data set is split into too many partitions, the hash-based Shuffle mechanism causes excessive file operations and poor use of memory. Conversely, when partition blocks are too large, each task consumes more memory, which tends to create a performance bottleneck from poorly coordinated CPU and memory use. In addition, intermediate data is distributed unevenly across worker nodes, which causes load imbalance and degrades system performance.

The thesis presents a memory-efficiency optimization system for big data processing. Targeting the problems that MapReduce systems exhibit on high-performance computing clusters, and drawing on the characteristics of in-memory computing, it proposes and implements a scheme for using memory resources efficiently in order to build a fast, efficient big data processing platform. First, the system designs an object-reuse Shuffle mechanism: by reusing file write handles and their associated objects, it resolves the problem of memory being requested too quickly when there are too many partitions, and keeps memory usage steady. Second, the system establishes a task dispatch mechanism based on feedback, sampling, and decision, which coordinates CPU and memory use when partition blocks are too large and greatly reduces the I/O overhead of spilling intermediate data to disk. Finally, the system implements a task scheduling mechanism with an embedded load balancer, which ensures that every worker node processes nearly the same amount of intermediate data while minimizing the amount of data transferred over the network.

The proposed scheme is integrated into Spark, is transparent to users, and is fully compatible with existing Spark applications. Tests on typical workloads show that, compared with native Spark, the improved system uses memory more efficiently and performs far less disk I/O when processing large data sets, with a 1.25x to 3.18x improvement in total execution time.
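
The abstract does not spell out how the embedded load balancer assigns intermediate data to worker nodes. As a rough illustration of the balancing goal only, the sketch below uses a generic greedy heuristic (largest partition first, always to the least-loaded worker). This is a stand-in technique, not the thesis's algorithm: the function name, the partition-size dictionary, and the worker numbering are all hypothetical, and the sketch ignores the network-transfer objective entirely.

```python
import heapq


def assign_partitions(partition_sizes, num_workers):
    """Greedily assign shuffle partitions to workers to even out load.

    partition_sizes: dict mapping a (hypothetical) partition id to its
        intermediate data size in bytes.
    num_workers: number of worker nodes, numbered 0..num_workers-1.

    Returns (assignment, loads): which partitions each worker gets, and
    the total bytes each worker ends up processing.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")

    # Min-heap of (current load in bytes, worker id).
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}

    # Place the largest partitions first; each one goes to whichever
    # worker currently holds the least data.
    by_size = sorted(partition_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for partition_id, size in by_size:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(partition_id)
        heapq.heappush(heap, (load + size, worker))

    loads = {
        worker: sum(partition_sizes[p] for p in parts)
        for worker, parts in assignment.items()
    }
    return assignment, loads


if __name__ == "__main__":
    # Made-up partition sizes, for illustration only.
    sizes = {0: 900, 1: 100, 2: 400, 3: 500, 4: 300, 5: 200}
    assignment, loads = assign_partitions(sizes, num_workers=2)
    print(assignment)
    print(loads)
```

A scheduler that also wants to minimize network traffic, as the abstract describes, would additionally need to know where each partition's data already resides; that information is not modeled here.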
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13

Article ID: 1954516


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1954516.html
