Design of an SSD-Oriented Data Persistence Method for Spark

Published: 2018-04-14 00:20

Topics: big data + hybrid storage; Source: Journal of Computer Research and Development, 2017, Issue 06

[Abstract]: Data centers built on hybrid storage that combines solid-state drives (SSDs) and hard disk drives (HDDs) have become a high-performance platform for big data computing. Data center workloads should be able to persist data with different characteristics to SSD or HDD on demand in order to improve overall system performance. Spark is an efficient big data computing framework that is widely used in industry and is particularly well suited to applications involving many iterations of computation, because it can persist intermediate data in memory or on disk; persisting to disk also removes the limit that insufficient memory capacity places on dataset size. However, the current Spark implementation does not provide an explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media according to configuration settings, users cannot choose the storage medium on which a resilient distributed dataset (RDD) is persisted according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This not only becomes a bottleneck for further improving Spark's performance but also seriously limits the performance that a hybrid storage system can deliver. In view of this, the paper proposes, for the first time, an SSD-oriented data persistence strategy. It examines the principles of Spark data persistence, optimizes Spark's persistence architecture for hybrid storage systems, and finally provides dedicated persistence APIs that let users explicitly and flexibly specify the persistence medium of an RDD. Experimental results based on SparkBench show that Spark optimized with this scheme improves performance by 14.02% on average compared with the native version.
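The abstract does not give the signature of the proposed persistence API, so the following is only a minimal sketch of the idea it describes: the caller, rather than a global configuration, chooses the medium on which each block is persisted. Everything here is an assumption made for illustration: `ToyBlockStore`, the `SSD_ONLY` and `HDD_ONLY` level names, the `.blk` file suffix, and the two directories standing in for SSD and HDD mount points are hypothetical and are not part of Spark or of the paper.

```python
import os
import pickle
import tempfile
from enum import Enum


class StorageLevel(Enum):
    # SSD_ONLY and HDD_ONLY are hypothetical names, not stock Spark levels.
    MEMORY_ONLY = "memory"
    SSD_ONLY = "ssd"
    HDD_ONLY = "hdd"


class ToyBlockStore:
    """Toy stand-in for a block manager that honors a per-block medium choice."""

    def __init__(self, ssd_dir, hdd_dir):
        # Each on-disk level maps to a directory assumed to live on that medium.
        self._dirs = {StorageLevel.SSD_ONLY: ssd_dir,
                      StorageLevel.HDD_ONLY: hdd_dir}
        self._memory = {}   # block_id -> in-memory data
        self._levels = {}   # block_id -> StorageLevel chosen by the caller

    def _path(self, block_id, level):
        return os.path.join(self._dirs[level], block_id + ".blk")

    def persist(self, block_id, data, level):
        """Store a block on the medium the caller explicitly asked for."""
        if level is StorageLevel.MEMORY_ONLY:
            self._memory[block_id] = data
        else:
            with open(self._path(block_id, level), "wb") as f:
                pickle.dump(data, f)
        self._levels[block_id] = level

    def get(self, block_id):
        """Read a block back from wherever it was persisted."""
        level = self._levels[block_id]
        if level is StorageLevel.MEMORY_ONLY:
            return self._memory[block_id]
        with open(self._path(block_id, level), "rb") as f:
            return pickle.load(f)

    def medium_of(self, block_id):
        return self._levels[block_id]


if __name__ == "__main__":
    # Temporary directories stand in for real SSD and HDD mount points.
    ssd_dir = tempfile.mkdtemp(prefix="ssd_")
    hdd_dir = tempfile.mkdtemp(prefix="hdd_")
    store = ToyBlockStore(ssd_dir, hdd_dir)
    # A frequently re-read block goes to the SSD, a rarely read one to the HDD.
    store.persist("rdd_0_0", [1, 2, 3], StorageLevel.SSD_ONLY)
    store.persist("rdd_1_0", list(range(1000)), StorageLevel.HDD_ONLY)
    print(store.medium_of("rdd_0_0"), os.listdir(ssd_dir))
```

In stock Spark the closest existing call is `rdd.persist(StorageLevel.DISK_ONLY)`, which writes under the directories listed in `spark.local.dir` without letting the caller pick one of them per RDD. The sketch differs only in making that choice an explicit argument.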
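[Author affiliations]: College of Computer Science and Software Engineering, Shenzhen University; School of Computers, Guangdong University of Technology; State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences); National Computer Network Emergency Response Technical Team/Coordination Center of China; Strategic Consulting Center, Chinese Academy of Engineering
[Funding]: National High Technology Research and Development Program of China (863 Program) (2015AA015305); Natural Science Foundation of Guangdong Province (2014A030313553); Guangdong Province and Ministry Industry-University-Research Project (2013B090500055); Shenzhen Basic Research Discipline Layout Project (JCYJ20150529164656096)
[Classification codes]: TP311.13; TP333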

Article ID: 1746881


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1746881.html
