Research and Implementation of Memory Optimization in the Cluster Computing Engine Spark
Published: 2018-12-16 15:46
[Abstract]: Parallel computing frameworks that keep data in memory between iterations are an active research topic. Compared with traditional computation based on disk and network transfer, using memory reduces data transfer time, and data-intensive tasks can run more than ten times faster. As this new generation of frameworks develops rapidly, making full use of memory, which remains relatively scarce, while keeping tasks efficient has become an urgent problem.

This thesis studies how parallel computing clusters use memory, based on the cluster computing engine Spark. By modeling and analyzing memory behavior, it automates memory-usage decisions and optimizes the replacement policy. This improves task efficiency when resources are limited and makes task efficiency more stable across different cluster environments. The main contributions are as follows.

First, memory policy is automated through semantic analysis of the code: the scheduler automatically identifies valuable datasets (RDDs) and places them in the cache, which avoids cache pollution and reduces the programmer's burden.

Second, using the detailed task information obtained from that semantic analysis, the memory replacement policy is optimized. This includes computing RDD sizes and weights, reordering operations, building a new replacement algorithm that combines a register allocation model with weight information to replace the original LRU algorithm, and making the multi-level cache model more intelligent. A preliminary analysis of memory behavior on heterogeneous clusters is also given.

Finally, experiments verify that the optimized scheme improves the adaptability of tasks to different cluster environments and makes tasks run more efficiently when memory is relatively limited, which makes the system more practical overall. The results are also a useful reference for memory usage in other parallel systems.
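The two ideas in the abstract (choosing which datasets to cache, and replacing LRU with a weight-aware policy) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract does not give the weight formula or the register allocation model, so the sketch assumes that a dataset's weight is simply the number of jobs that read it, and it evicts the entry with the lowest weight per unit of size. The names `cache_candidates` and `WeightedCache`, and the example job and dataset names, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


def cache_candidates(jobs):
    """Return {dataset: number of jobs reading it} for datasets read by 2+ jobs.

    `jobs` maps a job name to the datasets its lineage reads. A dataset that
    only one job reads gains nothing from caching, so it is left out.
    """
    uses = Counter(ds for lineage in jobs.values() for ds in set(lineage))
    return {ds: n for ds, n in uses.items() if n > 1}


@dataclass
class Entry:
    size: int      # memory footprint, arbitrary units
    weight: float  # assumed value of keeping the dataset cached


class WeightedCache:
    """Fixed-capacity cache that evicts the lowest weight-per-size entry.

    Simplification: a new entry is always admitted, even if the evicted
    entries were worth more than the newcomer.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def used(self):
        return sum(e.size for e in self.entries.values())

    def put(self, name, size, weight):
        """Cache `name` and return the names evicted to make room."""
        if not 0 < size <= self.capacity:
            raise ValueError("size must be positive and fit in the cache")
        self.entries.pop(name, None)  # re-inserting replaces the old entry
        evicted = []
        while self.used() + size > self.capacity:
            victim = min(
                self.entries,
                key=lambda n: self.entries[n].weight / self.entries[n].size,
            )
            del self.entries[victim]
            evicted.append(victim)
        self.entries[name] = Entry(size, weight)
        return evicted


# Hypothetical workload: three jobs and the datasets each one reads.
jobs = {
    "iteration_1": ["raw", "parsed", "features"],
    "iteration_2": ["parsed", "features", "model"],
    "iteration_3": ["features", "model"],
}
weights = cache_candidates(jobs)  # "raw" is read once, so it is excluded

cache = WeightedCache(capacity=10)
cache.put("features", size=4, weight=weights["features"])
cache.put("parsed", size=6, weight=weights["parsed"])
print(cache.put("model", size=5, weight=weights["model"]))  # -> ['parsed']
```

In this example an LRU policy would evict `features`, the least recently inserted entry, even though it is read by all three jobs. The weight-aware policy evicts `parsed` instead, because it has the lowest weight per unit of size.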
[Degree-granting institution]: Tsinghua University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333.1
Article ID: 2382595
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2382595.html