Asynchronous Checkpoint Fault-Tolerance Technique Based on Memory Caching
Published: 2018-11-15 20:35
[Abstract]: As high-performance computer systems grow in scale, system reliability becomes an increasingly serious problem. Checkpointing is the most typical fault-tolerance method, but because parallel file system performance has improved relatively slowly and data write bandwidth is low, traditional checkpointing incurs a severe performance penalty. Current systems have abundant compute and storage resources, while parallel file system write bandwidth lags behind. To exploit this, the paper proposes an asynchronous checkpoint fault-tolerance technique based on memory caching. The traditional checkpoint operation is split into two steps: the checkpoint file is first cached in the local memory of the compute node, and an independent helper task then copies the data to the parallel file system. Because local memory bandwidth is high and the helper task runs in parallel with the compute task, the new method greatly reduces the time overhead introduced by checkpointing. Simulations and tests with real applications verify the effectiveness of the asynchronous checkpoint technique.
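The two-step scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `AsyncCheckpointer`, the function `load_checkpoint`, the file naming, and the use of a thread as the helper task and a local directory as a stand-in for the parallel file system are all assumptions made for illustration.

```python
import os
import pickle
import queue
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: cache a checkpoint in memory, flush it in the background."""

    def __init__(self, target_dir):
        # target_dir stands in for the (slow) parallel file system.
        self.target_dir = target_dir
        self._pending = queue.Queue()
        # The helper task is modeled as a thread; this is an assumption.
        self._helper = threading.Thread(target=self._flush_loop, daemon=True)
        self._helper.start()

    def checkpoint(self, step, state):
        # Step 1: serialize the state into local memory. The compute task
        # resumes as soon as this in-memory snapshot exists.
        snapshot = pickle.dumps(state)
        self._pending.put((step, snapshot))

    def _flush_loop(self):
        # Step 2: the helper copies cached snapshots to slow storage while
        # the compute task keeps running.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            step, snapshot = item
            path = os.path.join(self.target_dir, f"ckpt_{step}.bin")
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(snapshot)
            # Rename only after the write completes, so a partially written
            # file is never mistaken for a valid checkpoint.
            os.replace(tmp_path, path)

    def close(self):
        # Wait until every cached checkpoint has reached storage.
        self._pending.put(None)
        self._helper.join()


def load_checkpoint(target_dir, step):
    """Read back a flushed checkpoint for the given step."""
    with open(os.path.join(target_dir, f"ckpt_{step}.bin"), "rb") as f:
        return pickle.loads(f.read())
```

In this sketch the compute task pays only for the in-memory serialization inside `checkpoint`; the file write happens in the helper thread. A checkpoint that is still only in memory is lost if the node fails before the flush completes, so a real system would have to handle that window.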
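[Affiliations]: College of Computer, National University of Defense Technology; North Vehicle Research Institute
[Funding]: National Natural Science Foundation of China (60903059, 61003087, 61170049, 61120106005); National High Technology Research and Development Program of China ("863" Program) (2012AA01A309); "Core Electronic Devices, High-end Generic Chips and Basic Software" National Science and Technology Major Project (2009ZX01036-001-003-001)
[Classification number]: TP302.8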
Article ID: 2334384
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2334384.html