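Research on Fault-Tolerant Design Techniques for Parallel Programs Based on NR-MPI
Published: 2018-05-28 18:02
Topics: high-performance computing + MPI parallel programs. Reference: National University of Defense Technology, master's thesis, 2012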
【Abstract】: With the rapid development of high-performance computing, the scale of high-performance computer (HPC) systems has grown sharply, and the system mean time between failures (MTBF) has dropped accordingly, to well below the running time of large scientific computing programs on HPC systems. This seriously affects system availability. Fault tolerance is an important means of improving the availability of HPC systems. However, the most commonly used fault-tolerance method today, system-level checkpointing, usually incurs a very large overhead and can no longer meet the needs of HPC applications. Application-level checkpointing keeps the overhead under better control, but it still has to reload the failed program, which can be very costly on large-scale systems. MPI is the most widely used parallel programming model in HPC, and NR-MPI is a new, high-performance fault-tolerant MPI. Research on fault-tolerant design techniques for parallel programs based on NR-MPI is therefore of great significance.

Because MPI parallel programs are complex and diverse, it is hard to find a fault-tolerance technique that is both general and efficient. This thesis targets the widely used class of loop-iterative parallel programs and studies two fault-tolerance techniques in depth: data redundancy and node redundancy. The main work is as follows.

First, to compare fault-tolerance techniques, three evaluation metrics are defined: fault-tolerance space overhead, fault-tolerance time overhead, and failure recovery time. To estimate whether a technique suits a given application on a given HPC system, a fault-tolerance time factor is also defined. These definitions provide the theoretical basis for fault-tolerant design of parallel programs based on NR-MPI.

Second, a fault-tolerant parallel algorithm framework based on data redundancy is proposed, the Data Redundancy based Fault Tolerant Framework (DRFTF), and its key issues are analyzed in detail: data backup strategy, global consistency, backup period, and key variables. DRFTF is built on top of the program's original algorithm, so fault tolerance can be added without major changes to that algorithm, and for algorithms in which key variables make up a small share of the data, the fault-tolerance overhead is small.

Third, the algorithms of the NPB and Sweep3D benchmark programs are analyzed, fault-tolerant versions of NPB and Sweep3D are implemented with DRFTF, and the fault-tolerant programs are evaluated through experiments and performance analysis. The experimental results confirm DRFTF's fault-tolerance capability and its low overhead.

Fourth, for algorithms that can maintain a checksum relationship at every loop step, a fault-tolerant parallel algorithm framework based on node redundancy is proposed, the Node Redundancy based Fault Tolerant Framework (NRFTF). NRFTF builds a checksum of the program data and stores it on redundant nodes. The checksum data is updated by the redundant processes without pausing the original algorithm, so the fault-tolerance overhead is very small.

Finally, the parallel Gaussian elimination algorithm is analyzed and a fault-tolerant parallel Gaussian elimination algorithm is designed with NRFTF. Taking HPL, the benchmark used for the TOP500 supercomputer ranking, as an example, a fault-tolerant HPL program is implemented and evaluated through experiments and performance analysis. The experimental results confirm NRFTF's fault-tolerance capability and its very low overhead.
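The checksum relationship that the node-redundancy approach relies on can be illustrated with a small, single-process sketch. The abstract does not give NRFTF's data layout or API, so everything here is an assumption made for illustration: each data column of a small matrix stands in for one compute process, an extra checksum column (the row sums) stands in for the redundant process, and the function names are hypothetical. Gaussian elimination row operations are linear, so applying them to the checksum column as well keeps "checksum = sum of data columns" true after every step, and a lost column can be rebuilt from the survivors.

```python
# Illustrative sketch only; not the thesis's NRFTF implementation.
# Assumption: data column j plays the role of compute process j, and the
# last column plays the role of the redundant (checksum) process.

def add_checksum_column(a):
    """Return a copy of matrix `a` with an extra column holding each row's sum."""
    return [row + [sum(row)] for row in a]

def eliminate_step(m, k):
    """One Gaussian elimination step on pivot k (no pivoting), applied to
    every column, including the checksum column."""
    pivot = m[k][k]
    for i in range(k + 1, len(m)):
        factor = m[i][k] / pivot
        for j in range(len(m[i])):
            m[i][j] -= factor * m[k][j]

def recover_column(m, lost):
    """Rebuild data column `lost` as checksum minus the surviving data columns."""
    n_data = len(m[0]) - 1
    for row in m:
        survivors = sum(row[j] for j in range(n_data) if j != lost)
        row[lost] = row[-1] - survivors

def checksum_holds(m, tol=1e-9):
    """Check that every row's checksum entry equals the sum of its data entries."""
    return all(abs(sum(row[:-1]) - row[-1]) <= tol for row in m)

def demo():
    a = [[4.0, 3.0, 2.0],
         [2.0, 1.0, 3.0],
         [6.0, 5.0, 4.0]]

    # Failure-free reference run.
    reference = add_checksum_column(a)
    for k in range(len(a) - 1):
        eliminate_step(reference, k)

    # Same run, with a simulated loss of column 1 after the first step.
    m = add_checksum_column(a)
    eliminate_step(m, 0)
    for row in m:
        row[1] = None              # the "process" holding column 1 fails
    recover_column(m, lost=1)      # rebuild it from the checksum column
    eliminate_step(m, 1)           # continue without restarting from scratch
    return m, reference

if __name__ == "__main__":
    recovered, reference = demo()
    print(recovered == reference, checksum_holds(recovered))
```

In this toy setting the checksum column is updated with the same multipliers as the data columns and never requires the data columns to stop and write a checkpoint, which is the property the abstract attributes to the redundant processes. A real MPI implementation would also need failure detection and communicator repair, which this sketch does not model.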
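【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master's
【Year degree granted】: 2012
【Classification number】: TP302.8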
Article ID: 1947667
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1947667.html