Research on Fault-Tolerant Design Techniques for Parallel Programs Based on NR-MPI

Published: 2018-05-28 18:02

Topic: high-performance computing + MPI parallel programs; source: master's thesis, National University of Defense Technology, 2012

[Abstract]: With the rapid development of high-performance computing, the scale of high-performance computing (HPC) systems has grown sharply, and the system mean time between failures (MTBF) has dropped accordingly, to well below the running time of large scientific computing programs on HPC systems, which seriously harms system availability. Fault tolerance is an important means of improving the availability of HPC systems. However, system-level checkpointing, the fault-tolerance method in common use today, usually incurs a very large overhead and can no longer meet the needs of HPC applications. Application-level checkpointing keeps the overhead under better control, but it still has to reload the failed program, which can be costly on large-scale systems. MPI is the most widely used parallel programming model in HPC, and NR-MPI is a new, high-performance fault-tolerant MPI, so studying fault-tolerant design techniques for parallel programs based on NR-MPI is of considerable significance. Because MPI parallel programs are complex and diverse, it is hard to find a fault-tolerance technique that is both general and efficient. This thesis targets the widely used class of loop-iterative parallel programs and studies two fault-tolerance techniques in depth: data redundancy and node redundancy. The main contributions are as follows. First, to evaluate fault-tolerance techniques, three metrics are defined: fault-tolerance space overhead, fault-tolerance time overhead, and failure recovery time. To estimate whether a fault-tolerance technique suits a given application on a given HPC system, a fault-tolerance time factor is also defined. This work provides the theoretical basis for NR-MPI-based fault-tolerant design of parallel programs.
Second, a fault-tolerant parallel algorithm framework based on data redundancy is proposed, the Data Redundancy based Fault Tolerant Framework (DRFTF), and its key issues are analyzed in detail: the data backup strategy, global consistency, the backup period, and critical variables. DRFTF builds on the program's original algorithm, so fault tolerance can be added without major changes to that algorithm, and for algorithms in which critical variables make up a small share of the data, the fault-tolerance overhead is small. Third, the algorithms of the NPB and Sweep3D benchmarks are analyzed, fault-tolerant versions of NPB and Sweep3D are implemented with DRFTF, and the fault-tolerant programs are tested and their performance analyzed. The experimental results confirm DRFTF's fault-tolerance capability and its low overhead. Fourth, for algorithms that can maintain a checksum relationship at every loop step, a fault-tolerant parallel algorithm framework based on node redundancy is proposed, the Node Redundancy based Fault Tolerant Framework (NRFTF). NRFTF builds a checksum of the program data and stores it on redundant nodes; the checksum is updated by redundant processes without pausing the original algorithm, so the overhead is very small. Finally, the parallel Gaussian elimination algorithm is analyzed, a fault-tolerant parallel Gaussian elimination algorithm is designed with NRFTF, and, taking HPL, the benchmark used for the TOP500 supercomputer ranking, as an example, a fault-tolerant HPL program is implemented, tested, and analyzed for performance. The experimental results confirm NRFTF's fault-tolerance capability and its very low overhead.
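The checksum idea behind the node-redundancy approach can be illustrated with a small single-process sketch. It is not the thesis's implementation and uses no MPI: it only shows, under simplifying assumptions, why Gaussian elimination can keep a checksum relationship at every step. Each row carries one extra entry equal to the sum of its data entries (standing in for the data held on a redundant node), every row operation is applied to that entry as well, and a lost data column (standing in for a failed process) is rebuilt as the checksum minus the surviving columns. All function names are hypothetical, pivoting is omitted, and pivots are assumed to be nonzero.

```python
from fractions import Fraction


def with_checksum(matrix):
    """Return a copy of `matrix` with one extra column holding each row's sum."""
    return [list(row) + [sum(row)] for row in matrix]


def eliminate_step(aug, k):
    """Eliminate column k below the pivot, updating the checksum column too."""
    pivot_row = aug[k]
    for i in range(k + 1, len(aug)):
        factor = aug[i][k] / pivot_row[k]  # assumes a nonzero pivot
        # The same row operation is applied to data and checksum entries,
        # so each row keeps satisfying: checksum == sum of its data entries.
        aug[i] = [a - factor * p for a, p in zip(aug[i], pivot_row)]


def recover_column(aug, lost):
    """Rebuild data column `lost` as checksum minus the surviving columns."""
    n = len(aug[0]) - 1  # number of data columns; index n is the checksum
    for row in aug:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)


def eliminate(matrix, fail_after_step=None, lost_column=None):
    """Run elimination without pivoting; optionally lose one column mid-run."""
    # Exact rational arithmetic keeps the comparison with a clean run exact.
    aug = with_checksum([[Fraction(v) for v in row] for row in matrix])
    for k in range(len(aug) - 1):
        eliminate_step(aug, k)
        if k == fail_after_step:
            for row in aug:
                row[lost_column] = None  # simulated process failure
            recover_column(aug, lost_column)
    return [row[:-1] for row in aug]  # drop the checksum column
```

Calling `eliminate(m)` and `eliminate(m, fail_after_step=0, lost_column=1)` on the same matrix `m` should return the same upper-triangular result, because recovery happens from the checksum alone and the computation never restarts. A single checksum column tolerates the loss of only one data column at a time.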
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP302.8

Article ID: 1947667

Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1947667.html
