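Research on Key Technologies for the Computation Dependency Problem on an MPI-Based Cloud Computing Platform
Published: 2018-07-05 02:55
Topic: MPI + computation dependency; Source: Wuhan University of Technology, master's thesis, 2014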
[Abstract]: For high-performance computing, clusters built from ordinary commodity computers are becoming an increasingly popular platform. To make full use of a cluster's computing and storage capacity while simplifying the design of distributed parallel applications, research institutions and technology companies have developed a series of distributed parallel computing frameworks and cloud computing platforms. An analysis of their programming models shows, however, that these frameworks and platforms are not suited to jobs with computational dependencies, or at least cannot handle such jobs effectively.

This thesis proposes a directed-graph-based programming model for jobs with computational dependencies. Its core idea is to use a directed graph to express the tasks into which such a job is decomposed and the dependencies among the computations those tasks perform. Based on the structure of the programming model, the thesis analyzes the core processes of the corresponding parallel computing framework and studies the types of dependency between the computations performed by tasks, the representation of those dependencies, and the task scheduling mechanism. On this basis, the parallel computing framework for the programming model is designed and implemented on MPICH, a concrete implementation of the Message Passing Interface (MPI). MPI itself provides no fault-tolerance mechanism. To improve the reliability and availability of the system, the thesis analyzes the strengths and weaknesses of traditional checkpoint-based rollback-recovery protocols and then designs an improved rollback-recovery protocol based on communication-induced checkpointing. The communication-induced checkpointing protocol ensures that a job recovers correctly from its checkpoints. A user-directed checkpointing mechanism, applied when a process takes a checkpoint, effectively reduces failure-free runtime overhead. A three-level fault-tolerant recovery protocol, applied during error recovery, confines recovery to the processes that have a direct dependency relationship with the failed process and leaves the normal execution of other processes undisturbed, which speeds up recovery of the job. To support the three-level recovery protocol for jobs with computational dependencies, the thesis also studies and designs a communication mechanism between Workers that do not share a communication domain. In the end, developers only need to follow the framework's conventions to write and submit a sequential program for each computation vertex (task) together with the dependency graph of the computation vertices. The system then automatically processes the job in a distributed, parallel fashion, including load balancing, task scheduling, return of computation results, and fault tolerance that is transparent to the user.

The prototype of the parallel computing framework for jobs with computational dependencies is deployed on the MPI-based, multi-layer fault-tolerant, high-performance cloud computing platform previously developed by the laboratory, so that the platform supports such jobs. Experimental results show that the prototype system handles jobs with computational dependencies correctly and effectively.
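The directed-graph model described in the abstract can be illustrated with a small sketch. This is not the thesis's framework or its MPICH implementation. It is a minimal Python illustration under stated assumptions: a job is given as a hypothetical `deps` mapping from each task to the tasks it depends on, `schedule_waves` groups tasks so that each group can run in parallel once earlier groups finish, and `recovery_scope` reads "direct dependency relationship with the failed process" as the failed task's immediate prerequisites and immediate dependents. Both function names and that reading are assumptions made here for illustration.

```python
def schedule_waves(deps):
    """Group tasks into waves that can each run in parallel.

    deps maps a task name to the set of tasks it depends on (hypothetical
    input format). Every task in a wave has all prerequisites in earlier waves.
    """
    # Collect every task, including ones that appear only as prerequisites.
    tasks = set(deps)
    for prereqs in deps.values():
        tasks |= set(prereqs)

    # remaining[t] counts the unfinished prerequisites of task t.
    remaining = {t: len(set(deps.get(t, ()))) for t in tasks}
    dependents = {t: set() for t in tasks}
    for task, prereqs in deps.items():
        for p in prereqs:
            dependents[p].add(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    waves, done = [], 0
    while ready:
        waves.append(ready)
        done += len(ready)
        next_ready = []
        for t in ready:
            for d in dependents[t]:
                remaining[d] -= 1
                if remaining[d] == 0:
                    next_ready.append(d)
        ready = sorted(next_ready)

    if done != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return waves


def recovery_scope(deps, failed):
    """Tasks directly linked to `failed` (assumed reading of 'direct dependency')."""
    scope = set(deps.get(failed, ()))                             # its prerequisites
    scope |= {t for t, prereqs in deps.items() if failed in prereqs}  # its dependents
    return scope


if __name__ == "__main__":
    # Hypothetical five-task job used only for illustration.
    job = {
        "load": set(),
        "clean": {"load"},
        "stats": {"clean"},
        "plot": {"clean"},
        "report": {"stats", "plot"},
    }
    print(schedule_waves(job))
    print(sorted(recovery_scope(job, "clean")))
```

In this sketch a failure of `clean` would involve only `load`, `stats`, and `plot` in recovery, while `report` is left alone, which mirrors the abstract's goal of limiting recovery to directly dependent processes.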
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP38
Article ID: 2098605
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2098605.html