
Research on Key Technologies for Computation Dependency Problems on an MPI-Based Cloud Computing Platform

Published: 2018-07-05 02:55

  Topic: MPI + computation dependency. Source: Wuhan University of Technology, master's thesis, 2014


[Abstract]: For high-performance computing, clusters built from ordinary commodity computers are becoming an increasingly popular platform. To make full use of a cluster's computing and storage capacity while simplifying the design of distributed parallel applications, research institutions and technology companies have developed a series of distributed parallel computing frameworks and cloud computing platforms. An analysis of their programming models shows, however, that these frameworks and platforms are not suited to jobs with computation dependencies, or at least cannot handle such jobs effectively.

This thesis proposes a directed-graph-based programming model for jobs with computation dependencies. Its core idea is to use a directed graph to express the tasks into which such a job is decomposed, together with the dependencies between the computations those tasks perform. Based on the structure of the programming model, the thesis analyzes the core processes of the corresponding parallel computing framework and studies the types of dependencies between task computations, the representation of those dependencies, and the task scheduling mechanism. On this basis, the parallel computing framework corresponding to the programming model is designed and implemented on top of MPICH, an implementation of the Message Passing Interface (MPI). MPI itself provides no fault-tolerance mechanism. To improve the system's reliability and availability, the thesis analyzes the strengths and weaknesses of traditional checkpoint-based rollback-recovery protocols and then designs an improved rollback-recovery protocol based on communication-induced checkpointing. The communication-induced checkpointing protocol ensures that a job recovers correctly from its checkpoints. A user-directed checkpointing mechanism, applied when a process takes a checkpoint, effectively reduces failure-free runtime overhead. A three-level fault-tolerant recovery protocol confines recovery to the processes that have a direct dependency relationship with the failed process, leaving other processes to run normally, which speeds up the job's recovery. To support the three-level recovery protocol for jobs with computation dependencies, the thesis studies and designs a communication mechanism between Workers that do not share a communication domain (communicator). In the end, developers only need to follow the framework's conventions to write and submit the sequential program for each computation vertex (task) together with the vertex dependency graph. The system then automatically processes the job in a distributed, parallel fashion, including load balancing, task scheduling, returning computation results, and fault handling that is transparent to the user.

The prototype of the parallel computing framework is deployed on the MPI-based multi-layer fault-tolerant high-performance cloud computing platform previously developed by the laboratory, so that the platform supports jobs with computation dependencies. Experimental results show that the prototype system handles such jobs correctly and effectively.
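The directed-graph model described in the abstract can be illustrated with a small sketch. This is not the thesis's framework, which is built on MPICH and whose scheduler and recovery protocol are not specified on this page. It is a minimal single-process Python illustration, assuming that a job is given as a vertex list plus (prerequisite, dependent) edges. The names `schedule_waves` and `recovery_scope`, and the example task names, are hypothetical. Treating "direct dependency" as both direct predecessors and direct successors is also an assumption.

```python
from collections import defaultdict


def schedule_waves(vertices, edges):
    """Group tasks into waves using Kahn's algorithm.

    Every task in a wave has all of its prerequisites in earlier waves,
    so the tasks inside one wave could run in parallel.
    """
    successors = defaultdict(list)
    pending = {v: 0 for v in vertices}  # unmet prerequisite count per task
    for src, dst in edges:
        successors[src].append(dst)
        pending[dst] += 1

    ready = sorted(v for v, n in pending.items() if n == 0)
    waves, done = [], 0
    while ready:
        waves.append(ready)
        done += len(ready)
        next_ready = []
        for v in ready:
            for w in successors[v]:
                pending[w] -= 1
                if pending[w] == 0:
                    next_ready.append(w)
        ready = sorted(next_ready)

    if done != len(vertices):
        raise ValueError("dependency graph contains a cycle")
    return waves


def recovery_scope(edges, failed):
    """Return the failed task plus its direct neighbours in the graph.

    Assumption: "direct dependency" covers direct predecessors and
    direct successors; the thesis may define the scope differently.
    """
    scope = {failed}
    for src, dst in edges:
        if src == failed:
            scope.add(dst)
        elif dst == failed:
            scope.add(src)
    return scope


# Hypothetical job: one loader, a split, two parallel branches, a merge.
vertices = ["load", "split", "fft_a", "fft_b", "merge"]
edges = [
    ("load", "split"),
    ("split", "fft_a"),
    ("split", "fft_b"),
    ("fft_a", "merge"),
    ("fft_b", "merge"),
]

waves = schedule_waves(vertices, edges)
scope = sorted(recovery_scope(edges, "fft_a"))
print(waves)  # [['load'], ['split'], ['fft_a', 'fft_b'], ['merge']]
print(scope)  # ['fft_a', 'merge', 'split']
```

In this example a failure of `fft_a` leaves `load` and `fft_b` outside the recovery scope, which mirrors the abstract's point that processes without a direct dependency on the failed one keep running normally.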
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP38



Article ID: 2098605


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2098605.html

