Research on Rollback-Recovery Fault-Tolerance Techniques Based on Dependency Tracking and Message Counting

Published: 2019-04-09 10:51

[Abstract]: A large share of scientific research and engineering applications now runs on distributed computing systems. As these systems grow in scale and node count, the probability of a failure at run time grows with them. For a system to keep producing correct results, or to keep meeting application requirements, after a fault or exception, it must be fault tolerant. Rollback-recovery fault tolerance relies on time redundancy rather than node redundancy, and it is the mainstream technique for making high-performance distributed computing reliable. However, while rollback recovery safeguards reliability, it also introduces substantial extra overhead, which largely limits its application and development. Studying ways to reduce the overhead of rollback-recovery protocols and improve execution efficiency is therefore of real significance. This thesis makes two main contributions. First, to address the high message-logging overhead caused by synchronization constraints in traditional message-logging protocols, it proposes a lightweight message-logging protocol based on dependency tracking. The protocol exploits run-time message-passing behavior and uses piggybacking to remove the synchronization constraints from message logging. Message data is kept at the sender, with no constraint imposed. The commit information of each message travels with later messages and is stored at the dependent processes as the dependency relation spreads, which likewise introduces no constraint. The processes that hold commit information track it, so unnecessary transfers are avoided as far as possible and the amount of piggybacked information stays small, which is what makes the protocol lightweight.
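The idea of keeping message data at the sender while piggybacking commit information onto later messages can be illustrated with a minimal Python sketch. This is not the thesis's protocol: the names `Determinant`, `Process`, `told`, and `replay_plan` are hypothetical, message delivery is simulated by a direct method call, and "commit information" is assumed to mean the delivery order of each message.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Assumed form of 'commit information': who delivered what, and in which order."""
    receiver: str
    sender: str
    ssn: int  # sender's send sequence number
    rsn: int  # receiver's delivery sequence number


@dataclass
class Process:
    name: str
    ssn: int = 0
    rsn: int = 0
    send_log: dict = field(default_factory=dict)  # (dest, ssn) -> payload, kept at the sender
    known: set = field(default_factory=set)       # determinants stored at this process
    told: dict = field(default_factory=dict)      # dest -> determinants already passed on

    def send(self, dest, payload):
        self.ssn += 1
        self.send_log[(dest.name, self.ssn)] = payload  # logged locally, no synchronization
        already = self.told.setdefault(dest.name, set())
        # Piggyback only what dest has not yet received from us and does not own itself.
        extra = {d for d in self.known
                 if d not in already and d.receiver != dest.name}
        already |= extra
        dest.deliver(self.name, self.ssn, payload, extra)
        return len(extra)  # size of the piggybacked information

    def deliver(self, sender, ssn, payload, piggyback):
        self.rsn += 1
        self.known.add(Determinant(self.name, sender, ssn, self.rsn))
        self.known |= piggyback  # now a dependent process also holds them


def replay_plan(failed, survivors):
    """Rebuild the failed process's delivery order from the surviving processes."""
    by_name = {p.name: p for p in survivors}
    dets = {d for p in survivors for d in p.known if d.receiver == failed}
    return [(d.rsn, d.sender, by_name[d.sender].send_log[(failed, d.ssn)])
            for d in sorted(dets, key=lambda d: d.rsn)]


a, b, c = Process("a"), Process("b"), Process("c")
a.send(c, "x")
b.send(c, "y")
print(c.send(a, "z"))             # 2: both of c's determinants ride along
print(c.send(a, "w"))             # 0: nothing new to piggyback
print(replay_plan("c", [a, b]))   # [(1, 'a', 'x'), (2, 'b', 'y')]
```

The `told` table is where the tracking pays off: a determinant is sent to a given destination at most once, so the piggybacked set shrinks to zero once the dependents are up to date.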
Experiments show that, compared with the Egida protocol, the proposed protocol reduces both message-logging overhead and checkpoint overhead by about 10%. Second, to address the blocking or high coordination overhead common in existing coordinated checkpointing protocols, the thesis proposes a non-blocking coordinated checkpointing protocol based on message counting. The protocol classifies the run-time state of a process into three types. It exploits the fact that, in a distributed parallel program, a checkpoint is far more likely to be taken than a failure is to occur: using piggybacking and non-blocking execution, it shifts part of the coordination overhead of checkpointing to the rollback-recovery phase that follows a failure. It also marks whether each process has communicated within a checkpoint interval, so that processes do not take unnecessary checkpoints. Together these measures reduce the overall overhead of checkpointing. Experimental results show that the protocol lowers coordinated-checkpointing overhead by 20% to 40% compared with the two-phase checkpointing protocol, and by about 20% compared with the distributed snapshot protocol.
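A rough Python sketch of how message counting can support non-blocking coordinated checkpointing follows. It is an illustration under stated assumptions, not the thesis's protocol: the abstract does not define its three process states, so they are not modeled here, and the names `Proc`, `dirty`, `writes`, and `in_transit` are hypothetical. Each message carries the sender's checkpoint epoch, a process in an older epoch checkpoints before delivering, and the per-channel counters saved with each checkpoint are compared only at recovery time.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Proc:
    name: str
    epoch: int = 0
    dirty: bool = False  # has this process communicated since its last checkpoint?
    sent: Counter = field(default_factory=Counter)  # dest name -> messages sent
    recv: Counter = field(default_factory=Counter)  # source name -> messages delivered
    saved: dict = field(default_factory=dict)       # epoch -> (sent, recv) snapshot
    writes: int = 0                                 # checkpoints actually written

    def checkpoint(self, epoch):
        if epoch <= self.epoch:
            return  # already in this epoch or a newer one
        if self.dirty or not self.saved:
            self.saved[epoch] = (Counter(self.sent), Counter(self.recv))
            self.writes += 1
        else:
            # No communication in this interval: reuse the previous checkpoint.
            self.saved[epoch] = self.saved[self.epoch]
        self.epoch, self.dirty = epoch, False

    def send(self, dest):
        self.sent[dest.name] += 1
        self.dirty = True
        dest.deliver(self.name, self.epoch)  # the epoch is piggybacked on the message

    def deliver(self, source, msg_epoch):
        self.checkpoint(msg_epoch)  # checkpoint first if the sender is in a newer epoch
        self.recv[source] += 1
        self.dirty = True


def in_transit(procs, epoch):
    """At recovery: per channel, messages sent before the sender's checkpoint
    but delivered after the receiver's, which would have to be replayed."""
    pending = {}
    for p in procs:
        for q in procs:
            if p is q:
                continue
            diff = p.saved[epoch][0][q.name] - q.saved[epoch][1][p.name]
            if diff < 0:
                raise ValueError(f"orphan message on {p.name}->{q.name}")
            if diff:
                pending[(p.name, q.name)] = diff
    return pending


a, b, c = Proc("a"), Proc("b"), Proc("c")
for p in (a, b, c):
    p.checkpoint(1)
a.send(b)
a.checkpoint(2)  # a starts epoch 2 without blocking anyone
b.send(a)        # sent in epoch 1, delivered after a's checkpoint
a.send(b)        # carries epoch 2, so b checkpoints before delivering
c.checkpoint(2)  # c never communicated, so nothing new is written
print(a.writes, b.writes, c.writes)  # 2 2 1
print(in_transit([a, b, c], 2))      # {('b', 'a'): 1}
```

In this sketch no process waits for another during checkpointing. The price is paid at recovery, where the counter comparison finds the one message on channel b to a that crossed the checkpoint line and would need to be resent, for example from a sender-side log.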
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP302.7

Article ID: 2455119

Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2455119.html
