基于消息數(shù)目檢驗和消息重排序理論的檢查點算法的研究
發(fā)布時間:2018-11-22 14:14
【摘要】:隨著大型分布式系統(tǒng)的不斷發(fā)展,人們越來越關(guān)注系統(tǒng)的可靠性。例如中國研制的天河一號系統(tǒng)、航空火車等分布式控制交通系統(tǒng)以及基于MPI的FT-MPI系統(tǒng)等。分布式系統(tǒng)不僅關(guān)系到經(jīng)濟社會各方面的發(fā)展,而且與我們每個人息息相關(guān)。分布式系統(tǒng)容錯性的質(zhì)量保證的特性決定了其應用的廣泛性以及重要性。分布式系統(tǒng)的容錯性可以理解為容忍錯誤,消除錯誤影響。分布式系統(tǒng)容錯主要分為前向容錯和后向容錯。考慮到存儲量以及恢復過程,與前向容錯技術(shù)相比,后向容錯技術(shù)在實際應用中更為廣泛。 本課題來自于“基于后向恢復的異構(gòu)分布式系統(tǒng)容錯技術(shù)的研究與實現(xiàn)”的山東省自然科學基金項目。后向容錯技術(shù)分為兩種:基于檢查點的容錯算法與基于消息日志的容錯協(xié)議。如何保存分布式系統(tǒng)的系統(tǒng)狀態(tài)以及當系統(tǒng)失效時如何使進程恢復到全局一致狀態(tài)是后向容錯技術(shù)中的兩個主要問題。現(xiàn)存文獻中存在很多判定分布式全局狀態(tài)一致的方法,但存在不同程度的缺陷。 本文主要創(chuàng)新點及貢獻為: (1)提出消息數(shù)目檢驗模型。通過研究進程間消息接收事件數(shù)目與消息發(fā)送事件數(shù)目的關(guān)系,本文提出消息數(shù)目檢驗模型。在此模型中,若一個全局狀態(tài)中不含孤兒消息,則判定此全局狀態(tài)是一致的。 (2)基于消息數(shù)目檢驗模型,提出一種新的求解包含給定檢查集的最大最小全局一致檢查點算法。此算法首先利用消息數(shù)目檢驗方法判定給定的檢查點集中是否存在孤兒消息。如果存在孤兒消息,則分布式系統(tǒng)中不存在包含給定檢查點集的最大最小全局一致檢查點,減少搜索時間開銷。否則,通過全局搜索算法查找包含給定檢查點集的最大最小全局一致檢查點。 (3)提出了消息重排序理論。首先此理論描述了消息發(fā)送事件與接收事件間的總是在先發(fā)生關(guān)系,并利用進程改進的邏輯時鐘標記事件間的總是在先發(fā)生關(guān)系。其次此理論引入了等價消息接收序列的概念。在消息恢復過程中不存在和進程失效前執(zhí)行結(jié)果完全一致的等價的消息接收序列。最后此理論解決了在樂觀消息日志恢復協(xié)議中,進程的接收消息次序由于故障丟失的問題。 (4)在消息數(shù)目檢驗模型和消息重排序理論的基礎(chǔ)上,提出一種新的消息日志協(xié)議。此消息日志協(xié)議的主要創(chuàng)新點是:①此協(xié)議表明了在樂觀消息日志協(xié)議中,系統(tǒng)故障恢復時未做日志的消息不可能以系統(tǒng)失效前的接收順序準確重現(xiàn)。②使用基于接收的日志協(xié)議,消息做日志事件與消息發(fā)送事件可在消息發(fā)送方異步進行。
[Abstract]:With the development of large distributed system, people pay more and more attention to the reliability of the system. Such as Tianhe 1 system developed in China, distributed traffic control system such as air train and FT-MPI system based on MPI. Distributed system is not only related to economic and social development, but also closely related to each of us. The quality assurance of fault tolerance in distributed systems determines the universality and importance of its applications. The fault tolerance of distributed systems can be understood as tolerating errors and eliminating error effects. Fault tolerance in distributed systems is mainly divided into forward fault tolerance and backward fault tolerance. Considering the storage capacity and recovery process, the backward fault-tolerant technology is more widely used in practice than the forward fault-tolerant technology. This topic comes from the project of Shandong Natural Science Foundation, "Research and implementation of fault-tolerant technology for heterogeneous distributed systems based on backward recovery". Backward fault-tolerant techniques can be divided into two types: fault-tolerant algorithm based on checkpoint and fault-tolerant protocol based on message log. How to preserve the system state of distributed systems and how to restore the process to a globally consistent state when the system fails are two main problems in the backward fault-tolerant technology. There are many methods to determine the consistency of distributed global state in the existing literature, but there are some defects in different degrees. The main innovations and contributions of this paper are as follows: (1) A message number test model is proposed. By studying the relationship between the number of message receiving events and the number of message sending events, this paper proposes a message number verification model. In this model, if there is no orphan message in a global state, it is determined that the global state is consistent. (2) based on the message number test model, a new algorithm for solving the maximum and minimum global uniform checkpointing with a given check set is proposed. The algorithm first uses the number of messages test method to determine whether an orphan message exists in a given checkpoint set. If orphan messages exist in distributed systems there is no maximum and minimum globally uniform checkpoint containing a given set of checkpoints which reduces the search time overhead. Otherwise, a global search algorithm is used to find the maximum and minimum globally uniform checkpoints containing a given set of checkpoints. (3) the theory of message reordering is proposed. Firstly, this theory describes the relationship between the sending and receiving events, and uses the improved logical clock to mark the relationship between the events. Secondly, this theory introduces the concept of equivalent message receiving sequence. In the process of message recovery, there is no equivalent message receiving sequence that is consistent with the result of execution before the process fails. Finally, this theory solves the problem that the order of received messages in the optimistic message log recovery protocol is lost due to failure. (4) based on the message number checking model and message reordering theory, a new message log protocol is proposed. The main innovations of this message logging protocol are: 1 this protocol indicates that in the optimistic message logging protocol, Messages that are not logged during system fault recovery can not be accurately reproduced in the order of receiving before system failure. 2 using the received log protocol, message log events and message sending events can be performed asynchronously at the sender of the message.
【學位授予單位】:山東大學
【學位級別】:碩士
【學位授予年份】:2013
【分類號】:TP302.8
本文編號:2349605
[Abstract]:With the development of large distributed system, people pay more and more attention to the reliability of the system. Such as Tianhe 1 system developed in China, distributed traffic control system such as air train and FT-MPI system based on MPI. Distributed system is not only related to economic and social development, but also closely related to each of us. The quality assurance of fault tolerance in distributed systems determines the universality and importance of its applications. The fault tolerance of distributed systems can be understood as tolerating errors and eliminating error effects. Fault tolerance in distributed systems is mainly divided into forward fault tolerance and backward fault tolerance. Considering the storage capacity and recovery process, the backward fault-tolerant technology is more widely used in practice than the forward fault-tolerant technology. This topic comes from the project of Shandong Natural Science Foundation, "Research and implementation of fault-tolerant technology for heterogeneous distributed systems based on backward recovery". Backward fault-tolerant techniques can be divided into two types: fault-tolerant algorithm based on checkpoint and fault-tolerant protocol based on message log. How to preserve the system state of distributed systems and how to restore the process to a globally consistent state when the system fails are two main problems in the backward fault-tolerant technology. There are many methods to determine the consistency of distributed global state in the existing literature, but there are some defects in different degrees. The main innovations and contributions of this paper are as follows: (1) A message number test model is proposed. By studying the relationship between the number of message receiving events and the number of message sending events, this paper proposes a message number verification model. In this model, if there is no orphan message in a global state, it is determined that the global state is consistent. (2) based on the message number test model, a new algorithm for solving the maximum and minimum global uniform checkpointing with a given check set is proposed. The algorithm first uses the number of messages test method to determine whether an orphan message exists in a given checkpoint set. If orphan messages exist in distributed systems there is no maximum and minimum globally uniform checkpoint containing a given set of checkpoints which reduces the search time overhead. Otherwise, a global search algorithm is used to find the maximum and minimum globally uniform checkpoints containing a given set of checkpoints. (3) the theory of message reordering is proposed. Firstly, this theory describes the relationship between the sending and receiving events, and uses the improved logical clock to mark the relationship between the events. Secondly, this theory introduces the concept of equivalent message receiving sequence. In the process of message recovery, there is no equivalent message receiving sequence that is consistent with the result of execution before the process fails. Finally, this theory solves the problem that the order of received messages in the optimistic message log recovery protocol is lost due to failure. (4) based on the message number checking model and message reordering theory, a new message log protocol is proposed. The main innovations of this message logging protocol are: 1 this protocol indicates that in the optimistic message logging protocol, Messages that are not logged during system fault recovery can not be accurately reproduced in the order of receiving before system failure. 2 using the received log protocol, message log events and message sending events can be performed asynchronously at the sender of the message.
【學位授予單位】:山東大學
【學位級別】:碩士
【學位授予年份】:2013
【分類號】:TP302.8
【參考文獻】
相關(guān)期刊論文 前1條
1 汪東升,沈美明,鄭緯民,裴丹;一種基于檢查點的卷回恢復與進程遷移系統(tǒng)[J];軟件學報;1999年01期
,本文編號:2349605
本文鏈接:http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2349605.html
最近更新
教材專著