基于消息數(shù)目檢驗(yàn)和消息重排序理論的檢查點(diǎn)算法的研究
發(fā)布時(shí)間:2018-11-22 14:14
【摘要】:隨著大型分布式系統(tǒng)的不斷發(fā)展,人們?cè)絹碓疥P(guān)注系統(tǒng)的可靠性。例如中國研制的天河一號(hào)系統(tǒng)、航空火車等分布式控制交通系統(tǒng)以及基于MPI的FT-MPI系統(tǒng)等。分布式系統(tǒng)不僅關(guān)系到經(jīng)濟(jì)社會(huì)各方面的發(fā)展,而且與我們每個(gè)人息息相關(guān)。分布式系統(tǒng)容錯(cuò)性的質(zhì)量保證的特性決定了其應(yīng)用的廣泛性以及重要性。分布式系統(tǒng)的容錯(cuò)性可以理解為容忍錯(cuò)誤,消除錯(cuò)誤影響。分布式系統(tǒng)容錯(cuò)主要分為前向容錯(cuò)和后向容錯(cuò)?紤]到存儲(chǔ)量以及恢復(fù)過程,與前向容錯(cuò)技術(shù)相比,后向容錯(cuò)技術(shù)在實(shí)際應(yīng)用中更為廣泛。 本課題來自于“基于后向恢復(fù)的異構(gòu)分布式系統(tǒng)容錯(cuò)技術(shù)的研究與實(shí)現(xiàn)”的山東省自然科學(xué)基金項(xiàng)目。后向容錯(cuò)技術(shù)分為兩種:基于檢查點(diǎn)的容錯(cuò)算法與基于消息日志的容錯(cuò)協(xié)議。如何保存分布式系統(tǒng)的系統(tǒng)狀態(tài)以及當(dāng)系統(tǒng)失效時(shí)如何使進(jìn)程恢復(fù)到全局一致狀態(tài)是后向容錯(cuò)技術(shù)中的兩個(gè)主要問題,F(xiàn)存文獻(xiàn)中存在很多判定分布式全局狀態(tài)一致的方法,但存在不同程度的缺陷。 本文主要?jiǎng)?chuàng)新點(diǎn)及貢獻(xiàn)為: (1)提出消息數(shù)目檢驗(yàn)?zāi)P。通過研究進(jìn)程間消息接收事件數(shù)目與消息發(fā)送事件數(shù)目的關(guān)系,本文提出消息數(shù)目檢驗(yàn)?zāi)P汀T诖四P椭?若一個(gè)全局狀態(tài)中不含孤兒消息,則判定此全局狀態(tài)是一致的。 (2)基于消息數(shù)目檢驗(yàn)?zāi)P?提出一種新的求解包含給定檢查集的最大最小全局一致檢查點(diǎn)算法。此算法首先利用消息數(shù)目檢驗(yàn)方法判定給定的檢查點(diǎn)集中是否存在孤兒消息。如果存在孤兒消息,則分布式系統(tǒng)中不存在包含給定檢查點(diǎn)集的最大最小全局一致檢查點(diǎn),減少搜索時(shí)間開銷。否則,通過全局搜索算法查找包含給定檢查點(diǎn)集的最大最小全局一致檢查點(diǎn)。 (3)提出了消息重排序理論。首先此理論描述了消息發(fā)送事件與接收事件間的總是在先發(fā)生關(guān)系,并利用進(jìn)程改進(jìn)的邏輯時(shí)鐘標(biāo)記事件間的總是在先發(fā)生關(guān)系。其次此理論引入了等價(jià)消息接收序列的概念。在消息恢復(fù)過程中不存在和進(jìn)程失效前執(zhí)行結(jié)果完全一致的等價(jià)的消息接收序列。最后此理論解決了在樂觀消息日志恢復(fù)協(xié)議中,進(jìn)程的接收消息次序由于故障丟失的問題。 (4)在消息數(shù)目檢驗(yàn)?zāi)P秃拖⒅嘏判蚶碚摰幕A(chǔ)上,提出一種新的消息日志協(xié)議。此消息日志協(xié)議的主要?jiǎng)?chuàng)新點(diǎn)是:①此協(xié)議表明了在樂觀消息日志協(xié)議中,系統(tǒng)故障恢復(fù)時(shí)未做日志的消息不可能以系統(tǒng)失效前的接收順序準(zhǔn)確重現(xiàn)。②使用基于接收的日志協(xié)議,消息做日志事件與消息發(fā)送事件可在消息發(fā)送方異步進(jìn)行。
[Abstract]:With the development of large distributed system, people pay more and more attention to the reliability of the system. Such as Tianhe 1 system developed in China, distributed traffic control system such as air train and FT-MPI system based on MPI. Distributed system is not only related to economic and social development, but also closely related to each of us. The quality assurance of fault tolerance in distributed systems determines the universality and importance of its applications. The fault tolerance of distributed systems can be understood as tolerating errors and eliminating error effects. Fault tolerance in distributed systems is mainly divided into forward fault tolerance and backward fault tolerance. Considering the storage capacity and recovery process, the backward fault-tolerant technology is more widely used in practice than the forward fault-tolerant technology. This topic comes from the project of Shandong Natural Science Foundation, "Research and implementation of fault-tolerant technology for heterogeneous distributed systems based on backward recovery". Backward fault-tolerant techniques can be divided into two types: fault-tolerant algorithm based on checkpoint and fault-tolerant protocol based on message log. How to preserve the system state of distributed systems and how to restore the process to a globally consistent state when the system fails are two main problems in the backward fault-tolerant technology. There are many methods to determine the consistency of distributed global state in the existing literature, but there are some defects in different degrees. The main innovations and contributions of this paper are as follows: (1) A message number test model is proposed. By studying the relationship between the number of message receiving events and the number of message sending events, this paper proposes a message number verification model. In this model, if there is no orphan message in a global state, it is determined that the global state is consistent. (2) based on the message number test model, a new algorithm for solving the maximum and minimum global uniform checkpointing with a given check set is proposed. The algorithm first uses the number of messages test method to determine whether an orphan message exists in a given checkpoint set. If orphan messages exist in distributed systems there is no maximum and minimum globally uniform checkpoint containing a given set of checkpoints which reduces the search time overhead. Otherwise, a global search algorithm is used to find the maximum and minimum globally uniform checkpoints containing a given set of checkpoints. (3) the theory of message reordering is proposed. Firstly, this theory describes the relationship between the sending and receiving events, and uses the improved logical clock to mark the relationship between the events. Secondly, this theory introduces the concept of equivalent message receiving sequence. In the process of message recovery, there is no equivalent message receiving sequence that is consistent with the result of execution before the process fails. Finally, this theory solves the problem that the order of received messages in the optimistic message log recovery protocol is lost due to failure. (4) based on the message number checking model and message reordering theory, a new message log protocol is proposed. The main innovations of this message logging protocol are: 1 this protocol indicates that in the optimistic message logging protocol, Messages that are not logged during system fault recovery can not be accurately reproduced in the order of receiving before system failure. 2 using the received log protocol, message log events and message sending events can be performed asynchronously at the sender of the message.
【學(xué)位授予單位】:山東大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2013
【分類號(hào)】:TP302.8
本文編號(hào):2349605
[Abstract]:With the development of large distributed system, people pay more and more attention to the reliability of the system. Such as Tianhe 1 system developed in China, distributed traffic control system such as air train and FT-MPI system based on MPI. Distributed system is not only related to economic and social development, but also closely related to each of us. The quality assurance of fault tolerance in distributed systems determines the universality and importance of its applications. The fault tolerance of distributed systems can be understood as tolerating errors and eliminating error effects. Fault tolerance in distributed systems is mainly divided into forward fault tolerance and backward fault tolerance. Considering the storage capacity and recovery process, the backward fault-tolerant technology is more widely used in practice than the forward fault-tolerant technology. This topic comes from the project of Shandong Natural Science Foundation, "Research and implementation of fault-tolerant technology for heterogeneous distributed systems based on backward recovery". Backward fault-tolerant techniques can be divided into two types: fault-tolerant algorithm based on checkpoint and fault-tolerant protocol based on message log. How to preserve the system state of distributed systems and how to restore the process to a globally consistent state when the system fails are two main problems in the backward fault-tolerant technology. There are many methods to determine the consistency of distributed global state in the existing literature, but there are some defects in different degrees. The main innovations and contributions of this paper are as follows: (1) A message number test model is proposed. By studying the relationship between the number of message receiving events and the number of message sending events, this paper proposes a message number verification model. In this model, if there is no orphan message in a global state, it is determined that the global state is consistent. (2) based on the message number test model, a new algorithm for solving the maximum and minimum global uniform checkpointing with a given check set is proposed. The algorithm first uses the number of messages test method to determine whether an orphan message exists in a given checkpoint set. If orphan messages exist in distributed systems there is no maximum and minimum globally uniform checkpoint containing a given set of checkpoints which reduces the search time overhead. Otherwise, a global search algorithm is used to find the maximum and minimum globally uniform checkpoints containing a given set of checkpoints. (3) the theory of message reordering is proposed. Firstly, this theory describes the relationship between the sending and receiving events, and uses the improved logical clock to mark the relationship between the events. Secondly, this theory introduces the concept of equivalent message receiving sequence. In the process of message recovery, there is no equivalent message receiving sequence that is consistent with the result of execution before the process fails. Finally, this theory solves the problem that the order of received messages in the optimistic message log recovery protocol is lost due to failure. (4) based on the message number checking model and message reordering theory, a new message log protocol is proposed. The main innovations of this message logging protocol are: 1 this protocol indicates that in the optimistic message logging protocol, Messages that are not logged during system fault recovery can not be accurately reproduced in the order of receiving before system failure. 2 using the received log protocol, message log events and message sending events can be performed asynchronously at the sender of the message.
【學(xué)位授予單位】:山東大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2013
【分類號(hào)】:TP302.8
【參考文獻(xiàn)】
相關(guān)期刊論文 前1條
1 汪東升,沈美明,鄭緯民,裴丹;一種基于檢查點(diǎn)的卷回恢復(fù)與進(jìn)程遷移系統(tǒng)[J];軟件學(xué)報(bào);1999年01期
,本文編號(hào):2349605
本文鏈接:http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2349605.html
最近更新
教材專著