Research on Configurable Fault-Tolerance Techniques for Transient Faults
Topics: transient faults + program analysis; Source: National University of Defense Technology, 2013 doctoral dissertation
[Abstract]: As processor design moves toward smaller transistor feature sizes, lower operating voltages, and higher clock frequencies, the reliability problems caused by transient faults have drawn the attention of the entire computing market. Users in different domains place different demands on system reliability, cost, performance, and power consumption, so processor designers must provide solutions whose reliability and cost both satisfy each user's constraints. To meet this challenge, this dissertation focuses on configurable, low-cost fault-tolerance techniques. To analyze the impact of transient faults and the reliability of the fault-tolerance techniques, it also studies reliability analysis based on fault injection. The work falls into four parts:

1. Faults in a processor's computation units can cause data-flow errors or control-flow errors during program execution. Data-flow error detection is usually based on redundant computation, and reducing the overhead of that redundancy (performance, hardware, and so on) has long been a difficult problem in fault-tolerance research. To address it, the dissertation combines the strengths of software and hardware fault tolerance and proposes a configurable data-flow detection technique, Epipe. Epipe first modifies an existing superscalar pipelined processor to provide a hardware platform that can selectively apply redundant protection to instructions. Because a superscalar processor already has abundant computing resources, the Epipe platform needs very little additional hardware. To reduce the performance overhead of redundant protection, Epipe also uses program analysis to evaluate the importance of each instruction, defined as the probability that a fault in the instruction causes the program to output an incorrect result. At run time, Epipe protects the most important subset of instructions according to the user's performance and reliability requirements. The novelty of Epipe is that it redundantly protects only those instructions whose faults lead to incorrect program output. Faults that lead to system exceptions or timeouts are handled directly by the exception-detection mechanisms already present in the system, and the remaining faults, which do not affect program execution (masked faults), need no handling at all. Classifying faults in this way substantially reduces the number of instructions that need redundant protection. Combined with hardware instruction protection that has low time and space overhead, it allows Epipe to protect program data flow at lower cost.

2. An effective way to detect control-flow errors is software-implemented signature analysis. Existing signature-analysis techniques suffer from excessive time and space overhead and insufficient reliability, and they also lack configurability, so they cannot meet the differing needs of different users. In addition, the redundant code introduced by software detection can itself be hit by errors, and existing control-flow detection techniques have paid little attention to protecting the fault-tolerance mechanism itself. To overcome these shortcomings, the dissertation proposes a configurable control-flow checking algorithm, CFCES. CFCES designs a specially formatted signature for each program block and instruments the block with additional control-flow checking instructions, eliminating the detection blind spots of existing algorithms at small cost. CFCES also introduces an invariant called "equivalence" into the design of the checking mechanism. By checking this invariant, CFCES protects the error-detection mechanism itself against faults at very low cost. Finally, CFCES offers configurable optimizations based on analyzing function importance and adjusting program block size, which let it satisfy different time, space, and reliability constraints. These optimizations improve the fault-tolerance efficiency of CFCES and can also be applied to other signature-based control-flow checking algorithms.

3. Transient faults can occur not only in a processor's computation units but also in its storage units. ECC, which is widely used to protect off-chip memory, is poorly suited to on-chip storage structures: these structures already occupy most of the chip area and are accessed frequently, so ECC protection would add substantial area, performance, and power overhead. Because existing fault-tolerance research offers few reasonable protection schemes for on-chip storage, the dissertation proposes PPS, a low-cost protection technique for one particular on-chip storage structure, the SPM. Although protecting the entire SPM with ECC is expensive, protecting part of the SPM with ECC and allocating that part carefully is still very valuable. PPS first designs a storage architecture in which part of the SPM is ECC-protected (the protected fraction can be chosen according to each application's reliability, performance, and other requirements). It then performs vulnerability analysis on the program variables awaiting allocation and partitions the SPM space into "registers". Finally, it uses a priority-based graph-coloring method to allocate the more vulnerable variables preferentially to the ECC-protected "registers". In this way PPS achieves high storage reliability at low overhead.

4. Fault injection is an effective and widely used reliability analysis method. Its main difficulty is balancing simulation speed against accuracy. Because existing fault-injection techniques do not resolve this trade-off well, the dissertation proposes a new fault-injection framework, Smart Injector. Smart Injector first uses program analysis to remove equivalence-class faults and outcome-determined faults from the fault space. Equivalence-class faults are faults that occur in similar data-flow or control-flow contexts. Such faults tend to cause the same system response, so it is enough to group them into equivalence classes and inject one representative from each class; the other faults in the class can be removed from the fault space. Outcome-determined faults are faults whose system response can be determined by program analysis alone. Smart Injector also develops, for the first time, a fault-outcome prediction technique: by predicting the outcome a fault will produce and the point at which that outcome can be decided, it can judge the result of an injection before the program finishes running, which shortens each individual simulation. By combining fault pruning with outcome prediction, Smart Injector greatly reduces the time cost of fault injection with only a small loss of accuracy.
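The equivalence-class pruning described in part 4 of the abstract can be illustrated with a minimal Python sketch. This is not Smart Injector's implementation, which the abstract does not detail. The `Fault` record, its `context` label, and the functions `prune_equivalent` and `estimate_failure_rate` are hypothetical names introduced here. The sketch also assumes that two faults are equivalent exactly when their context labels match, which simplifies the "similar context" criterion in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """One candidate fault site. All fields are hypothetical."""
    instr_addr: int   # address of the instruction hit by the fault
    bit: int          # flipped bit position
    context: str      # label for the data-flow/control-flow context


def prune_equivalent(faults):
    """Group faults by context and keep one representative per class.

    Returns (representative, class_size) pairs. The size is kept as a
    weight so results can be scaled back to the full fault space.
    """
    classes = defaultdict(list)
    for f in faults:
        classes[f.context].append(f)
    return [(members[0], len(members)) for members in classes.values()]


def estimate_failure_rate(faults, inject):
    """Inject only representatives and weight each outcome by class size.

    `inject` is a caller-supplied function that returns True when the
    injected fault corrupts the program output.
    """
    reps = prune_equivalent(faults)
    total = sum(size for _, size in reps)
    failing = sum(size for rep, size in reps if inject(rep))
    return failing / total if total else 0.0


# Four candidate faults collapse into three equivalence classes,
# so only three injections are simulated instead of four.
faults = [
    Fault(0x100, 0, "loop_counter"),
    Fault(0x100, 1, "loop_counter"),
    Fault(0x104, 0, "dead_store"),
    Fault(0x108, 3, "output_value"),
]
reps = prune_equivalent(faults)
rate = estimate_failure_rate(faults, lambda f: f.context != "dead_store")
```

Keeping the class size alongside each representative is what lets the estimate cover the whole fault space: a representative's outcome is counted once for every fault it stands in for. Any accuracy loss comes from faults in a class that would in fact have behaved differently from their representative.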
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year conferred]: 2013
[Classification number]: TP332
Related doctoral dissertations (top 1)
1 Li Jianli; Research on Configurable Fault-Tolerance Techniques for Transient Faults [D]; National University of Defense Technology; 2013
Article ID: 1906611
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1906611.html