Energy-Aware Fault-Tolerant Algorithms for Workflows in Real-Time Systems
Published: 2020-12-24 06:26
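The explosive growth in demand for scientific computing is the direct driver of the development of high-performance computing (HPC). Greater computing power can greatly advance breakthroughs across scientific fields, but it also poses more challenges for system design. This thesis focuses on two major problems that the HPC field urgently needs to solve: fault tolerance and energy consumption. To supply the computing power that scientific computing requires, the number of computing units in supercomputers has multiplied in recent years, which directly raises the frequency of errors. Fault-tolerance mechanisms are clearly necessary in systems of this size; otherwise, a large program that must run for a long time on many computing units might never complete. On the other hand, budget constraints and environmental concerns require that system energy consumption be reduced. In particular, the time and space redundancy introduced by fault-tolerance mechanisms causes additional energy consumption, while energy-saving techniques usually raise the system failure rate. Any reduction in energy consumption must therefore account for the resulting degradation of system performance and reliability. Against this background, we design scheduling algorithms that trade off execution time, fault tolerance, energy consumption and other factors in order to solve several optimization problems in large-scale HPC systems. Specifically: 1. This thesis studies scheduling and checkpointing strategies (time redundancy) for workflow tasks on large-scale parallel systems, addressing the fault tolerance of the application and the minimization of the schedule length. The solution to this problem consists of two phases: deciding the schedule of tasks on the available resources; ...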
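【Source】: East China Normal University, Shanghai; Project 211 institution; Project 985 institution; directly under the Ministry of Education
【Pages】: 127
【Degree】: Doctoral
【Table of contents】: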
Abstract (Chinese)
Abstract (English)
Introduction
Ⅰ Scheduling and checkpointing workflows for fail-stop errors
1 Framework
1.1 Introduction
1.2 Related work
1.2.1 Soft and silent errors
1.2.2 Fail-stop failures
1.2.3 Branch and bound methods
2 Optimal solutions for special classes of task graphs
2.1 Example
2.2 Preliminaries
2.2.1 Execution model
2.2.2 Fault-tolerance model
2.2.3 Minimal Series Parallel Graphs (M-SPG)
2.2.4 Problem description and proposed approach
2.2.5 Evaluation of expected makespan
2.3 Scheduling M-SPGs
2.4 Placing checkpoints in superchains
2.4.1 From chains to superchains
2.4.2 Checkpointing algorithm
2.4.3 Technical remarks
2.5 The CKPTNONE strategy
2.5.1 #P-completeness
2.5.2 Approximating the makespan
2.6 Experiments
2.6.1 Experimental methodology
2.6.2 Expected makespan
2.7 Conclusion
3 Generic approaches for arbitrary task graphs
3.1 Example
3.2 Scheduling and checkpointing algorithms
3.2.1 Scheduling heuristics
3.2.2 Checkpointing strategies
3.3 Experiments
3.3.1 Experimental methodology
3.3.2 Simulator
3.3.3 Results
3.4 Conclusion
Ⅱ Energy-aware strategies for reliability-oriented real-time task allocation
4 Framework
4.1 Introduction
4.2 Related work
4.2.1 Scheduling real-time applications on homogeneous platforms
4.2.2 Scheduling for heterogeneous platforms
4.2.3 Scheduling real-time applications on heterogeneous platforms
5 Homogeneous platforms
5.1 Previous approach
5.1.1 Optimization problem
5.1.2 Replica sets
5.1.3 Mapping and static schedule
5.1.4 Dynamic schedule
5.2 Motivational example
5.3 New strategies
5.3.1 Replica sets
5.3.2 Mapping and static schedule
5.3.3 Dynamic schedule
5.3.4 Heuristics
5.3.5 Complexity analysis
5.4 Performance evaluation
5.4.1 Experimental methodology
5.4.2 Results
5.5 Conclusion
6 Heterogeneous platforms
6.1 Model
6.1.1 Platform and tasks
6.1.2 Power and energy
6.1.3 Reliability
6.1.4 Optimization objective
6.1.5 Complexity
6.2 Mapping
6.3 Scheduling
6.4 Lower bound
6.5 Performance evaluation
6.5.1 Experimental methodology
6.5.2 Results
6.6 Conclusion
Conclusion
Bibliography
Publications
Article ID: 2935157
Article link: http://sikaile.net/shoufeilunwen/xxkjbs/2935157.html