Energy-Aware Fault-Tolerance Algorithms for Workflows in Real-Time Systems
Published: 2020-12-24 06:26
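The explosive growth in demand for scientific computing is the direct driving force behind the development of high-performance computing (HPC) systems. Greater computing power can greatly advance breakthroughs across scientific fields, but it also poses more challenges for system design. This thesis focuses on two major problems that the HPC field urgently needs to solve: fault tolerance and energy consumption. To supply the computing power that scientific computing requires, the number of computing units in supercomputers has multiplied in recent years, which directly raises the frequency of errors. Fault-tolerance mechanisms are clearly necessary in systems of this size; otherwise, a large program that must run for a long time on many computing units might never complete. On the other hand, budget constraints and environmental concerns require us to reduce system energy consumption, especially because the time and space redundancy introduced by fault-tolerance mechanisms consumes additional energy. At the same time, energy-saving techniques usually raise the system failure rate. Therefore, while reducing energy consumption, we must account for the resulting degradation in system performance and reliability. Against this background, we design scheduling algorithms that trade off execution time, fault tolerance, energy consumption, and other factors to solve several optimization problems in large-scale HPC systems. Specifically: 1. This thesis studies scheduling and checkpointing strategies (time redundancy) for workflow tasks on large-scale parallel systems, addressing application fault tolerance and the minimization of schedule length (makespan). The solution comprises two phases: deciding how tasks are scheduled on the available resources; ...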
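[Source]: East China Normal University, Shanghai; Project 211 university; Project 985 university; directly under the Ministry of Education
[Pages]: 127
[Degree level]: Doctoral
[Table of contents]:
Abstract (Chinese)
Abstract (English)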
Introduction
Ⅰ Scheduling and checkpointing workflows for fail-stop errors
1 Framework
1.1 Introduction
1.2 Related work
1.2.1 Soft and silent errors
1.2.2 Fail-stop failures
1.2.3 Branch and bound methods
2 Optimal solutions for special classes of task graphs
2.1 Example
2.2 Preliminaries
2.2.1 Execution model
2.2.2 Fault-tolerance model
2.2.3 Minimal Series Parallel Graphs (M-SPG)
2.2.4 Problem description and proposed approach
2.2.5 Evaluation of expected makespan
2.3 Scheduling M-SPGs
2.4 Placing checkpoints in superchains
2.4.1 From chains to superchains
2.4.2 Checkpointing algorithm
2.4.3 Technical remarks
2.5 The CKPTNONE strategy
2.5.1 #P-completeness
2.5.2 Approximating the makespan
2.6 Experiments
2.6.1 Experimental methodology
2.6.2 Expected makespan
2.7 Conclusion
3 Generic approaches for arbitrary task graphs
3.1 Example
3.2 Scheduling and checkpointing algorithms
3.2.1 Scheduling heuristics
3.2.2 Checkpointing strategies
3.3 Experiments
3.3.1 Experimental methodology
3.3.2 Simulator
3.3.3 Results
3.4 Conclusion
Ⅱ Energy-aware strategies for reliability-oriented real-time task allocation
4 Framework
4.1 Introduction
4.2 Related work
4.2.1 Scheduling real-time applications on homogeneous platforms
4.2.2 Scheduling for heterogeneous platforms
4.2.3 Scheduling real-time applications on heterogeneous platforms
5 Homogeneous platforms
5.1 Previous approach
5.1.1 Optimization problem
5.1.2 Replica sets
5.1.3 Mapping and static schedule
5.1.4 Dynamic schedule
5.2 Motivational example
5.3 New strategies
5.3.1 Replica sets
5.3.2 Mapping and static schedule
5.3.3 Dynamic schedule
5.3.4 Heuristics
5.3.5 Complexity analysis
5.4 Performance evaluation
5.4.1 Experimental methodology
5.4.2 Results
5.5 Conclusion
6 Heterogeneous platforms
6.1 Model
6.1.1 Platform and tasks
6.1.2 Power and energy
6.1.3 Reliability
6.1.4 Optimization objective
6.1.5 Complexity
6.2 Mapping
6.3 Scheduling
6.4 Lower bound
6.5 Performance evaluation
6.5.1 Experimental methodology
6.5.2 Results
6.6 Conclusion
Conclusion
Bibliography
Publications
Article ID: 2935157
Article link: http://sikaile.net/shoufeilunwen/xxkjbs/2935157.html