Funding: Project supported by the National Natural Science Foundation of China (Nos. 61272141, 61120106005, and 61303068) and the National High-Tech R&D Program of China (No. 2012AA01A301).
Abstract: As the scale of supercomputers grows rapidly, reliability becomes the dominant factor limiting system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively solve this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the 'work-most' (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Analogous to program locality, we observe, for the first time, a failure locality phenomenon in supercomputers. The new proactive fault tolerance mechanism builds on this failure locality: it combines process replication with process prefetching, and significantly reduces the losses caused by failures regardless of whether they were predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms, with up to 10% improvement in application efficiency at common failure prediction accuracies, and that it is effective for petascale systems and beyond.
Funding: Supported by the GHFund A (No. ghfund202107010337).
Abstract: With the scaling up of high-performance computing (HPC) systems in recent years, their reliability has been declining continuously. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then the major approaches and techniques are introduced, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, the challenges posed by system scale and heterogeneous architectures are discussed.
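Abstract: In high-performance geocomputation systems, a failed computing task can have serious consequences, so high-performance geocomputation must provide reliability guarantees. Software fault tolerance models are an effective way to improve the fault tolerance of parallel computing. To address the resource waste of traditional checkpoint/rollback-based fault tolerance strategies, this paper takes parallel terrain analysis as its subject and, building on the software fault tolerance model, proposes a fault tolerance strategy for neighborhood-type algorithms: N-ABFT (Neighboring-Algorithm Based Fault-Tolerant). For neighborhood-type terrain factors, the strategy adds redundant checksum rows or checksum columns to each data block into which the parallel program is partitioned. Finally, a fault-tolerant scheduling algorithm built on N-ABFT is proposed, which effectively improves the system's fault tolerance and reduces the overhead of error detection.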
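The checksum rows and columns mentioned in the last abstract can be illustrated with the classic algorithm-based fault tolerance (ABFT) encoding. The following is only a minimal sketch under stated assumptions, not the N-ABFT implementation: the abstract does not say how checksums interact with neighborhood operations or the scheduler, and the function names `add_checksums`, `locate_error`, and `correct_error` are hypothetical. It shows how one extra row and one extra column per data block allow a single corrupted cell to be located and repaired without rolling back to a checkpoint.

```python
from typing import List, Optional, Tuple

Block = List[List[float]]


def add_checksums(block: Block) -> Block:
    """Return a copy of `block` with a checksum column and a checksum row appended."""
    # Each data row gets its row sum as an extra last element.
    with_col = [row + [sum(row)] for row in block]
    # The extra last row holds the column sums (including the checksum column).
    checksum_row = [sum(col) for col in zip(*with_col)]
    return with_col + [checksum_row]


def locate_error(encoded: Block, tol: float = 1e-9) -> Optional[Tuple[int, int]]:
    """Return (row, col) of a single corrupted data cell.

    Returns None when all checksums agree, or when the mismatch pattern
    is not consistent with exactly one corrupted data cell.
    """
    data_rows = encoded[:-1]
    n_cols = len(encoded[0]) - 1
    bad_rows = [
        i for i, row in enumerate(data_rows)
        if abs(sum(row[:-1]) - row[-1]) > tol
    ]
    bad_cols = [
        j for j in range(n_cols)
        if abs(sum(r[j] for r in data_rows) - encoded[-1][j]) > tol
    ]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(encoded: Block, pos: Tuple[int, int]) -> None:
    """Repair the cell at `pos` in place using its row checksum."""
    i, j = pos
    row = encoded[i]
    others = sum(row[:-1]) - row[j]  # sum of the remaining data cells in the row
    row[j] = row[-1] - others


block = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
encoded = add_checksums(block)
encoded[1][2] = 60.0                 # simulate a soft error in one data cell
position = locate_error(encoded)     # intersection of the bad row and bad column
if position is not None:
    correct_error(encoded, position)
```

In this sketch, detection costs one pass over the block, and the redundancy is one row plus one column per block rather than a full checkpoint, which is the kind of trade-off the abstract contrasts with checkpoint/rollback.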