FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing
Abstract: As the scale of supercomputers grows rapidly, reliability has become the main constraint on system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively solve this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the 'work-most' (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Analogous to program locality, we observe a failure locality phenomenon in supercomputers for the first time. Based on failure locality, the new proactive mechanism combines process replication with process prefetching, effectively avoiding the losses caused by failures regardless of whether they were predicted. Simulations with real failure traces demonstrate that, for common failure prediction accuracy, the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency, and that it is effective for petascale systems and beyond.
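The abstract does not give the formulas of the WM model, so the following is only a minimal sketch of the general idea it describes: at runtime, pick one action from a set of fault tolerance mechanisms using a failure prediction and the application's current status. The action names, the `Costs` fields, and the expected-loss expressions are all hypothetical illustrations, not the paper's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-action overheads, in seconds (not from the paper)."""
    checkpoint: float  # time to write a checkpoint
    restart: float     # time to restart from the last checkpoint
    replicate: float   # overhead of launching a replica process


# Hypothetical action set; the paper's actual set is not listed in the abstract.
ACTIONS = ("none", "checkpoint", "replicate")


def expected_loss(action: str, p_fail: float, work_at_risk: float, costs: Costs) -> float:
    """Expected seconds of lost work for one action (assumed toy model).

    p_fail       -- predicted probability of a failure in the next window
    work_at_risk -- seconds of computation since the last checkpoint
    """
    if action == "none":
        # A failure loses the unsaved work and forces a restart.
        return p_fail * (work_at_risk + costs.restart)
    if action == "checkpoint":
        # Pay the checkpoint now; a failure then costs only the restart.
        return costs.checkpoint + p_fail * costs.restart
    if action == "replicate":
        # Assume the replica masks the failure, leaving only its overhead.
        return costs.replicate
    raise ValueError(f"unknown action: {action}")


def choose_action(p_fail: float, work_at_risk: float, costs: Costs) -> str:
    """Return the action with the smallest expected loss, i.e. the most work kept."""
    return min(ACTIONS, key=lambda a: expected_loss(a, p_fail, work_at_risk, costs))


if __name__ == "__main__":
    costs = Costs(checkpoint=60.0, restart=120.0, replicate=30.0)
    print(choose_action(p_fail=0.01, work_at_risk=600.0, costs=costs))
    print(choose_action(p_fail=0.90, work_at_risk=600.0, costs=costs))
```

Under these assumed costs, a low predicted failure probability favors taking no action, while a high one favors a protective action; which protective action wins depends entirely on the illustrative overheads chosen here.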
Source: Frontiers of Information Technology & Electronic Engineering (信息与电子工程前沿(英文版)), indexed in SCIE, EI, and CSCD; 2018, No. 10, pp. 1273-1290 (18 pages)
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61272141, 61120106005, and 61303068) and the National High-Tech R&D Program of China (No. 2012AA01A301)
Keywords: High-performance computing; Proactive fault tolerance; Failure locality; Process replication; Process prefetching