
Software approaches for resilience of high performance computing systems: a survey (Cited by: 1)

Abstract: With the scaling up of high-performance computing (HPC) systems in recent years, their reliability has been declining continuously. Therefore, system resilience has been regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then the major approaches and techniques are introduced, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, the challenges posed by system scale and heterogeneous architectures are discussed.
Source: Frontiers of Computer Science (indexed in SCIE, EI, CSCD), 2023, Issue 4, pp. 43-56 (14 pages). Chinese journal title: 中国计算机科学前沿(英文版).
Funding: Supported by the GHFund A (No. ghfund202107010337).