User-level failure detection and auto-recovery of parallel programs in HPC systems
Authors: Guozhen ZHANG, Yi LIU, Hailong YANG, Jun XU, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, Issue 6, pp. 31-42 (12 pages)
As the mean time between failures (MTBF) continues to decline with the growing number of components in large-scale high performance computing (HPC) systems, program failures are likely to occur during execution. Ensuring the successful execution of HPC programs has therefore become a concern for unprivileged users. From the user's perspective, a program failure that is not detected and handled in time wastes resources and delays the progress of program execution. Unfortunately, unprivileged users cannot check program state themselves, because execution is controlled by the job management system and their privileges are limited. Automated tools that support user-level failure detection and auto-recovery of parallel programs in HPC systems are currently missing. This paper proposes a method that lets an unprivileged user detect failures of job execution and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on Tianhe-2, and that the communication and performance overhead it introduces is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.
Keywords: high performance computing; parallel program; failure detection; failure auto-recovery
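
The detect-and-resubmit idea in the abstract can be illustrated with a minimal sketch. This is not ARL's implementation: the function `check_once`, the state labels in `FAILED_STATES`, and the `query_state` and `resubmit` callables are all hypothetical stand-ins for whatever the job management system offers an unprivileged user. The dual-checker arrangement in the usage note is likewise only one possible reading of the abstract.

```python
from typing import Callable, Dict, List

# Hypothetical failure labels; a real job management system reports its own state names.
FAILED_STATES = {"FAILED", "NODE_FAIL", "TIMEOUT"}


def check_once(
    jobs: Dict[str, int],
    query_state: Callable[[int], str],
    resubmit: Callable[[str], int],
) -> List[str]:
    """Run one polling pass over the tracked jobs.

    `jobs` maps a logical job name to its current scheduler job id.
    Every job whose state is in FAILED_STATES is resubmitted, and its
    entry is updated to the new job id. Returns the names of the jobs
    that were resubmitted, in iteration order.
    """
    recovered = []
    # Iterate over a snapshot so that entries can be updated in the loop.
    for name, job_id in list(jobs.items()):
        if query_state(job_id) in FAILED_STATES:
            jobs[name] = resubmit(name)  # assumed to return the new job id
            recovered.append(name)
    return recovered
```

Under these assumptions, a checker would itself run as an ordinary job and call `check_once` periodically. A dual-checker setup could be approximated by giving a primary checker the user job plus a backup checker to track, and giving the backup checker only the primary to track, so that the loss of either checker is noticed by the other.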