Abstract
To ensure that parallel programs run correctly in a parallel environment, system reliability must be improved, which motivates applying fault-tolerance techniques to parallel computing. This paper proposes a design method for a fault-tolerant system for MPI parallel programs. The design adopts checkpoint/rollback recovery and adds a failure-detection function, so that it can effectively handle node failures and process failures and achieve fault tolerance within a certain range, providing a usable application model for large-scale computation in an MPI environment.
Author
LI Fei-fei (李飞飞), Northeast Dianli University, Jilin 132012, China
Source
Computer Knowledge and Technology (《电脑知识与技术》)
2011, No. 2, pp. 817-819 (3 pages)
Keywords
MPI parallel program
fault tolerance
checkpoint/rollback recovery
failure detection