Abstract
A fault-tolerant run-time system was designed and implemented for message passing interface (MPI) parallel programs on clusters, to ensure the reliability and availability of large-scale cluster systems. The system uses checkpointing and rollback recovery and proposes an "exit and re-enter" parallel environment strategy based on memory exclusion. On this basis it provides user-level fault tolerance that is fully transparent to user programs, process migration, and automatic system reconfiguration. Experimental results show that the overhead of checkpointing and recovery is less than 10%, which meets the requirements of fault tolerance for large-scale parallel programs. The system improves the reliability and availability of clusters, and its design structure and implementation scheme can be readily ported to other message passing systems.
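The checkpointing and rollback recovery idea named in the abstract can be illustrated with a minimal single-process sketch. This is not the system described in the paper: it uses no MPI, no memory exclusion, and no process migration, and every name in it (`save_checkpoint`, `load_checkpoint`, `run`, `demo`, `ckpt.pkl`) is hypothetical. It only shows the basic pattern of saving state periodically and resuming from the last saved state after a failure.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write leaves the previous checkpoint intact


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is an illustrative knob that simulates a crash at that step.
    """
    # Rollback recovery: start from the last checkpoint if there is one.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]


def demo():
    """Crash one run partway through, then restart and finish from the checkpoint."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "ckpt.pkl")
    try:
        run(path, total_steps=100, interval=10, fail_at=57)
    except RuntimeError:
        pass  # work done after the last checkpoint is lost
    resumed_from = load_checkpoint(path)["step"]
    result = run(path, total_steps=100, interval=10)
    return resumed_from, result
```

Calling `demo()` crashes the first run at step 57, so the second run rolls back to the checkpoint taken at step 50 and still returns the full sum. The checkpoint interval trades run-time overhead against the amount of work lost on failure, which is the kind of overhead the abstract reports as under 10% for the real system.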
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journal
2006, No. 1, pp. 67-69, 110 (4 pages)
Journal of Tsinghua University (Science and Technology)
Funding
National "863" High-Tech Program of China (2002AA1Z2103)
Keywords
fault tolerance
checkpointing
rollback recovery
message passing interface
parallel application