Fault tolerance for cluster-oriented MPI parallel applications

Cited by: 1
Abstract: To guarantee the reliability and availability of large-scale cluster systems, a fault-tolerant runtime system was designed and implemented for message passing interface (MPI) parallel applications running on clusters. The system uses checkpointing and rollback recovery and proposes a memory-exclusion-based "exit and reenter" parallel environment strategy. On this basis it provides fault tolerance that is fully transparent to user programs, together with process migration and automatic system reconfiguration. Experimental results show that the overhead of checkpointing and of system recovery is less than 10%, which meets the requirements of fault tolerance for large-scale parallel programs. The system improves the reliability and availability of cluster systems, and its design and implementation can be readily ported to other message-passing systems.
Source: Journal of Tsinghua University (Science and Technology), 2006, No. 1, pp. 67-69, 110 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National "863" High-Tech Program of China (2002AA1Z2103)
Keywords: fault tolerance; checkpointing; rollback recovery; message passing interface; parallel application