A Fault-Tolerant System for Cluster-Oriented Message-Passing Parallel Programs (cited by: 1)

Fault tolerance for cluster-oriented MPI parallel applications

Abstract: To guarantee the reliability and availability of large-scale cluster systems, a fault-tolerant runtime system was designed and implemented for cluster-oriented message passing interface (MPI) parallel applications. The system uses checkpointing and rollback recovery, and proposes an "exit and re-enter" parallel environment strategy based on memory exclusion. It provides fault tolerance that is fully transparent to user programs, together with process migration and automatic system reconfiguration. Experimental results show that the checkpointing and recovery overhead is less than 10%, which meets the requirements of fault tolerance for large-scale parallel programs. The system improves cluster reliability and availability, and its structure and implementation approach can be readily ported to other message-passing systems.
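The checkpointing and rollback recovery technique named in the abstract can be illustrated with a minimal, single-process sketch. This is not the system described in the paper, which works transparently at the process level for MPI programs. It is an application-level toy under stated assumptions: the function names (`save_checkpoint`, `load_checkpoint`, `run`, `demo`), the `fail_at` fault injector, and the state layout are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the new file in as a single step on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, saving a checkpoint every `interval` steps.

    fail_at is a hypothetical fault injector: the run aborts at that step.
    """
    # Rollback recovery: resume from the last checkpoint if there is one.
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]


def demo():
    """Fail partway through, then restart and finish from the checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        try:
            run(path, total_steps=100, interval=10, fail_at=57)
        except RuntimeError:
            pass  # the "node" failed; work after the last checkpoint is lost
        resumed_from = load_checkpoint(path, {"step": 0, "acc": 0})["step"]
        result = run(path, total_steps=100, interval=10)
        return resumed_from, result
```

In this sketch the failure at step 57 discards only the work done since the checkpoint at step 50, and the restarted run still produces the full sum. The checkpoint interval trades the cost of writing state against the amount of work redone after a failure, which is the kind of overhead the abstract reports as below 10%.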

Authors: Xue Ruini (薛瑞尼), Zhang Youhui (张悠慧), Chen Wenguang (陈文光), Zheng Weimin (郑纬民)

Affiliation: Department of Computer Science and Technology, Tsinghua University

Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报（自然科学版）》), 2006, No. 1, pp. 67-69, 110 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National "863" High-Tech Program of China (2002AA1Z2103)

Keywords: fault tolerance; checkpointing; rollback recovery; message passing interface; parallel application

Classification: TP302.8 [Automation and Computer Technology: Computer System Architecture]
