
Implementation and Evaluation of MPI Checkpointing System over Lustre File System (cited by: 4)
Abstract: Rollback-recovery based on coordinated checkpointing is an important fault-tolerance technique that has been adopted in large-scale parallel computer systems. Its overhead is determined mainly by two factors: the coordination protocol and the storage of checkpoint images. This paper describes an application-transparent parallel checkpointing system implemented in MPICH2. Compared with existing techniques, the system has three distinguishing features: 1) the coordination protocol exploits the near-neighbor communication pattern of parallel applications and uses a virtual connection method to reduce the number of internal messages exchanged during coordination, and hence the protocol processing overhead; 2) checkpoint images are stored on the Lustre file system, which simplifies checkpoint file management; 3) parallel I/O is used in the image storage stage to improve performance. Tests with real applications indicate that the system incurs low runtime overhead and scales well.
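The first feature can be made concrete with a rough message count. The following is only an illustrative sketch, not the paper's protocol: it assumes one coordination marker per directed channel, and it assumes that the application communicates only with its four nearest neighbors on a periodic 2D process grid. The function names are hypothetical and chosen for this example.

```python
def all_pairs_markers(n):
    """Markers sent if every process flushes a channel to every other process."""
    return n * (n - 1)


def grid_neighbors(rank, rows, cols):
    """Ranks of the nearest neighbors of `rank` on a periodic rows x cols grid.

    Assumption for illustration: a 4-point stencil with wrap-around edges.
    """
    r, c = divmod(rank, cols)
    candidates = {
        ((r - 1) % rows) * cols + c,  # up
        ((r + 1) % rows) * cols + c,  # down
        r * cols + (c - 1) % cols,    # left
        r * cols + (c + 1) % cols,    # right
    }
    # On very small grids a neighbor can wrap around to the rank itself.
    return candidates - {rank}


def neighbor_markers(rows, cols):
    """Markers sent if each process only flushes channels it actually uses."""
    return sum(len(grid_neighbors(rank, rows, cols)) for rank in range(rows * cols))
```

Under these assumptions, 64 processes arranged as an 8 x 8 grid need 4032 markers when every pair of processes is coordinated, but only 256 when coordination is restricted to neighbor channels. The gap widens as the process count grows, since the first count is quadratic in the number of processes and the second is linear.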
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2007, No. 10, pp. 1709-1716 (8 pages)
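Funding: National Natural Science Foundation of China (60621003, 60573135); National High Technology Research and Development Program of China ("863" Program) (2006AA01A106)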
Keywords: fault tolerance; MPICH2; rollback-recovery; coordinated checkpoint; Lustre file system

