
The Design of Fault-Tolerant System for Parallel Program Based on MPI

Cited by: 1
Abstract: To ensure that parallel programs run correctly in a parallel environment, system reliability must be improved by applying fault-tolerance techniques to parallel computing. This paper proposes a design method for a fault-tolerant system for MPI parallel programs. The design adopts checkpoint/rollback recovery and adds a failure-detection function, so it can effectively handle node failures and process failures and achieve fault tolerance within a certain range. It provides a usable application model for large-scale computation in an MPI environment.
Author: 李飞飞 (LI Fei-fei), Northeast Dianli University, Jilin 132012, China
Affiliation: Northeast Dianli University (东北电力大学)
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2011, No. 2, pp. 817-819 (3 pages)
Keywords: MPI parallel program; fault tolerance; checkpoint/rollback recovery; failure detection
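The checkpoint/rollback recovery idea named in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's design and does not use MPI: the function names (`save_checkpoint`, `load_checkpoint`, `run`, `run_with_recovery`), the JSON checkpoint file, and the `SimulatedFailure` exception standing in for a detected failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a detected process or node failure."""


def save_checkpoint(path, state):
    # Write to a temporary file, then rename it into place, so a crash
    # during the write cannot leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Fall back to the initial state when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum 1..n_steps, saving a checkpoint every `interval` steps."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        next_step = state["step"] + 1
        if fail_at is not None and next_step == fail_at:
            raise SimulatedFailure(f"failure detected before step {next_step}")
        state["total"] += next_step
        state["step"] = next_step
        if next_step % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


def run_with_recovery(path, n_steps, interval, fail_at):
    try:
        return run(path, n_steps, interval, fail_at)
    except SimulatedFailure:
        # Roll back: drop the in-memory state and resume from the
        # most recent checkpoint on disk.
        return run(path, n_steps, interval, fail_at=None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        print(run_with_recovery(ckpt, n_steps=10, interval=3, fail_at=8))
```

In this sketch, work done after the last checkpoint (step 7 in the example run) is lost on failure and recomputed after rollback. That is the usual trade-off between checkpoint frequency and recovery cost. A real MPI system would also have to coordinate checkpoints across processes, which this example does not attempt.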
