An Index-Based Quasi-Synchronous Checkpointing Protocol
Abstract: In index-based distributed checkpointing algorithms, which provide rollback recovery for fault tolerance in distributed systems, it is important to keep the number of checkpoints, and of forced checkpoints in particular, as small as possible while consistent global checkpoints still exist. Building on existing index-based checkpointing algorithms, this paper presents a new checkpointing protocol. It not only reduces the number of forced checkpoints but also keeps the checkpoints of the different processes synchronized in time, so that rollback after a failure does not incur excessive overhead or discard too much useful computation. Simulation results show that the proposed protocol reduces the number of forced checkpoints induced per message by 23.2% on average compared with traditional methods.
Source: Chinese Journal of Computers (《计算机学报》), 2005, No. 10, pp. 1620-1625 (6 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: Supported by the National Natural Science Foundation of China (grants 60473031 and 60273070).
Keywords: distributed systems; checkpoint; domino effect; index; active synchronization
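
For orientation, the baseline rule that index-based protocols of this family build on can be sketched as follows: each process keeps a checkpoint index, piggybacks it on every message, and takes a forced checkpoint before delivering a message whose index is larger than its own. This is only a minimal sketch of that classic sequence-number scheme, not the protocol proposed in the paper, whose details the abstract does not give. The class name `Process` and all of its fields are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process following the classic index-based rule (illustrative only)."""
    pid: int
    sn: int = 0        # index of the latest local checkpoint
    basic: int = 0     # checkpoints taken on the local schedule
    forced: int = 0    # checkpoints induced by incoming messages
    log: list = field(default_factory=list)

    def _checkpoint(self, kind: str) -> None:
        # Stand-in for saving process state to stable storage.
        self.log.append((kind, self.sn))

    def basic_checkpoint(self) -> None:
        """Local timer fired: advance the index and save state."""
        self.sn += 1
        self.basic += 1
        self._checkpoint("basic")

    def send(self) -> int:
        """Return the index piggybacked on an outgoing message."""
        return self.sn

    def receive(self, msg_sn: int) -> None:
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if msg_sn > self.sn:
            self.sn = msg_sn
            self.forced += 1
            self._checkpoint("forced")
        # The message would be delivered to the application here.


p0, p1 = Process(0), Process(1)
p0.basic_checkpoint()      # p0's index becomes 1
p1.receive(p0.send())      # p1 is behind, so it takes a forced checkpoint
p1.receive(p0.send())      # same index again: no new checkpoint
print(p1.sn, p1.forced)    # 1 1
```

Under this rule, checkpoints that share an index form a consistent global checkpoint, and the `forced` counter corresponds to the quantity whose per-message average the abstract reports as reduced.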