An Improved Algorithm of Synchronous Checkpointing Method

Cited by: 3
Abstract: Checkpointing and rollback recovery are important means of fault-tolerant computing in cluster systems, and synchronous checkpointing is widely used there. To improve the efficiency of cluster computing systems and lower the overhead of fault tolerance, this paper optimizes the message-flushing-based Sync-and-Stop algorithm. Drawing on the properties of message-flushing-based synchronous checkpointing algorithms and on the communication characteristics of parallel applications in practice, the optimization shortens the blocking time during coordination and reduces the number of control messages in the system. The improved algorithm effectively lowers the time and space overhead of checkpointing and reduces the cost of checkpointing in deployed systems, and test results indicate that it further improves system scalability and application reliability.
Authors: Tian Tian (田甜), Zhu Yongzhi (祝永志)
Source: Computer Technology and Development (《计算机技术与发展》), 2009, No. 8, pp. 124-126 (3 pages)
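Funding: Shandong Province Higher Education Experimental Research Project Fund (2005-400); Qufu Normal University University-Level Research Project (XJ0734)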
Keywords: checkpoint; synchronous; message flushing (given in the source's English keywords as "driving information")