Abstract
Checkpointing with rollback recovery is an important fault-tolerance technique for cluster systems, and synchronous checkpointing is widely used in them. To improve the efficiency of cluster computing systems and lower the overhead of fault tolerance, this paper optimizes the Sync-and-Stop algorithm based on message driving. Drawing on the properties of message-driving-based synchronous checkpointing and on the communication characteristics of parallel applications in practice, the modified cooperative checkpointing scheme shortens the blocking time of the coordination phase and reduces the number of control messages in the system. The improved algorithm effectively lowers the time and space overhead of checkpointing and reduces its cost in deployed systems. Test results indicate that it further improves system scalability and application reliability.
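For orientation, the blocking and control-message costs that the abstract refers to can be seen in a minimal single-threaded sketch of a generic Sync-and-Stop style coordinated checkpoint. This is not the paper's optimized algorithm, whose details the abstract does not give. The names `Process`, `sync_and_stop`, and the three-phase message count (stop, done, continue) are assumptions made for illustration only.

```python
from collections import deque


class Process:
    """Hypothetical process model: state is the sum of received payloads."""

    def __init__(self, pid):
        self.pid = pid
        self.state = 0
        self.inbox = deque()   # messages in transit to this process
        self.sent = 0
        self.received = 0
        self.blocked = False   # True while application sends are suspended
        self.checkpoint = None

    def send(self, other, payload):
        if self.blocked:
            raise RuntimeError("sends are blocked during checkpointing")
        other.inbox.append(payload)
        self.sent += 1

    def drain(self):
        # Deliver every in-transit message so none crosses the checkpoint line.
        while self.inbox:
            self.state += self.inbox.popleft()
            self.received += 1


def sync_and_stop(processes):
    """Take one coordinated checkpoint; return the number of control messages."""
    control_messages = 0

    # Phase 1: the coordinator tells every process to stop sending.
    for p in processes:
        p.blocked = True
        control_messages += 1

    # Phase 2: drive out in-transit messages until the channels are empty.
    for p in processes:
        p.drain()
    if sum(p.sent for p in processes) != sum(p.received for p in processes):
        raise RuntimeError("channels not empty; checkpoint would be inconsistent")

    # Phase 3: each process saves its state and reports "done".
    for p in processes:
        p.checkpoint = p.state
        control_messages += 1

    # Phase 4: the coordinator broadcasts "continue"; sends are unblocked.
    for p in processes:
        p.blocked = False
        control_messages += 1

    return control_messages
```

In this simplified model every process stays blocked from phase 1 through phase 4, and one checkpoint costs three control messages per process. These are the two quantities the abstract says the modified algorithm reduces.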
Source
Computer Technology and Development (《计算机技术与发展》)
2009, No. 8, pp. 124-126 (3 pages)
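Funding
Shandong Province Higher Education Experimental Research Project Fund (2005-400)
Qufu Normal University University-Level Research Project (XJ0734)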
Keywords
checkpoint
synchronous
message driving (driving information)