Abstract
Coordinated checkpointing with rollback recovery is a simple and effective fault-tolerance technique that is widely used in parallel and distributed systems. To further reduce the overhead of coordinated checkpointing, this paper presents a non-blocking coordinated checkpointing algorithm based on reconstructible checkpoints. Rollback recovery triggered by a failure in a parallel program is far less likely than checkpoint setting. The algorithm exploits this by moving part of the checkpoint-setting overhead to the rollback recovery phase, which lowers the overall cost of fault tolerance and improves system scalability.
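The trade-off described in the abstract can be illustrated with a back-of-envelope cost model. The abstract gives no cost model or measurements, so the sketch below is only an assumption: the function names, the parameters, and every number are hypothetical and are not taken from the paper. It shows only why deferring work to recovery can pay off when checkpoints far outnumber rollbacks.

```python
def total_overhead(n_checkpoints, n_rollbacks, checkpoint_cost, rollback_cost):
    """Total fault-tolerance overhead of a run, in arbitrary time units."""
    return n_checkpoints * checkpoint_cost + n_rollbacks * rollback_cost


def shifted_overhead(n_checkpoints, n_rollbacks, checkpoint_cost, rollback_cost,
                     deferred_per_checkpoint, rebuild_per_rollback):
    """Overhead when part of each checkpoint's work is deferred to recovery.

    deferred_per_checkpoint: work removed from every checkpoint (hypothetical).
    rebuild_per_rollback: extra work each rollback spends rebuilding the
    checkpoint before it can restore state (hypothetical).
    """
    return total_overhead(
        n_checkpoints,
        n_rollbacks,
        checkpoint_cost - deferred_per_checkpoint,
        rollback_cost + rebuild_per_rollback,
    )


# Hypothetical workload: 100 checkpoints, 2 rollbacks.
baseline = total_overhead(100, 2, checkpoint_cost=10.0, rollback_cost=50.0)
shifted = shifted_overhead(100, 2, checkpoint_cost=10.0, rollback_cost=50.0,
                           deferred_per_checkpoint=3.0,
                           rebuild_per_rollback=40.0)

print(f"baseline overhead: {baseline}")
print(f"shifted overhead:  {shifted}")
```

Under these assumed numbers the deferred scheme wins because the saving is paid on every checkpoint while the rebuild penalty is paid only on the rare rollback. In this simplified model the shift helps whenever n_checkpoints * deferred_per_checkpoint exceeds n_rollbacks * rebuild_per_rollback.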
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2007, Issue 24, pp. 66-68 (3 pages)
Computer Engineering
Keywords
checkpoint
fault-tolerance
rollback recovery
non-blocking