Abstract
In large-scale cluster environments, checkpointing and recovery are an indispensable fault-tolerance technique. This paper proposes a method that builds on the reliability mechanisms of the cluster communication system to capture the global state of the communication system without global synchronization, and uses that method to implement a non-blocking coordinated parallel checkpointing and recovery system that is transparent to applications. With support from the low-level communication system, the system reduces both the implementation complexity and the execution overhead of parallel checkpointing, which makes it suitable for large-scale cluster applications.
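The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes a reliable FIFO channel that numbers its messages and keeps a retransmission log, and it shows how the channel state could be derived from two independently taken local checkpoints instead of a global barrier. All names (`Sender`, `Receiver`, `in_flight`, `demo`) are hypothetical, and the sketch only covers the case where the sender's checkpoint already includes every message the receiver's checkpoint has delivered.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    sent: int = 0                                   # last sequence number used
    send_log: list = field(default_factory=list)    # (seq, payload) kept for retransmission

    def send(self, payload, wire):
        self.sent += 1
        msg = (self.sent, payload)
        self.send_log.append(msg)
        wire.append(msg)

    def checkpoint(self):
        # Purely local: no barrier and no channel flush.
        return {"sent": self.sent, "send_log": list(self.send_log)}


@dataclass
class Receiver:
    delivered: int = 0      # last sequence number delivered to the application
    total: int = 0          # toy application state: running sum of payloads

    def deliver(self, msg):
        seq, payload = msg
        if seq <= self.delivered:   # duplicate from a retransmission: ignore
            return
        self.delivered = seq
        self.total += payload

    def checkpoint(self):
        return {"delivered": self.delivered, "total": self.total}


def in_flight(sender_ckpt, receiver_ckpt):
    """Channel state implied by the two local checkpoints (assumed model)."""
    return [m for m in sender_ckpt["send_log"] if m[0] > receiver_ckpt["delivered"]]


def demo():
    wire, a, b = [], Sender(), Receiver()
    for payload in (10, 20, 30):
        a.send(payload, wire)
    b.deliver(wire.pop(0))          # B has seen seq 1
    ckpt_a = a.checkpoint()         # A checkpoints on its own schedule
    b.deliver(wire.pop(0))          # B has seen seq 2
    ckpt_b = b.checkpoint()         # B checkpoints later, still without a barrier
    b.deliver(wire.pop(0))          # seq 3 arrives after both checkpoints
    failure_free_total = b.total

    # Recovery: restore B, then replay what the checkpoints show as in flight.
    restored = Receiver(delivered=ckpt_b["delivered"], total=ckpt_b["total"])
    pending = in_flight(ckpt_a, ckpt_b)
    for msg in pending:
        restored.deliver(msg)
    return pending, restored.total, failure_free_total
```

Calling `demo()` replays the one message that was sent before the sender's checkpoint but delivered after the receiver's, and the restored receiver reaches the same state as the failure-free run. A real system would also have to handle messages recorded as delivered but not yet recorded as sent, which this sketch leaves out.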
Source
《计算机工程》 (Computer Engineering)
CAS
CSCD
PKU Core Journals (北大核心)
2007, No. 5, pp. 217-219 (3 pages)
Funding
Chinese Academy of Sciences research project on key technologies for next-generation clusters (KGCX2-SW-116)
Keywords
Cluster communication system
Parallel checkpointing
Fault tolerance