Abstract
In parallel and distributed computing environments, the probability of system faults rises sharply as system scale grows. To improve the reliability and availability of cluster systems, a high-availability management scheme that enhances the availability of the job service on head nodes is implemented, based on the principle of the symmetric Active/Active high-availability model and a group communication facility. For the computing nodes, the characteristics of the parallel computing environment are taken into account, and the LAM/Migration checkpoint migration technique is used to implement fault self-detection and task self-recovery in the cluster system.
Source
Ship Electronic Engineering (《舰船电子工程》)
2010, No. 3, pp. 23-26 (4 pages)
Keywords
parallel computing
high availability
checkpoint
process migration