Abstract
Computing environments exemplified by clusters of workstations are a current research focus in distributed systems and parallel computing. The message-passing mechanism of PVM supports efficient heterogeneous network computing, but standard PVM lacks support for fault tolerance, a gap that can be filled by rollback recovery based on checkpointing. This paper analyzes the design principles and implementation techniques for adding global fault tolerance to PVM at user level. The main idea is to use an asynchronous checkpointing algorithm with message logging, controlled through the PVM daemon processes and a global scheduler process; all operations are transparent to the application. The system can also serve as a basis for transparent process migration and load balancing in PVM.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
1999, No. 11, pp. 34-37 (4 pages)
Funding
National Natural Science Foundation of China
Keywords
asynchronous checkpointing
fault tolerance
cluster of workstations
PVM
software system
checkpointing, asynchronous checkpointing, message logging, task id mapping