Funding: Project supported by the National "863" High-tech Program of China.
Abstract: Networks of workstations (NOW) have become one of the main trends in parallel computing. Because of the dynamic nature of a NOW, however, long-running scientific programs need effective fault tolerance. Checkpointing and rollback recovery is one solution to this problem. The paper first discusses the main problems of rollback recovery, then analyzes the different checkpointing techniques for NOW, and then describes the design and implementation of the ChaRM (checkpoint-based rollback recovery and process migration) system. A comparison of three coordinated checkpointing systems is also given.
Abstract: Fault tolerance is very important in cluster computing and has been implemented in many well-known cluster-computing systems using checkpoint/restart mechanisms. However, existing checkpointing algorithms cannot restore the state of a file system when a running program is rolled back, so existing fault-tolerance systems place many restrictions on file access. This paper presents the SCR algorithm, which is based on atomic operations and a consistent schedule and can restore the state of a file system. In the SCR algorithm, system calls on file systems are classified into idempotent and non-idempotent operations: a non-idempotent operation modifies the state of the file system, while an idempotent operation does not. The SCR algorithm tracks changes to the file system state. It logs to disk each non-idempotent operation issued by user programs, together with the information needed to restore the state that the operation changed. When the program is rolled back to a checkpoint, the SCR algorithm reverts the file system to its state at the time of the last checkpoint. With the SCR algorithm, users may use any file operation in their programs.
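The logging-and-revert idea described in this abstract can be illustrated with a minimal sketch. This is not the SCR algorithm itself, whose details the abstract does not give. The `UndoLog` class, its method names, and the `demo` function are hypothetical, and the sketch assumes whole-file pre-images kept in memory, whereas the abstract says the log is kept on disk. Operations that do not modify the file system (the abstract's "idempotent" class) are simply not logged.

```python
import os
import tempfile


class UndoLog:
    """Hypothetical undo log for state-modifying file operations (illustration only)."""

    def __init__(self):
        # Each record is (path, previous_bytes); previous_bytes is None
        # when the file did not exist before the operation.
        self.records = []

    def checkpoint(self):
        # After a new checkpoint, older undo records are no longer needed.
        self.records.clear()

    def _save_pre_image(self, path):
        # Record the state the operation is about to change.
        previous = None
        if os.path.exists(path):
            with open(path, "rb") as f:
                previous = f.read()
        self.records.append((path, previous))

    def write(self, path, data):
        # Modifies file system state, so log before applying.
        self._save_pre_image(path)
        with open(path, "wb") as f:
            f.write(data)

    def remove(self, path):
        # Also modifies file system state.
        self._save_pre_image(path)
        os.remove(path)

    def rollback(self):
        # Undo logged operations in reverse order, back to the last checkpoint.
        while self.records:
            path, previous = self.records.pop()
            if previous is None:
                if os.path.exists(path):
                    os.remove(path)
            else:
                with open(path, "wb") as f:
                    f.write(previous)


def demo():
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        log = UndoLog()
        log.write(a, b"v1")
        log.checkpoint()        # file system state at checkpoint: a.txt == b"v1"
        log.write(a, b"v2")     # overwrite an existing file
        log.write(b, b"new")    # create a new file
        log.remove(a)           # delete a file
        log.rollback()          # revert to the checkpointed state
        with open(a, "rb") as f:
            return f.read(), os.path.exists(b)
```

Calling `demo()` returns the content of `a.txt` and whether `b.txt` still exists after the rollback. Replaying the undo records in reverse order is what allows several changes to the same file to be unwound correctly.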