[5] P Schwan. Lustre: Building a file system for 1,000-node clusters[C]. Ottawa Linux Symposium, Ottawa, Canada, 2003
[6] E N Elnozahy, L Alvisi, Y M Wang, et al. A survey of rollback-recovery protocols in message passing systems[R]. School of Computer Science, Carnegie Mellon University, Tech Rep: CMU-CS-99-148, 1999
[7] M Schulz, G Bronevetsky, R Fernandes, et al. Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs[C]. ACM/IEEE SC 2004 Conference (SC'04), Pittsburgh, PA, 2004
[8] J Duell, P Hargrove, E Roman. The design and implementation of Berkeley Lab's Linux checkpoint/restart[R]. Berkeley Lab, Tech Rep: LBNL-54941, 2002
[9] K M Chandy, L Lamport. Distributed snapshots: Determining global states of distributed systems[J]. ACM Trans on Computer Systems, 1985, 3(1): 63-75
[10] A Yoo, M Jette, M Grondona. SLURM: Simple Linux utility for resource management[G]. In: Job Scheduling Strategies for Parallel Processing, LNCS 2862. Berlin: Springer, 2003. 44-60