Cache-style Parallel Checkpointing for Large-scale Computing System
Cited by: 1
Abstract Checkpointing is an important fault-tolerance mechanism in high-performance parallel computing systems, but as system scale grows, the scalability of parallel checkpointing is limited by file-access overhead. Targeting the multi-level file system architecture of large-scale parallel computing systems, this paper proposes a cache-style parallel checkpointing technique. It converts global synchronous (coordinated) parallel checkpointing into local file operations, and uses the multiprocessor structure to schedule checkpoint write-back as an out-of-order pipeline, so that write-back times are spread out sensibly. This effectively hides the write-back overhead and preserves the high performance and scalability of parallel checkpoint file access.
Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2011, No. 5, pp. 287-289, F0003 (4 pages)
Funding Supported by the Open Fund of the State Key Laboratory of High-end Server & Storage Technology (2009HSSA04)
Keywords Cache-style checkpointing; Parallel computing; Multi-level file system; Multi-processor; Out-of-order pipeline
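
The write-back scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `CacheStyleCheckpointer`, its methods, and the file naming are hypothetical, two ordinary directories stand in for the local and global levels of a multi-level file system, and background threads stand in for spare processors. The sketch only shows the general idea that a checkpoint call returns after a local write, while flushing to the shared level happens later and may complete in any order.

```python
import os
import queue
import shutil
import tempfile
import threading


class CacheStyleCheckpointer:
    """Hypothetical sketch: checkpoint locally, flush to global storage later."""

    def __init__(self, local_dir, global_dir, flush_workers=2):
        self.local_dir = local_dir      # assumed fast, node-local storage
        self.global_dir = global_dir    # assumed slow, shared storage
        self._pending = queue.Queue()   # checkpoints waiting for write-back
        # Several workers drain the queue, so flushes may finish out of order.
        self._workers = [
            threading.Thread(target=self._flush_loop, daemon=True)
            for _ in range(flush_workers)
        ]
        for worker in self._workers:
            worker.start()

    def checkpoint(self, rank, step, data):
        """Write the checkpoint locally and return without waiting for the flush."""
        name = f"ckpt_rank{rank}_step{step}.bin"
        local_path = os.path.join(self.local_dir, name)
        with open(local_path, "wb") as f:
            f.write(data)
        self._pending.put(name)
        return local_path

    def _flush_loop(self):
        """Copy queued checkpoints to global storage until a None sentinel arrives."""
        while True:
            name = self._pending.get()
            try:
                if name is None:
                    return
                src = os.path.join(self.local_dir, name)
                dst = os.path.join(self.global_dir, name)
                tmp = dst + ".tmp"
                shutil.copyfile(src, tmp)
                os.replace(tmp, dst)    # publish the finished copy atomically
            finally:
                self._pending.task_done()

    def close(self):
        """Wait for all pending flushes, then stop the workers."""
        self._pending.join()
        for _ in self._workers:
            self._pending.put(None)
        for worker in self._workers:
            worker.join()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as shared:
        ckpt = CacheStyleCheckpointer(local, shared)
        for step in range(3):
            ckpt.checkpoint(rank=0, step=step, data=bytes([step]) * 16)
        ckpt.close()
        print(sorted(os.listdir(shared)))
```

In this sketch the application only pays for the local write inside `checkpoint`; the copy to the shared directory overlaps with whatever the caller does next, which is the sense in which the write-back cost is hidden.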