
PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs (Cited by: 1)

Abstract: GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerance mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.
Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, Issue 2, pp. 240-255 (16 pages)
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 60921062, 61003087, 61120106005 and 61170049
Keywords: GPGPU, partial recomputing, fault tolerance, CUDA, checkpointing
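
To make the abstract's central idea concrete, the following is a minimal sketch of checkpoint-based partial recomputation on CUDA: the host checkpoints a kernel's live-in data, runs an error detector after the kernel, and, if corruption is found, restores and re-executes only the affected block-sized region instead of re-running the whole kernel (as a full-recomputing scheme like FullRC would). The kernel, the buffer names, and the `detect_error` routine are illustrative assumptions; this is not the paper's PartialRC framework or its compiler support.

```cpp
// Minimal sketch of checkpoint-based partial recomputation on CUDA.
// The kernel, the detector, and the buffer names are illustrative
// assumptions, not the PartialRC framework described in the paper.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// Element-wise kernel: y[i] = a*x[i] + y[i]. Because y is overwritten,
// its pre-kernel contents must be checkpointed for recomputation.
__global__ void saxpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical detector: index of the first corrupted element, or -1 if none.
static int detect_error(const std::vector<float>& y) { (void)y; return -1; }

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f);

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Checkpoint: save the kernel's live-in state (y is read and overwritten).
    std::vector<float> y_ckpt(h_y);

    saxpy<<<blocks, threads>>>(d_x, d_y, 3.0f, n);
    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    int bad = detect_error(h_y);
    if (bad >= 0) {
        // Partial recomputation: restore only the affected block-sized region
        // from the checkpoint and relaunch one block over it, rather than
        // rolling the whole run back and re-executing every block (FullRC).
        int first = (bad / threads) * threads;
        int len = std::min(threads, n - first);
        cudaMemcpy(d_y + first, y_ckpt.data() + first, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        saxpy<<<1, threads>>>(d_x + first, d_y + first, 3.0f, len);
        cudaMemcpy(h_y.data() + first, d_y + first, len * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_x);
    cudaFree(d_y);
    printf("done: y[0] = %f\n", h_y[0]);
    return 0;
}
```

In this toy setting the kernel is element-wise, so any sub-range can be recomputed independently once its inputs are restored; the paper's contribution lies in performing this region-level rollback automatically, guided by the compiler and by the SIMT structure of the program.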
  • Related Literature

References (25)

  • 1Luebke D, Harris M, Kruger J, Purcell T, Govindaraju N, Buck I, Woolley C, Lefohn A. GPGPU: General-purpose computation on graphics hardware. In Proc. SIGGRAPH 2004 Course Notes, New York, NY, USA, Aug. 2004, p.33.
  • 2Owens J, Luebke D, Govindaraju N, Harris M, Kruger J, Lefohn A, Purcell T. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, Mar. 2007, 26(1): 80-113.
  • 3Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P. Brook for GPUs: Stream computing on graphics hardware. In Proc. ACM SIGGRAPH 2004 Papers, New York, NY, USA, Aug. 2004, pp.777-786.
  • 4AMD. Brook+. http://developer.amd.com/gpu_assets/AMDBrookplus.pdf.
  • 5NVIDIA Corporation. CUDA programming guide, 2008. http://www.nvidia.com/object/cuda_develop.html.
  • 6Lee S, Min S J, Eigenmann R. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. ACM SIGPLAN Notices, April 2009, 44(4): 101-110.
  • 7Top500 Supercomputer Site. http://www.top500.org/lists/2010/11.
  • 8Yim K S, Pham C, Saleheen M, Kalbarczyk Z, Iyer R. Hauberk: Lightweight silent data corruption error detectors for GPGPU. In Proc. the 25th Int. Parallel & Distributed Processing Symposium, Anchorage, USA, May 2011, pp.287-300.
  • 9Borucki L, Schindlbeck G, Slayman C. Comparison of accelerated DRAM soft error rates measured at component and system level. In Proc. the Int. Reliability Physics Symposium, Phoenix, USA, April 27-May 1, 2008, pp.482-487.
  • 10Schroeder B, Pinheiro E, Weber W D. DRAM errors in the wild: A large-scale field study. In Proc. the 11th International Joint Conf. Measurement and Modeling of Computer Systems, Seattle, USA, June 15-19, 2009, pp.193-204.

Co-cited References (18)

  • 1Cappello F. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities[J]. International Journal of High Performance Computing Applications, 2009, 23(3): 212-226.
  • 2Mancini L, Koutny M. Formal specification of N-modular redundancy[C]//Proceedings of the '86 ACM 14th Annual Conference on Computer Science, 1986: 199-204.
  • 3Lifflander J, Meneses E, Menon H, et al. Scalable replay with partial-order dependencies for message-logging fault tolerance[C]//Proceedings of IEEE International Conference on Cluster Computing, 2014: 19-28.
  • 4Yang X J, Du Y F, Wang P F, et al. The fault tolerant parallel algorithm: The parallel recomputing based failure recovery[C]//Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, 2007: 199-209.
  • 5Takizawa H, Sato K, Komatsu K, et al. CheCUDA: A checkpoint/restart tool for CUDA applications[C]//Proceedings of 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies (Workshop on Ultra Performance and Dependable Acceleration Systems), 2009: 408-413.
  • 6Kale L V, Krishnan S. CHARM++: A portable concurrent object oriented system based on C++[C]//Proceedings of the 8th Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, 1993: 91-108.
  • 7Acun B, Gupta A, Jain N, et al. Parallel programming with migratable objects: Charm++ in practice[C]//Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2014: 647-658.
  • 8Kale L V, Bhatele A. Parallel science and engineering applications[M]//The Charm++ Approach. [S.l.]: CRC Press, 2013.
  • 9Feng L L, Shu C W, Zhang M P. A hybrid cosmological hydrodynamic/N-body code based on a weighted essentially non-oscillatory scheme[J]. The Astrophysical Journal, 2004, 612(1): 1-13.
  • 10Shu C W. Essentially non-oscillatory and weighted essentially non-oscillatory schemes for hyperbolic conservation laws[M]//Advanced Numerical Approximation of Nonlinear Hyperbolic Equations. Berlin Heidelberg: Springer, 1998: 325-432.

Citing Literature (1)

Secondary Citing Literature (3)
