Survey on the Dependability and the Fault Tolerance Mechanism for Large Scale Computing Systems (大规模计算系统故障特征及容错机制分析)
Cited by: 3

Abstract: This paper surveys the operational stability of several large-scale computing systems in China and abroad. First, using fault data from several representative systems, it analyzes the stability of current production systems in terms of fault modes and fault characteristics. Then, building on a summary of current system-level fault tolerance research, it analyzes the challenges that fault tolerance mechanisms face in future, larger-scale computing systems and the possible solutions.
Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2009, Issue A01, pp. 237-240 (4 pages)
Funding: National Natural Science Foundation of China (Grant No. 60803045)
Keywords: large-scale computing system; fault; fault tolerance; checkpoint/restart
