Analysis of Fault Characteristics and Fault Tolerance Mechanisms in Large-Scale Computing Systems (cited by: 3)

Survey on the Dependability and the Fault Tolerance Mechanism for Large Scale Computing Systems


Abstract: This paper surveys the operational stability of several large-scale computing systems in China and abroad. First, using published fault data from several representative systems, it summarizes the main fault modes and fault characteristics of current production systems. Second, building on a survey of system-level fault tolerance research, it discusses the fault tolerance challenges facing future, larger-scale computing systems and possible solutions.

Authors: Wu Linping (武林平), Luo Hongbing (罗红兵), Liu Yongpeng (刘勇鹏)

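Affiliations: Institute of Applied Physics and Computational Mathematics, Beijing; School of Computer Science, National University of Defense Technology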

Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2009, Issue A01, pp. 237-240 (4 pages)

Funding: National Natural Science Foundation of China (grant 60803045)

Keywords: large-scale computing system; fault; fault tolerance; checkpoint restart

Classification: TP302.8 [Automation and Computer Technology: Computer System Architecture]

References (17)

1. Schroeder B, Gibson G. A Large-Scale Study of Failures in High-Performance Computing Systems [C]//Proc of the 2006 Int'l Conf on Dependable Systems and Networks, 2006: 249-258.
2. Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS) [EB/OL]. [2009-07-01]. http://www.cs.sandia.gov/~jrstear/ras/.
3. Glosli J N, Richards D F, Caspersen K J, et al. Extending Stability Beyond CPU Millennium: A Micron-Scale Atomistic Simulation of Kelvin-Helmholtz Instability [C]//Proc of SC'07, 2007.
4. Los Alamos National Laboratory Computer Science Research HPC-5 [EB/OL]. [2009-07-01]. http://institute.lanl.gov/data/fdata/.
5. Huang Yongqin, Jin Lifeng, Liu Yao. Current Status and Trends of Reliability Technology for High-Performance Computers [C]//Proc of HPC China 2008, 2008: 1-6. (in Chinese)
6. Avizienis A, Laprie J C, Randell B. Fundamental Concepts of Dependability [R]. Research Report No. 1145, LAAS-CNRS, 2001.
7. Young J W. A First Order Approximation to the Optimum Checkpoint Interval [J]. Comm ACM, 1974, 17(9): 530-531.
8. Gibson G, Schroeder B, Digney J. Failure Tolerance in Petascale Computers [C]. CTWatch Quarterly, 2007, 3(4).
9. Schroeder B, Gibson G. Understanding Failures in Petascale Computers [C]//Proc of SciDAC'07, 2007: 012-022.
10. Oliner A J, Stearley J. What Supercomputers Say: A Study of Five System Logs [C]//Proc of DSN'07, 2007: 575-584.


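Citing literature (3)

1. Wei Yong, Xing Li, Wu Linping, Luo Hongbing. An Automated Management Method for Improving Cluster System Stability [J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2011, 39(S1): 144-147. Cited by: 1
2. Wu Linping, Luo Hongbing, Ai Zhiwei, Shen Yue. A Proactive Fault Management Method for Large-Scale Computing Systems [J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2010, 38(S1): 20-24. Cited by: 5
3. Qu Lan, Wei Li, Zhao Zhiqiang, Ye Yong. Comparison of the Stability of Three Pharmacy Management Software Packages under Simulated Peacekeeping Wartime Conditions [J]. Pharmaceutical Journal of Chinese People's Liberation Army, 2011, 27(1): 88-89. Cited by: 2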


