Abstract
This paper surveys the operational stability of several large-scale computing systems in China and abroad. First, drawing on publicly available fault data from several representative systems, it analyzes the stability of current production systems in terms of fault modes and fault characteristics. Second, after summarizing current approaches to system-level fault tolerance research, it analyzes the challenges facing fault-tolerance mechanisms in future, even larger-scale computing systems, along with possible solutions.
Source
《计算机工程与科学》
CSCD
Peking University Core Journals (北大核心)
2009, Issue A01, pp. 237-240 (4 pages)
Computer Engineering & Science
Funding
National Natural Science Foundation of China (Grant No. 60803045)
Keywords
Large-scale computing system
Fault
Fault tolerance
Checkpoint/restart