
A Survey of the Fault-Tolerance Techniques for Large-Scale Parallel Computing Systems (大规模并行计算机系统硬件故障容错技术综述)

Cited by: 6
Abstract: Fault tolerance is an issue that computer systems cannot ignore. In recent years, as architectures have grown more complex, semiconductor manufacturing processes have advanced, line widths have shrunk, and integration density has risen, power consumption and reliability have become prominent problems at every scale, from desktop systems and distributed computing environments to large-scale parallel computer systems. This paper first introduces the basic concepts, methods, and ideas of computer system reliability and fault tolerance. It then reviews representative hardware fault detection and hardware fault recovery techniques proposed in recent years, with emphasis on the fault-tolerance methods designed for large-scale parallel computer systems. It also presents an optimized fault recovery technique proposed in the authors' earlier work, called the fault-tolerant parallel algorithm. Finally, it summarizes several possible research directions.
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2010, No. 10, pp. 38-43, 53 (7 pages)
Funding: National Natural Science Foundation of China (grants 60621003, 60633050)
Keywords: large-scale parallel computing; fault-tolerance technique; reliability

