Abstract
Transient faults caused by cosmic-ray radiation have long been one of the main challenges for computing in space applications. With the continuous progress of integrated circuit manufacturing processes, the performance of modern processors has improved significantly, but their dependability is increasingly threatened by transient faults. Current techniques for transient fault tolerance can be broadly classified into two types: hardware-implemented and software-implemented. Compared with the former, the latter has attracted considerable attention because of its advantages in implementation cost and flexibility. This paper first outlines the basic principles of transient fault tolerance and the main characteristics of the corresponding software-implemented techniques. It then introduces and analyzes representative recent research results on software-implemented fault tolerance at different implementation levels. Finally, it summarizes the characteristics and open problems of current research and offers suggestions on future research directions for software-implemented fault tolerance.
Source
《计算机工程与科学》
CSCD
Peking University Core Journal (北大核心)
2011, Issue 11, pp. 132-139 (8 pages)
Computer Engineering & Science
Keywords
transient fault
soft error
software fault tolerance
redundancy computing
dependable computing