
Fault tolerance in real-time multitask parallel computing systems (cited by: 5)
Abstract Fault tolerance is a key difficulty that must be solved in the design of real-time multitask parallel computing systems. To meet the requirements such systems place on reliability and efficiency, the paper introduces the basic concepts, methods, and ideas of computer system reliability and fault tolerance. Building on checkpointing and rollback recovery, it proposes a multi-level, multi-angle design for fault-tolerant parallel computing systems, together with schemes for handling in-transit messages and orphan messages. It gives the corresponding model and a technical evaluation, and simulation experiments demonstrate the effectiveness of the model.
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2013, No. 9, pp. 33-36, 101 (5 pages in total)
Keywords real-time multitask; fault tolerance; checkpoint; multi-level
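
The abstract names in-transit messages and orphan messages without defining them. As a rough illustration only, the sketch below uses the definitions common in the rollback-recovery literature: relative to a set of per-process checkpoints, a message is in-transit if its send is recorded but its receipt is not, and an orphan if its receipt is recorded but its send is not. The `Message` type, the `classify` function, and the use of per-process local event counters are all assumptions made for this example; they are not taken from the paper's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: int  # sender's local event counter when the message was sent
    recv_time: int  # receiver's local event counter when it was delivered


def classify(msg: Message, checkpoint: dict) -> str:
    """Classify a message against a set of checkpoints.

    `checkpoint` maps each process name to the local event counter at which
    that process took its checkpoint (a hypothetical representation).
    """
    sent_before = msg.send_time <= checkpoint[msg.sender]
    received_before = msg.recv_time <= checkpoint[msg.receiver]
    if sent_before and not received_before:
        # Send is saved, receipt is lost after rollback: must be replayed.
        return "in-transit"
    if received_before and not sent_before:
        # Receipt is saved, send is undone by rollback: inconsistent state.
        return "orphan"
    return "consistent"


if __name__ == "__main__":
    cut = {"P": 5, "Q": 5}
    for m in (Message("P", "Q", 3, 8), Message("P", "Q", 7, 4)):
        print(m, classify(m, cut))
```

Under these definitions, a recovery scheme typically has to log and replay in-transit messages, and has to avoid or roll back past orphan messages so that the restored global state is consistent.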









