Abstract
As parallel computer systems grow in scale, their failure rate increases linearly. Ensuring that a large-scale parallel system delivers continuous service, that is, achieving high availability, has therefore become a primary design requirement, and a variety of fault-tolerance and high-availability technologies have been applied to meet it. The concept of system-level fault tolerance has recently been proposed: the fault-tolerant model is designed from the viewpoint of the whole system, and fault-tolerance technologies at different levels are integrated to improve the fault-tolerance capability of the system as a whole. How to measure the availability that such a system can attain, however, still requires in-depth study. In this paper, we model and analyze the reliability and availability of a parallel system using a compositional model and a Markov process model, and derive availability expressions from the Markov process model. The analysis shows that high-availability technologies can increase system availability. On this basis, we propose a high-availability architecture for large-scale parallel computer systems and show how it achieves high availability.
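The two quantitative claims in the abstract (linear growth of the failure rate with system size, and availability gains from high-availability technology) can be illustrated with a minimal sketch. It assumes the textbook two-state (up/down) Markov model with constant failure rate λ and repair rate μ, whose steady-state availability is μ / (λ + μ), and a series system of identical, independent nodes. These are not the expressions derived in the paper, and the function names and all numeric values are hypothetical.

```python
def series_failure_rate(node_failure_rate: float, nodes: int) -> float:
    """Failure rate of a series system of identical, independent nodes.

    Assumption: any single node failure brings the system down, so the
    system failure rate grows linearly with the number of nodes.
    """
    return node_failure_rate * nodes


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state (up/down) Markov model."""
    return repair_rate / (failure_rate + repair_rate)


if __name__ == "__main__":
    node_lambda = 1e-4   # hypothetical failures per node per hour
    slow_repair = 0.5    # hypothetical repairs per hour (manual recovery)
    fast_repair = 5.0    # hypothetical repairs per hour (automated failover)

    for n in (1, 100, 1000):
        lam = series_failure_rate(node_lambda, n)
        print(
            n,
            lam,
            steady_state_availability(lam, slow_repair),
            steady_state_availability(lam, fast_repair),
        )
```

In this simplified model, a high-availability mechanism that shortens recovery time appears as a larger repair rate, which raises availability at any fixed system size.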
Source
《计算机工程与科学》
CSCD
2005, Issue 5, pp. 104-107, 110 (5 pages in total)
Computer Engineering & Science
Funding
Supported by the National Science Fund for Distinguished Young Scholars of China (Grant No. 60025206)
Keywords
parallel computer
high availability analysis
design
reliability
availability
compositional model
Markov process
system-level fault tolerance