Abstract
To meet the reliability requirements of grid computing, a general fault-tolerance framework is proposed that consists of two parts: fault detection and fault handling. The framework ensures reliability through hierarchically organized fault detection services and a general, policy-based fault-handling method. At the bottom layer, local fault detectors monitor the objects in their local area and send heartbeat messages to a middle-layer data collector. The middle-layer data collector sends a status list of the monitored objects to the top-layer data collector at a specific interval, and the top-layer data collector is managed by an index server. When a fault is detected, the system uses decision analysis to choose an appropriate fault-handling method, such as checkpointing, retrying, or replication. Performance evaluation shows that the framework is scalable and efficient and incurs low overhead.
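The layered detection and policy-based handling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the timeout rule for declaring an object failed, the job categories, and the policy table are all assumptions made for illustration. Only the three-layer flow (heartbeats to a middle collector, a status list forwarded to a top collector) and the three handling methods come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class MiddleCollector:
    """Middle layer: receives heartbeats forwarded by local fault detectors."""
    timeout: float  # assumed rule: silent for longer than this means "failed"
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, obj_id: str, now: float) -> None:
        # Record the time of the most recent heartbeat for this object.
        self.last_seen[obj_id] = now

    def status_list(self, now: float) -> dict:
        # Build the status list that is sent upward at each reporting interval.
        return {
            obj: "alive" if now - seen <= self.timeout else "failed"
            for obj, seen in self.last_seen.items()
        }


@dataclass
class TopCollector:
    """Top layer: merges status lists received from middle collectors."""
    statuses: dict = field(default_factory=dict)

    def receive(self, status_list: dict) -> None:
        self.statuses.update(status_list)

    def failed(self) -> list:
        return sorted(o for o, s in self.statuses.items() if s == "failed")


# Hypothetical policy table: the job categories are invented for this sketch;
# the handling methods are the three named in the abstract.
POLICY = {
    "long_running": "checkpointing",
    "short": "retrying",
    "critical": "replication",
}


def choose_handler(job_kind: str) -> str:
    # Fall back to retrying when no policy matches (an assumed default).
    return POLICY.get(job_kind, "retrying")


def demo():
    middle = MiddleCollector(timeout=5.0)
    top = TopCollector()
    middle.heartbeat("node-a", now=0.0)
    middle.heartbeat("node-b", now=0.0)
    middle.heartbeat("node-a", now=8.0)  # node-b stays silent after t=0
    top.receive(middle.status_list(now=10.0))
    return top.failed(), choose_handler("long_running")
```

Calling demo() reports node-b as failed, because its last heartbeat is older than the timeout, and selects checkpointing for a long-running job. A real deployment would replace the static table with whatever decision analysis the framework applies.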
Source
《华中科技大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2006, No. 7, pp. 42-45 (4 pages)
Journal of Huazhong University of Science and Technology(Natural Science Edition)
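Funding
Supported by the National Natural Science Foundation of China, major special project (90412010)
Supported by the China Education and Research Grid (ChinaGrid) program (CG2003-CG001)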
Keywords
fault detection
fault tolerance
policy-based fault handling