
An Optimistic Checkpoint Mechanism Based on Job Characteristics and Resource Availability for Dynamic Grids

Abstract: This paper proposes an optimistic checkpoint mechanism for dynamic grids (OCM4G) based on job characteristics and resource availability. Using knowledge of both, OCM4G determines whether to checkpoint a given job running on a given resource node and establishes optimal aperiodic checkpoint intervals. We evaluate OCM4G on a real grid environment (ChitlaGrid); the results show that OCM4G achieves better performance than periodic checkpointing and than the analytical method of calculating aperiodic checkpoint intervals.
Source: Wuhan University Journal of Natural Sciences (武汉大学学报(自然科学英文版)), CAS, 2011, No. 3, pp. 213-222 (10 pages)
Funding: Supported by the National Natural Science Foundation of China (90412010, 60603058, and 60673174) and by the Ministry of Education of China and Program for New Century Excellent Talents in University (NCET-07-0334)
Keywords: grid computing; fault tolerance; checkpoint; Markov

