An Optimistic Checkpoint Mechanism Based on Job Characteristics and Resource Availability for Dynamic Grids


Abstract: This paper proposes an optimistic checkpoint mechanism for dynamic grids (OCM4G) based on job characteristics and resource availability. Using knowledge of both, OCM4G determines whether to checkpoint a given job running on a given resource node and establishes optimal aperiodic checkpoint intervals. We evaluate OCM4G in a real grid environment (ChinaGrid), and the results show that it outperforms both periodic checkpointing and the analytical method of calculating aperiodic checkpoint intervals.

Authors: TAO Yongcai, JIN Hai, WU Song

Affiliations: School of Information Engineering; Services Computing Technology and System Lab

Source: Wuhan University Journal of Natural Sciences (武汉大学学报（自然科学英文版）), CAS, 2011, Issue 3, pp. 213-222 (10 pages)

Funding: Supported by the National Natural Science Foundation of China (90412010, 60603058, and 60673174), and by the Ministry of Education of China and the Program for New Century Excellent Talents in University (NCET-07-0334)

Keywords: grid computing; fault tolerance; checkpoint; Markov

Classification: TP302.1 [Automation and Computer Technology: Computer System Architecture]



