LUNF: A Cluster Job Scheduling Strategy Using Characterization of Nodes' Failure (cited by: 4)

Abstract: Good scalability lets a cluster system reach the required computing capacity simply by adding nodes. As the number of nodes grows, however, the impact of node failures on system performance becomes an unavoidable problem for large-scale clusters. Job scheduling is an important part of cluster operating system software, responsible for efficient resource management and sensible job scheduling; functionally, it divides into a job selection policy and a node allocation policy. Drawing on the characteristics of node failures in cluster systems, this paper proposes the longest uptime node first (LUNF) node allocation policy. Simulation results show that, compared with random node allocation, LUNF reduces the average job response time and the average job slowdown by about 10%.
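The node-selection rule named in the abstract can be illustrated with a minimal sketch. It assumes the scheduler knows each idle node's current uptime. The `Node` type, the function names, and the sample values are hypothetical and not taken from the paper. The random allocator stands in for the baseline the abstract compares against.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str        # hypothetical node identifier
    uptime_s: float  # seconds since the node last came up (assumed known)


def allocate_lunf(idle_nodes, count):
    """Pick `count` idle nodes, longest uptime first (LUNF)."""
    if count > len(idle_nodes):
        return None  # not enough idle nodes; the job has to wait
    ranked = sorted(idle_nodes, key=lambda n: n.uptime_s, reverse=True)
    return ranked[:count]


def allocate_random(idle_nodes, count, rng=random):
    """Baseline: pick `count` idle nodes uniformly at random."""
    if count > len(idle_nodes):
        return None
    return rng.sample(idle_nodes, count)


idle = [
    Node("n1", 3600.0),
    Node("n2", 86400.0),
    Node("n3", 120.0),
    Node("n4", 7200.0),
]
print([n.name for n in allocate_lunf(idle, 2)])  # ['n2', 'n4']
```

The sketch covers only node allocation. Job selection, the other half of the scheduler described in the abstract, is left out, as is any model of when nodes fail.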

Authors: Wu Linping (武林平), Meng Dan (孟丹), Liang Yi (梁毅), Tu Bibo (涂碧波), Wang Lei (王磊)

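Affiliation: National Research Center for Intelligent Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences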

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2005, No. 6, pp. 1000-1005 (6 pages)

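Funding: National High Technology Research and Development Program of China (863 Program), Major Special Project (2002AA104410); National High Technology Research and Development Program of China (863 Program), Software Major Special Project (2002AA1Z2102)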

Keywords: cluster system; node failure; job scheduling; node allocation; LUNF

Classification: TP302 [Automation and Computer Technology: Computer System Architecture]
