
LUNF: A Cluster Job Scheduling Strategy Using Characterization of Nodes' Failure
Cited by: 4
Abstract: Owing to their good scalability, cluster systems can reach the required computing power simply by adding nodes. As the number of nodes grows, however, node failures become commonplace, and their impact on system performance is an unavoidable problem in operating large-scale clusters. New ways are needed to accommodate the occurrence of node failure. Job scheduling is an important part of cluster operating system software and is responsible for efficient resource management and reasonable job scheduling. Functionally, a cluster job scheduling system can be divided into two parts: a job selection strategy and a node allocation strategy. Based on the characteristics of node failures in cluster systems, this paper proposes the longest uptime node first (LUNF) node allocation policy. Simulation results show that, compared with a random node allocation policy, LUNF reduces both the average job response time and the average job slowdown by about 10%.
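Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2005, No. 6, pp. 1000-1005 (6 pages)
Funding: National "863" High-Tech Research and Development Program of China, Major Special Project (2002AA104410); National "863" High-Tech Research and Development Program of China, Software Major Special Project (2002AA1Z2102)
Keywords: cluster system; node failure; job scheduling; node allocation; LUNF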
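
The abstract names the LUNF rule (longest uptime node first) and the random baseline it is compared against, but gives no implementation details. The following is a minimal sketch of the two node allocation policies, not the authors' code. The `Node` class, its `uptime_hours` and `busy` fields, and the function names are hypothetical and introduced only for illustration. Returning `None` when too few nodes are idle is also an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node record; field names are illustrative assumptions.
    name: str
    uptime_hours: float  # time since the node last came back up
    busy: bool = False


def allocate_lunf(nodes, count):
    """Pick `count` idle nodes, preferring those with the longest uptime.

    Returns the chosen nodes (marked busy), or None if fewer than
    `count` nodes are idle.
    """
    idle = [n for n in nodes if not n.busy]
    if len(idle) < count:
        return None
    # Sort idle nodes by uptime, longest first, and take the first `count`.
    chosen = sorted(idle, key=lambda n: n.uptime_hours, reverse=True)[:count]
    for n in chosen:
        n.busy = True
    return chosen


def allocate_random(nodes, count, rng=random):
    """Baseline policy: pick `count` idle nodes uniformly at random."""
    idle = [n for n in nodes if not n.busy]
    if len(idle) < count:
        return None
    chosen = rng.sample(idle, count)
    for n in chosen:
        n.busy = True
    return chosen


if __name__ == "__main__":
    cluster = [
        Node("n1", 10.0),
        Node("n2", 500.0),
        Node("n3", 120.0),
        Node("n4", 30.0, busy=True),
    ]
    picked = allocate_lunf(cluster, 2)
    print([n.name for n in picked])
```

In a scheduler split along the lines the abstract describes, the job selection strategy would decide which queued job runs next, and a function like `allocate_lunf` would then decide which idle nodes that job receives.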

