
Evaluation of Resource Management Methods for Large High Energy Physics Computer Cluster

Cited by: 6
Abstract: High energy physics data consist of physics events that are mutually uncorrelated. A high energy physics computing task can therefore be parallelized by running many jobs that process many different data files at the same time, which makes high energy physics computing a typical high-throughput computing scenario. The computing cluster at the Institute of High Energy Physics (IHEP) uses the open-source TORQUE/Maui for resource management and job scheduling. Fairness is enforced by partitioning the cluster's resources into separate queues and by limiting the maximum number of running jobs per user; however, this also leaves the overall resource utilization of the cluster very low. SLURM and HTCondor are both open-source resource management systems that have become popular in recent years: SLURM offers a rich set of job scheduling policies, while HTCondor is well suited to high-throughput computing. Both can replace the aging, poorly maintained TORQUE/Maui and are feasible options for managing cluster resources. In this paper, the job submission behavior of Daya Bay experiment users was simulated on SLURM and HTCondor test clusters to test the resource allocation behavior and efficiency of the two systems. The results were then compared with the actual scheduling results of the same jobs on the IHEP TORQUE/Maui cluster. Finally, the strengths and weaknesses of SLURM and HTCondor were analyzed, and the feasibility of using either system to manage the IHEP computing cluster was discussed.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2017, No. 10, pp. 85-90 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (11475210)
Keywords: resource management system; job scheduler; computer cluster; high throughput computing; high energy physics computing
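
The abstract notes that capping each user's running jobs protects fairness but can leave the cluster underused. The Python sketch below is a toy illustration of that effect only; it is not the paper's test method and does not model TORQUE/Maui, SLURM, or HTCondor. The function name `simulate`, the user names, and all numbers are hypothetical, and every job is assumed to occupy one slot for exactly one scheduling tick.

```python
def simulate(pending, slots, per_user_cap=None):
    """Toy round-robin scheduler (hypothetical, for illustration only).

    pending:      dict mapping user -> number of queued one-tick jobs
    slots:        number of job slots available in every tick
    per_user_cap: optional limit on jobs one user may run per tick
    Returns (ticks_needed, utilization), where utilization is the
    fraction of slot-ticks that actually ran a job.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    if per_user_cap is not None and per_user_cap < 1:
        raise ValueError("per_user_cap must be at least 1")

    pending = dict(pending)  # work on a copy, leave the caller's dict intact
    total_jobs = sum(pending.values())
    ticks = 0

    while any(pending.values()):
        free = slots
        running = {user: 0 for user in pending}
        progressed = True
        # Hand out slots one at a time, cycling over users, until the
        # slots are exhausted or nobody is allowed to start another job.
        while free > 0 and progressed:
            progressed = False
            for user in pending:
                if free == 0:
                    break
                capped = (per_user_cap is not None
                          and running[user] >= per_user_cap)
                if pending[user] > 0 and not capped:
                    pending[user] -= 1
                    running[user] += 1
                    free -= 1
                    progressed = True
        ticks += 1

    utilization = total_jobs / (slots * ticks) if ticks else 0.0
    return ticks, utilization


if __name__ == "__main__":
    workload = {"alice": 90, "bob": 10}  # hypothetical users and job counts
    print("no cap:", simulate(workload, slots=20))
    print("cap=5: ", simulate(workload, slots=20, per_user_cap=5))
```

In this made-up workload the uncapped run keeps all 20 slots busy, while the capped run leaves slots idle whenever only one user still has queued jobs. That is the trade-off between fairness and utilization the abstract describes.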
