Parallel Job Startup and Its Scalability Analysis (cited by: 1)

Experiences and Scalability Analysis of Parallel Job Startup
Abstract: As the scale of HPC systems and parallel applications keeps increasing, the startup time of large-scale parallel jobs can no longer be ignored. Various efforts have been made to improve the performance of program launching and runtime environment initialization. This paper presents the experience and results of starting MPI jobs on the Tianhe-1A supercomputer system. A detailed study of the time costs of the different stages of job startup, including control message transfer, file access, and MPI environment initialization, shows that for large-scale MPI jobs the environment initialization time dominates the job startup time. Based on this finding, some preliminary optimizations were made to reduce the amount of data exchanged during MPI environment initialization and to avoid unnecessary data transfers; these optimizations notably improve job startup performance. A hierarchical, scalable process management structure for MPI environment initialization is then proposed to further improve the scalability of job startup. For completeness, the job startup times of the process management mechanisms in other mainstream MPI implementations are also compared and analyzed.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2013, No. 8, pp. 1755-1761 (7 pages)
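Funding: National Natural Science Foundation of China (61120106005); National High Technology Research and Development Program of China ("863" Program) (2012AA01A301)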
Keywords: high performance computing; parallel job startup; process management; MPI; scalability
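
The abstract proposes a hierarchical process management structure but this record gives no details of its design. As a rough illustration only, the following sketch contrasts a flat launcher, where the root contacts every node daemon directly, with a generic k-ary tree. The function names, the `fanout` parameter, and the tree shape are all assumptions made for this example and are not taken from the paper.

```python
from typing import Optional


def tree_depth(num_nodes: int, fanout: int) -> int:
    """Levels below the root needed for a fanout-ary tree to hold num_nodes daemons.

    Hypothetical model: the root launcher counts as one of the num_nodes, and
    every daemon forwards control messages to at most `fanout` children.
    """
    if num_nodes < 1 or fanout < 2:
        raise ValueError("need num_nodes >= 1 and fanout >= 2")
    depth = 0
    reached = 1      # the root itself
    level_size = 1   # number of daemons on the current level
    while reached < num_nodes:
        level_size *= fanout
        reached += level_size
        depth += 1
    return depth


def root_connections(num_nodes: int, fanout: Optional[int] = None) -> int:
    """Number of peers the root must contact directly.

    fanout=None models a flat structure (root talks to every other daemon);
    otherwise the root talks only to its own children in the tree.
    """
    if num_nodes < 1:
        raise ValueError("need num_nodes >= 1")
    if fanout is None:
        return num_nodes - 1
    return min(fanout, num_nodes - 1)


if __name__ == "__main__":
    for n in (64, 1024, 8192):
        print(n, root_connections(n), root_connections(n, 32), tree_depth(n, 32))
```

In this simplified model the load on the root is capped at `fanout` connections while the number of forwarding levels grows only logarithmically with the node count, which is the usual motivation for tree-structured launchers. It does not model message sizes or the MPI environment initialization cost that the abstract identifies as dominant.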