Parallel Job Startup and Its Scalability Analysis (cited by: 1)

Experiences and Scalability Analysis of Parallel Job Startup

Abstract: As the scale of HPC systems and parallel applications keeps increasing, the startup time of large-scale parallel jobs can no longer be ignored. Various efforts have been made to improve the performance of program launching and runtime environment initialization. This paper presents the experiences and results of starting MPI jobs on the Tianhe-1A supercomputer system. A detailed study of the time costs of the different stages of job startup, including control message transfer, file access, and MPI environment initialization, shows that for large-scale MPI jobs the environment initialization time dominates the job startup time. Based on this finding, some preliminary optimizations were made to reduce the data exchanged during MPI environment initialization and to avoid unnecessary data transfer costs; these optimizations notably improve job startup performance. A hierarchical, scalable process management structure for MPI environment initialization is proposed to further improve the scalability of job startup. For completeness, the job startup time is also compared with that of the process management mechanisms in other mainstream MPI implementations.
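The abstract does not describe the hierarchical process management structure in detail, so the following is only an illustrative sketch of why a hierarchy can help, not the authors' design. It assumes a hypothetical k-ary tree of process managers and compares it with a flat structure in which one manager gathers initialization data from every process directly. The function names and the `fanout` parameter are assumptions introduced for illustration.

```python
def flat_exchange(num_procs: int) -> dict:
    """Flat structure: a single manager gathers from every process directly."""
    if num_procs < 1:
        raise ValueError("need num_procs >= 1")
    return {"levels": 1, "max_fan_in": num_procs}


def hierarchical_exchange(num_procs: int, fanout: int) -> dict:
    """Hypothetical k-ary tree: each manager gathers from at most `fanout` children."""
    if num_procs < 1 or fanout < 2:
        raise ValueError("need num_procs >= 1 and fanout >= 2")
    levels = 0
    remaining = num_procs
    # Each level merges groups of up to `fanout` endpoints into one parent.
    while remaining > 1:
        remaining = -(-remaining // fanout)  # ceiling division
        levels += 1
    return {"levels": levels, "max_fan_in": min(fanout, num_procs)}


if __name__ == "__main__":
    for n in (256, 4096, 65536):
        print(n, flat_exchange(n), hierarchical_exchange(n, fanout=16))
```

In this toy model the flat manager's fan-in grows linearly with the job size, while the tree keeps each manager's fan-in bounded by `fanout` at the cost of a number of levels that grows only logarithmically.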

Authors: Cao Hongjia (曹宏嘉), Lu Yutong (卢宇彤), Xie Min (谢旻), Zhou Enqiang (周恩强)

Affiliation: College of Computer, National University of Defense Technology

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and PKU Core; 2013, No. 8, pp. 1755-1761 (7 pages)

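Funding: National Natural Science Foundation of China (61120106005); National High Technology Research and Development Program of China ("863" Program) (2012AA01A301)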

Keywords: high performance computing; parallel job startup; process management; MPI; scalability

Classification: TP316 [Automation and Computer Technology: Computer Software and Theory]
