Abstract
As supercomputers and their programming environments evolve, multilevel parallel programming on heterogeneous architectures is becoming a trend, and the domestically developed Sunway TaihuLight supercomputer is a typical example. Since Sunway TaihuLight went into operation in 2016, many researchers in China and abroad have carried out method studies and application verification on it, accumulating a fairly rich body of many-core programming and optimization methods for the Shenwei environment. However, when the global system model CESM is ported to the Shenwei many-core environment, some two-dimensional data computations in the ocean component model POP behave inconsistently: common many-core optimization methods give good speedup at 1024 processes, but at the large scale of 16800 processes the many-core port becomes ineffective and runs slower than the original version (negative speedup). To address this problem, this paper proposes a parallel computing method based on slave-core partitions. The 64 slave cores in a core group are divided into several disjoint slave-core partitions, and independently computable code segments are assigned to different partitions to run simultaneously. This makes effective use of the slave cores' computing capacity and hides the computing time of multiple independent code segments. The number of slave cores in each partition, and their core IDs, can be chosen according to the computing task to be assigned, so that each slave core receives a suitable amount of data and computation. On top of slave-core partitioning, loop fusion and function hoisting are used to increase the parallel granularity of the program, which improves the scalability of two-dimensional data computations at large process counts. With these methods, the simulation speed of the POP component model in the high-resolution CESM G compset improves by 0.8 simulated years per day at a scale of 1.1 million cores, a clear many-core speedup.
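The slave-core partitioning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the Sunway threading interface. The function names `make_partitions` and `rows_for_core`, the kernel names, the contiguous core-ID layout, and the row-block distribution are all assumptions made for illustration; only the figure of 64 slave cores per core group comes from the abstract.

```python
from typing import Dict, List

NUM_SLAVE_CORES = 64  # slave cores per core group, as stated in the abstract


def make_partitions(sizes: Dict[str, int]) -> Dict[str, List[int]]:
    """Give each independent kernel its own disjoint set of slave-core IDs.

    Assumption: partitions are contiguous ID ranges. The abstract only says
    that the count and IDs of the cores in a partition can be chosen freely.
    """
    if any(n <= 0 for n in sizes.values()):
        raise ValueError("every partition needs at least one slave core")
    if sum(sizes.values()) > NUM_SLAVE_CORES:
        raise ValueError("partitions exceed the 64 slave cores of a core group")
    partitions: Dict[str, List[int]] = {}
    next_core = 0
    for kernel, n in sizes.items():
        partitions[kernel] = list(range(next_core, next_core + n))
        next_core += n
    return partitions


def rows_for_core(n_rows: int, cores: List[int], core_id: int) -> range:
    """Block-distribute the rows of a 2D array over the cores of one partition."""
    rank = cores.index(core_id)           # position of this core in its partition
    base, extra = divmod(n_rows, len(cores))
    start = rank * base + min(rank, extra)
    return range(start, start + base + (1 if rank < extra else 0))


# Hypothetical example: three independent 2D kernels share one core group.
parts = make_partitions({"kernel_a": 16, "kernel_b": 16, "kernel_c": 32})
first_rows = rows_for_core(100, parts["kernel_a"], parts["kernel_a"][0])
```

Sizing each partition to its kernel, rather than spreading one small 2D kernel over all 64 cores, is what keeps the per-core data volume and workload in a reasonable range in this sketch.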
Authors
庄园
郭强
张洁
曾云辉
ZHUANG Yuan; GUO Qiang; ZHANG Jie; ZENG Yun-hui (Qilu University of Technology (Shandong Academy of Sciences), Jinan 250101, China; Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan 250101, China; Shandong Provincial Key Laboratory of Computer Networks, Jinan 250101, China)
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2020, Issue 8, pp. 87-92 (6 pages)
Computer Science
Funding
National Key Research and Development Program of China (2016YFB0201100).
Keywords
Two-dimensional data computation
Shenwei many-core
Large-scale scalability
Slave-core partition
Parallel granularity