Abstract
This paper studies and analyzes parallel optimization algorithms for GRAPES_CUACE, an atmospheric chemistry coupled model formed by the online coupling of the numerical weather prediction model GRAPES_MESO (version 4.0) with the atmospheric chemistry model CUACE, on different versions of the x86 architecture. Drawing on mainstream parallel optimization design methods used in China and abroad, and taking into account the program architecture and parallel framework of GRAPES_MESO itself, the code was parallelized accordingly for each x86 version. Profiling with gprof and with instrumented timers inserted into the code showed that the hotspots fall into three parts: IO, communication, and physical processes. The IO module was optimized in three ways: (1) replacing discrete reads and writes with contiguous ones; (2) allocating buffers so that sparse memory accesses become contiguous; (3) asynchronous IO. Communication was optimized in two ways: (1) replacing fine-grained communication with coarse-grained communication; (2) using collective communication with lower time complexity. After the IO and communication optimizations, the share of run time spent in the IO module fell from 43.7% to 1.41%, and the best-performing part ran 317 times faster, so these methods greatly improve the efficiency of the IO module. The physical processes were optimized mainly by: (1) making the computation in multi-level loops contiguous rather than discrete; (2) hoisting communication out of loops; (3) reusing data to reduce redundant computation; (4) reducing the space used by stack variables. These optimizations improved computational performance by 22%, further improving the parallel efficiency of the program and the strong scalability of the model.
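The first two IO optimizations named in the abstract (contiguous instead of discrete writes, and a buffer that turns sparse memory access into contiguous access) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the row-major field layout, and the use of an in-memory stream are all assumptions made for illustration.

```python
import io
import struct


def write_scattered(stream, field, nx, ny, column):
    """Baseline (hypothetical): one small write per strided element."""
    calls = 0
    for j in range(ny):
        # Elements of one column are nx apart in a row-major field.
        stream.write(struct.pack("<d", field[j * nx + column]))
        calls += 1
    return calls


def write_buffered(stream, field, nx, ny, column):
    """Pack the strided column into a contiguous buffer, then write once."""
    buf = bytearray(8 * ny)  # 8 bytes per little-endian double
    for j in range(ny):
        struct.pack_into("<d", buf, 8 * j, field[j * nx + column])
    stream.write(buf)  # a single contiguous write
    return 1


# Illustrative data: a 4 x 3 field stored row by row.
nx, ny = 4, 3
field = [float(i) for i in range(nx * ny)]

scattered, buffered = io.BytesIO(), io.BytesIO()
scattered_calls = write_scattered(scattered, field, nx, ny, column=1)
buffered_calls = write_buffered(buffered, field, nx, ny, column=1)
```

Both functions produce the same bytes, but the buffered version issues one write call instead of one per element, which is the effect the abstract attributes to moving from discrete to contiguous IO.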
Authors
YE Yue-jin
CHEN De-xun
HU Jiang-kai
MA Xin
ZHANG Xiao-ye
Affiliations: Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China; Numerical Weather Prediction Center of CMA, Beijing 100081, China; Chinese Academy of Meteorological Sciences, Beijing 100081, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals
2019, Issue S11, pp. 528-534 (7 pages)
Funding
National Key Research and Development Program of China (2016YFC0203300)
National Major Special Project Fund (2016YFA0602202, 2017YFB0202603)
Keywords
Asynchronous IO
Coarse-grained
Continuous memory access
Collective communication