Abstract
ScaLAPACK (Scalable Linear Algebra PACKage) is a parallel computing library for distributed-memory MIMD (Multiple Instruction, Multiple Data) parallel computers and is widely used to develop parallel applications built on linear algebra operations. However, its LU factorization routines are not communication-optimal and do not fully exploit current parallel architectures. To address this, a parallel LU factorization optimization algorithm (Parallel LU Factorization, PLF) based on the Kunpeng processor is proposed, which achieves load balancing and is adapted to the domestic Kunpeng environment. PLF treats the data in different partitions of different processes differently: part of the data owned by each process is assigned to the root process for computation, and the root process then scatters the results back to the child processes, which helps to make full use of CPU resources and balance the load. Tests were run in single-node environments on an Intel 9320R processor and a Kunpeng 920 processor, using Intel MKL (Math Kernel Library) on the Intel platform and the PLF algorithm on the Kunpeng platform. Comparing the two platforms on systems of equations of different sizes shows that the Kunpeng platform has a significant advantage in solve performance. With the number of processes equal to the number of NUMA nodes and a single thread, the optimized computing efficiency averages 4.35% for small-scale problems, 215% higher than Intel's 1.38%; 4.24% for medium-scale problems, 118% higher than the Intel platform's 1.86%; and 4.24% for large-scale problems, 113% higher than Intel's 1.99%.
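The gather, compute on root, scatter back pattern described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's PLF implementation: the function names (`gather_to_root`, `lu_in_place`, `scatter_from_root`, `simulate_root_factorization`) are hypothetical, the "processes" are plain Python lists rather than MPI ranks, and the root simply runs a textbook LU factorization with partial pivoting on the gathered rows.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def gather_to_root(local_blocks: List[Matrix]) -> Matrix:
    """Copy every process's rows into one matrix on the root (stand-in for a gather)."""
    return [row[:] for block in local_blocks for row in block]


def lu_in_place(a: Matrix) -> List[int]:
    """Textbook LU with partial pivoting on a square matrix.

    Overwrites `a` with U on and above the diagonal and the multipliers of L
    below it. Returns the row permutation that was applied.
    """
    n = len(a)
    perm = list(range(n))
    for k in range(n):
        # Pick the row with the largest entry in column k as the pivot row.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        # Eliminate column k below the diagonal, storing the multipliers.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return perm


def scatter_from_root(a: Matrix, sizes: List[int]) -> List[Matrix]:
    """Hand the factored rows back in chunks matching the original row counts."""
    blocks, start = [], 0
    for size in sizes:
        blocks.append(a[start:start + size])
        start += size
    return blocks


def simulate_root_factorization(local_blocks: List[Matrix]) -> Tuple[List[Matrix], List[int]]:
    """Gather rows on the root, factor them there, and scatter the result back."""
    sizes = [len(block) for block in local_blocks]
    gathered = gather_to_root(local_blocks)
    perm = lu_in_place(gathered)
    return scatter_from_root(gathered, sizes), perm


if __name__ == "__main__":
    # Two simulated processes, each owning one row of a 2x2 matrix.
    blocks, perm = simulate_root_factorization([[[4.0, 3.0]], [[6.0, 3.0]]])
    print(perm, blocks)
```

In an actual distributed-memory program the gather and scatter steps would typically map to MPI collectives such as `MPI_Gatherv` and `MPI_Scatterv`; how PLF selects which part of each process's data goes to the root is not specified in the abstract and is not modeled here.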
Authors
徐鹤
周涛
李鹏
秦芳芳
季一木
XU He; ZHOU Tao; LI Peng; QIN Fangfang; JI Yimu (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Jiangsu HPC and Intelligent Processing Engineer Research Center, Nanjing 210023, China; College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals
2024, No. 9, pp. 51-58 (8 pages)
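Funding
National Natural Science Foundation of China (62102194, 62102196)
Jiangsu Province Six Talent Peaks High-Level Talent Project (RJFW-111)
Jiangsu Province Postgraduate Practice Innovation Program (SJCX22_0267, SJCX22_0275)
Huawei Kunpeng Zhongzhi Program (2022外241, 2022外243).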