Journal article (cited by: 1)

New 2.5D Parallel Matrix Multiplication Algorithm Based on BLACS
Abstract: Matrix-matrix multiplication is one of the most important basic operations in linear algebra and a building block of many scientific applications. As HPC (High Performance Computing) moves towards exascale, the degree of parallelism of HPC systems keeps increasing, and the communication cost of parallel matrix multiplication accounts for an ever larger share of the runtime. Reducing this communication cost, improving the scalability of parallel matrix multiplication, and making full use of the computing resources of supercomputers are therefore active research topics. This paper proposes a novel distributed parallel dense matrix multiplication algorithm, a 2.5D version of PUMMA (Parallel Universal Matrix Multiplication Algorithm). The underlying idea is to improve scalability by decreasing the data transfer volume between processors at the cost of extra memory usage: the initial processes are divided into c groups; using the extra memory on the computing nodes, each group stores copies of matrices A and B and simultaneously performs 1/c of the PUMMA algorithm; the final result is then obtained through a reduction operation. Based on the BLACS (Basic Linear Algebra Communication Subprograms) library, this paper implements a new data redistribution algorithm from the 2D to the 2.5D layout, which, combined with PUMMA, yields the 2.5D PUMMA algorithm. It can directly replace PDGEMM (Parallel Double-precision General Matrix-matrix Multiplication) and is therefore highly portable. Compared with classic 2D algorithms such as PDGEMM in ScaLAPACK (Scalable Linear Algebra PACKage), the proposed algorithm reduces the number of communications, improves data locality, and scales better. The performance experiments were carried out on a supercomputer with 300 computing nodes, each equipped with two 10-core Xeon E5-2692 v3 CPUs and interconnected by the Tianhe-2 network (TH-Express 2) with a fat-tree topology. The effectiveness and efficiency of the 2.5D PUMMA algorithm are evaluated for various matrix sizes, blocking degrees, stack sizes (duplication factors), and process counts. With a large number of processes, e.g. 4096, system tests show speedups of 2.20 to 2.93 over PDGEMM. With few processes, e.g. 64 or 256, the 2.5D PUMMA algorithm may perform worse than PDGEMM: besides the additional overhead of data redistribution, computation rather than communication then dominates. Furthermore, the 2.5D PUMMA algorithm is applied to symmetric eigenvalue problems, accelerating the tridiagonal eigenvalue decomposition step with a speedup of more than 1.2 over the original ScaLAPACK implementation. Future work includes studying 2.5D PUMMA for structured matrices such as Cauchy, Toeplitz, and Vandermonde matrices, which are widely used in scientific, industrial, and clinical applications. The performance of the 2.5D PUMMA algorithm is analyzed through extensive numerical experiments, and practical suggestions are given.
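The core scheme described above (splitting the processes into c groups, letting each group compute 1/c of the product using its replicated copies of A and B, then combining the partial results with a reduction) can be illustrated with a minimal serial sketch. This is only an illustration of the arithmetic, not the paper's implementation: the c process groups are simulated by slicing the inner (k) dimension into c chunks, and the BLACS/MPI process grids, data redistribution, and PUMMA communication pattern are all elided.

```python
import numpy as np

def matmul_25d_sketch(A, B, c):
    """Serial sketch of the 2.5D idea: partition the inner (k) dimension
    into c chunks -- one per simulated process group -- let each group
    compute its partial product independently, then reduce (sum) them."""
    k = A.shape[1]
    # Each "group" owns a contiguous ~1/c slice of the k dimension.
    bounds = np.linspace(0, k, c + 1, dtype=int)
    partials = [
        A[:, lo:hi] @ B[lo:hi, :]       # group g's 1/c share of the work
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    return sum(partials)                # the final reduction step

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))
B = rng.standard_normal((8, 5))
C = matmul_25d_sketch(A, B, c=4)
assert np.allclose(C, A @ B)          # matches the ordinary product
```

In the actual algorithm each partial product is itself computed by a 2D PUMMA run on a smaller process grid, so each group communicates only within itself; the single reduction at the end is what trades extra memory (c replicas of A and B) for less communication.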
Authors: LIAO Xia, LI Sheng-Guo, LU Yu-Tong, YANG Can-Qun (College of Computer Science, National University of Defense Technology, Changsha 410073; National Supercomputer Center in Guangzhou, Sun Yat-Sen University, Guangzhou 510006; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073)
Source: Chinese Journal of Computers (《计算机学报》), 2021, No. 5, pp. 1037-1050 (14 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Key R&D Program of the Ministry of Science and Technology (2018YFB0204301); National Key R&D Program of China (2018YFB0204303); National Natural Science Foundation of China (61872392, U1811461); National Numerical Windtunnel Project (NNW2019ZT6-B20, NNW2019ZT6-B21, NNW2019ZT5-A10); Natural Science Foundation of Guangdong Province (2018B030312002); Guangdong "Pearl River Talent Plan" Introduced Innovative and Entrepreneurial Team Program (2016ZT06D211); Hunan Provincial Natural Science Foundation General Program (2019JJ40339); university research project (ZK18-03-01).
Keywords: 2.5D parallel matrix multiplication algorithm; ScaLAPACK; PUMMA (Parallel Universal Matrix Multiplication Algorithm); SUMMA (Scalable Universal Matrix Multiplication Algorithm); distributed parallelism