Many-core Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro

Abstract: BLAS (Basic Linear Algebra Subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements high-performance Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among CPEs (slave cores) is designed on top of the RMA communication mechanism, which improves the reduction efficiency of several Level 1 and Level 2 BLAS routines. For routines with data dependencies, such as TRSV and TPSV, an efficient parallel algorithm is proposed. It maintains the data dependencies through point-to-point synchronization and uses a task mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations among CPEs and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further improve the memory access bandwidth utilization of the Level 1 and Level 2 BLAS routines. Experimental results show that the memory access bandwidth utilization of the Level 1 BLAS routines reaches up to 95% and exceeds 90% on average, while that of the Level 2 BLAS routines reaches up to 98% and exceeds 80% on average. Compared with the widely used open-source math library GotoBLAS, the Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. LU decomposition, QR decomposition, and the symmetric eigenvalue problem achieve an average speedup of 10.99 times when they call the proposed Level 1 and Level 2 BLAS routines.
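To illustrate the kind of data dependency the abstract attributes to TRSV-type routines, here is a minimal sequential sketch in Python. It is not the authors' SW26010-Pro implementation: the function name `trsv_lower_blocked` and the `block` parameter are hypothetical, and the comments only mark where a parallel version would presumably need a point-to-point synchronization.

```python
def trsv_lower_blocked(L, b, block=2):
    """Solve L x = b for a lower-triangular matrix L, one column block at a time.

    Illustrative sequential sketch only; names and blocking are assumptions,
    not taken from the paper.
    """
    n = len(b)
    x = [float(v) for v in b]  # x starts as a copy of b and becomes the solution
    for start in range(0, n, block):
        end = min(start + block, n)
        # Diagonal block: forward substitution within the block.
        # In a parallel version, this step would have to wait until every
        # update from earlier blocks to x[start:end] has been applied;
        # this is the data dependency that synchronization must preserve.
        for i in range(start, end):
            for j in range(start, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        # Off-diagonal update: each row below the block subtracts the
        # contribution of the newly solved entries. The rows are independent
        # of one another, so this part could be split across workers.
        for i in range(end, n):
            for j in range(start, end):
                x[i] -= L[i][j] * x[j]
    return x


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 1.0, 0.0],
         [4.0, 2.0, 2.0]]
    b = [2.0, 3.0, 14.0]
    print(trsv_lower_blocked(L, b))
```

The sketch shows why the triangular solve is harder to parallelize than a plain matrix-vector product: the diagonal blocks form a chain, and only the updates below each block are free of ordering constraints.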

Authors: HU Yi; CHEN Dao-Kun; YANG Chao; LIU Fang-Fang; MA Wen-Jing; YIN Wan-Wang; YUAN Xin-Hui; LIN Rong-Fen

Affiliations: Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China

Source: Journal of Software (软件学报; indexed in EI, CSCD, and the Peking University Core Journals list), 2023, No. 9, pp. 4421-4436 (16 pages)

Funding: National Key Research and Development Program of China (2020YFB0204601).

Keywords: Level 1 BLAS; Level 2 BLAS; memory access bandwidth; Sunway 26010-Pro (SW26010-Pro) many-core processor; RMA communication; point-to-point synchronization; adaptive optimization

Classification: TP303 [Automation and Computer Technology: Computer System Architecture]
