
Many-core Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro
Abstract BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements high-performance Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among CPEs (the processor's slave cores) is designed on top of the RMA communication mechanism, which improves the reduction efficiency of several Level 1 and Level 2 BLAS routines. For routines with data dependencies, such as TRSV and TPSV, an efficient parallel algorithm is proposed. It maintains the data dependencies through point-to-point synchronization and uses a task mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations among CPEs and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further improve the memory access bandwidth utilization of the Level 1 and Level 2 BLAS routines. Experimental results show that the memory access bandwidth utilization of the Level 1 BLAS routines reaches up to 95% and exceeds 90% on average, while that of the Level 2 BLAS routines reaches up to 98% and exceeds 80% on average. Compared with the widely used open-source math library GotoBLAS, the Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. By calling the proposed high-performance Level 1 and Level 2 BLAS routines, LU decomposition, QR decomposition, and the symmetric eigenvalue problem achieve an average speedup of 10.99 times.
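The data dependency that the abstract attributes to TRSV can be illustrated with a small sketch. It is not the paper's algorithm: it is plain forward substitution for a lower-triangular system, followed by a textbook blocked variant, written in pure Python. The function names `trsv_lower` and `trsv_lower_blocked` and the `block` parameter are hypothetical and chosen only for this illustration. In the blocked variant, each diagonal block must wait for the solutions of all earlier blocks, while the trailing update is independent across rows. On a many-core processor, that waiting is presumably the kind of dependency the point-to-point synchronization has to enforce.

```python
def trsv_lower(L, b):
    """Solve L x = b for a lower-triangular L by forward substitution."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # x[i] depends on every previously computed x[j] with j < i.
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x


def trsv_lower_blocked(L, b, block=2):
    """Blocked forward substitution (illustrative; `block` is an assumed tuning knob)."""
    n = len(b)
    x = [0.0] * n
    rhs = list(b)  # working copy of the right-hand side
    for start in range(0, n, block):
        end = min(start + block, n)
        # Diagonal block: a small sequential solve. It can only begin once
        # all earlier blocks have folded their contribution into rhs.
        for i in range(start, end):
            s = sum(L[i][j] * x[j] for j in range(start, i))
            x[i] = (rhs[i] - s) / L[i][i]
        # Trailing update: each remaining row is independent of the others,
        # so this loop is the part that could be spread across cores.
        for i in range(end, n):
            rhs[i] -= sum(L[i][j] * x[j] for j in range(start, end))
    return x


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [4.0, 5.0, 6.0]]
    b = [2.0, 7.0, 32.0]
    print(trsv_lower(L, b))
    print(trsv_lower_blocked(L, b, block=2))
```

Both functions should return the same solution. The blocked form only reorders the arithmetic so that the dependent part (the diagonal solves) is separated from the independent part (the trailing updates).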
Authors HU Yi (胡怡); CHEN Dao-Kun (陈道琨); YANG Chao (杨超); LIU Fang-Fang (刘芳芳); MA Wen-Jing (马文静); YIN Wan-Wang (尹万旺); YUAN Xin-Hui (袁欣辉); LIN Rong-Fen (林蓉芬) (Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China)
Source Journal of Software (《软件学报》), 2023, No. 9, pp. 4421-4436 (16 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding National Key Research and Development Program of China (2020YFB0204601).
Keywords level 1 BLAS; level 2 BLAS; memory access bandwidth; SW26010-Pro (Sunway 26010-Pro) many-core processor; RMA communication; point-to-point synchronization; adaptive optimization
