
CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm (cited by: 2)
Abstract Improving the efficiency of heterogeneous HPL (High-Performance Linpack) requires fully exploiting the computing power of both the accelerators and the general-purpose CPUs. The accelerators integrate more computing cores and carry out most of the computation, while the general-purpose CPU handles task scheduling and also takes part in the computation. Provided that tasks are partitioned sensibly and the load is balanced, optimizing CPU-side computing performance is particularly important for overall efficiency. Optimizing the BLAS (Basic Linear Algebra Subprograms) functions for the architectural characteristics of a specific platform can often make fuller use of the CPU's computing capability and improve overall system efficiency. BLIS (BLAS-like Library Instantiation Software) is an open-source BLAS framework that is easy to develop with, portable, and modular. Based on the architecture of the heterogeneous platform and the characteristics of the HPL algorithm, this study optimizes the BLAS functions of every level called on the CPU side by making full use of the three-level cache, vectorized instructions, and multi-threaded parallelism, and applies auto-tuning to optimize the matrix blocking parameters. The result is HBLIS, a BLIS library optimized for heterogeneous environments. Compared with MKL, HPL using HBLIS achieves 11.8% higher overall performance.
Authors 蔡雨 (CAI Yu); 孙成国 (SUN Cheng-Guo); 杜朝晖 (DU Zhao-Hui); 刘子行 (LIU Zi-Xing); 康梦博 (KANG Meng-Bo); 李双双 (LI Shuang-Shuang) (Information Technology Co., Ltd., Suzhou 215000, China)
Source Journal of Software (《软件学报》), 2021, No. 8, pp. 2289-2306 (18 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Keywords BLAS; genetic algorithm; auto-tuning; vectorized instructions; data prefetching; multi-threaded parallelization
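
The abstract mentions auto-tuning of the matrix blocking parameters. The following is a minimal sketch of that general idea, not the paper's implementation: it times a cache-blocked matrix multiply for a few candidate block sizes and keeps the fastest. The function names `blocked_matmul` and `tune_block_sizes` are hypothetical, the parameter names `mc`, `kc`, and `nc` are only borrowed from common GEMM blocking terminology, and an exhaustive search over a small candidate list stands in for the genetic-algorithm search listed in the keywords. Pure Python will not reproduce real cache behavior, so the timings are illustrative only.

```python
import random
import time


def blocked_matmul(a, b, n, mc, kc, nc):
    """Multiply two n x n matrices (lists of lists) in mc x kc x nc blocks."""
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, nc):          # block of columns of B and C
        for kk in range(0, n, kc):      # block of the shared dimension
            for ii in range(0, n, mc):  # block of rows of A and C
                for i in range(ii, min(ii + mc, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + kc, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + nc, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def tune_block_sizes(n, candidates, repeats=1, seed=0):
    """Time every (mc, kc, nc) candidate and return the fastest one.

    Assumption: exhaustive search replaces the genetic algorithm here.
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    best, best_time = None, float("inf")
    for mc, kc, nc in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            blocked_matmul(a, b, n, mc, kc, nc)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = (mc, kc, nc), elapsed
    return best
```

A call such as `tune_block_sizes(64, [(8, 8, 8), (16, 16, 16), (32, 32, 32)])` returns whichever candidate ran fastest on the current machine. In a real BLAS library the same loop structure would wrap a vectorized micro-kernel, and the search space would also have to respect the cache sizes of the target CPU.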
