Vectorization of Matrix Multiplication for Multi-Core Vector Processors (面向多核向量处理器的矩阵乘法向量化方法)

Cited by: 9
Abstract: Dense matrix multiplication is a core computation in many algorithms for large-scale scientific computing. This paper presents an efficient method for vectorizing matrix multiplication on multi-core vector processors.

The method computes the result row by row: all elements of one row of the C matrix are computed at the same time. The values in row i of C are obtained through k vector multiply-accumulate operations. In each operation, the j-th element of row i of the A matrix is expanded into a vector of identical values, which is then multiplied by the row-j vector of the B matrix and accumulated. Each vector multiply-accumulate runs in parallel across the VPEs. Source data and results are kept in the local registers of each VPE, and all multiply-accumulate operations that contribute to a given result are performed on the same VPE. The data of A, B, and C are all read in row order, which gives high memory-access efficiency, and the whole of row i of C is complete when the k loop ends. The method fully exploits the vector processor's ability to load scalar and vector data cooperatively and reduces the demand on DDR memory bandwidth. It also avoids inefficient accesses to column vectors of the multiplier matrix and floating-point reduction sums across VPEs, and so achieves optimal kernel performance.

The processor's level-1 data cache and array memory are configured in SRAM access mode, which avoids the access latency caused by cache misses and improves the efficiency with which the kernel accesses both. Multicast DMA is used to transfer matrix data, which significantly improves the efficiency of reading matrix data from DDR.

The blocking parameters of the core sub-block matrices are chosen from architectural features of the vector processor: the number of vector processing units (VPEs), the number of FMAC units per VPE, the capacity of the vector memory, and the data type of the matrix elements. This design exploits parallelism at several levels: data parallelism across cores, vector SIMD parallelism across the VPEs within a core, parallelism across the FMAC units within a VPE, and scalar and vector instruction-level parallelism within a VPE. Loops are fully unrolled according to the FMAC instruction delay slots, so that the kernel always runs at peak speed.

A data-movement strategy based on two-level DMA double buffering optimizes and smooths data transfers across the multilevel memory hierarchy. DMA transfer time is fully overlapped with kernel computation time, so the whole matrix computation runs at close to the speed of the kernel and achieves optimal performance and efficiency.

Experimental results on MATRIX2 show that the proposed double-precision matrix multiplication reaches 1106.88 GFLOPS, an efficiency of 96.08%, and that the efficiency of the kernel computation reaches 99.39%.
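The row-wise scheme described in the abstract can be illustrated with a minimal scalar sketch in Python. It is not the authors' MATRIX2 kernel: the function name `matmul_rowwise` and the use of plain lists are assumptions made for illustration, and the innermost loop stands in for the lanes that the hardware would process in parallel across VPEs.

```python
def matmul_rowwise(A, B):
    """Compute C = A * B one row of C at a time (illustrative sketch only).

    A is m x k, B is k x n, both given as lists of rows.
    """
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"

    C = []
    for i in range(m):
        # Accumulator for row i of C; on a vector processor each element
        # would stay in the local register of one VPE for the whole k loop.
        acc = [0.0] * n
        for j in range(k):
            a_ij = A[i][j]   # scalar element of A, broadcast to every lane
            b_row = B[j]     # row j of B, read contiguously (no column access)
            # Each lane accumulates into its own element only, so no
            # reduction across lanes is ever needed.
            for lane in range(n):
                acc[lane] += a_ij * b_row[lane]
        C.append(acc)        # row i is complete when the k loop ends
    return C


if __name__ == "__main__":
    print(matmul_rowwise([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because every access to A, B, and C walks along a row, the sketch never touches a column of B, which is the property the abstract credits for the high memory-access efficiency.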
Authors: LIU Zhong (刘仲); TIAN Xi (田希) (College of Computer, National University of Defense Technology, Changsha 410073)
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2018, No. 10, pp. 2251-2264 (14 pages)
Funding: Supported by the National Natural Science Foundation of China (61572025, 61472432)
Keywords: multi-core vector processor; high performance computing; matrix multiplication; blocked matrix; vectorization