Acceleration System Design for Matrix Computation (面向矩阵计算的加速系统设计)
Abstract: A general-purpose central processing unit (CPU) typically spends most of its resources on cache management and control logic, leaving only a small share for computation. As a result, it has become common to build heterogeneous multi-core systems by adding dedicated computing modules as accelerators, such as graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and other programmable logic units, in order to raise computing performance. Following this trend, the paper proposes an acceleration system for matrix computation that speeds up matrix operations through a self-developed dedicated instruction set, a purpose-built hardware accelerator array, and an optimized storage architecture. A mailbox mechanism handles communication with other systems after heterogeneous integration. A performance verification platform built with Python and the UVM verification methodology is used to carry out register-transfer-level (RTL) performance verification. The results show that, at an operating frequency of 500 MHz, the subsystem reaches a peak performance of 32 GFLOPS, and that the computational efficiency of the general matrix multiplication (GEMM) operator improves by a factor of 12 compared with a coprocessor scheme that uses only a two-dimensional systolic array for acceleration.
Authors: SUN Changjiang (孙长江), LI Huang (李皇), WANG Wenqing (王文青) (Shenzhen Statemicro Electronics Co., Ltd., Shenzhen 518057, China)
Source: Electronics & Packaging (《电子与封装》), 2023, No. 4, pp. 51-59 (9 pages)
Keywords: matrix computation; heterogeneous; hardware accelerator; operator mapping
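
As a rough illustration of the throughput figures quoted in the abstract, the following sketch converts 32 GFLOPS at 500 MHz into floating-point operations per cycle and estimates a lower bound on the cycles a GEMM of a given size would need at that peak rate. Only the two headline numbers come from the abstract. The function names, the convention of counting one multiply-accumulate as two FLOPs, and the plain triple-loop reference GEMM (the kind of golden model a Python-based verification platform might compare against) are assumptions made for illustration, not the authors' design.

```python
import math

# Figures quoted in the abstract.
CLOCK_HZ = 500e6    # 500 MHz operating frequency
PEAK_FLOPS = 32e9   # 32 GFLOPS peak performance


def flops_per_cycle(peak_flops=PEAK_FLOPS, clock_hz=CLOCK_HZ):
    """Peak floating-point operations completed per clock cycle."""
    return peak_flops / clock_hz


def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A of size m x k and B of size k x n.

    Assumption: each multiply-accumulate counts as two FLOPs.
    """
    return 2 * m * n * k


def min_cycles(m, n, k):
    """Lower bound on cycles at peak rate (ignores memory and control stalls)."""
    return math.ceil(gemm_flops(m, n, k) / flops_per_cycle())


def gemm_reference(a, b):
    """Hypothetical golden-model GEMM on nested lists, for checking results."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for p in range(inner):
                c[i][j] += a[i][p] * b[p][j]
    return c


if __name__ == "__main__":
    print(flops_per_cycle())       # FLOPs per cycle at peak
    print(min_cycles(64, 64, 64))  # ideal cycle count for a 64x64x64 GEMM
    print(gemm_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Under these assumptions the peak corresponds to 64 FLOPs, or 32 multiply-accumulates, per cycle. The abstract does not say how these are distributed across the accelerator array, so the sketch does not model the array itself.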