Acceleration System Design for Matrix Computation (面向矩阵计算的加速系统设计)

Abstract: A general-purpose central processing unit (CPU) typically devotes most of its resources to cache management and control logic, leaving only a small fraction for computation. Adding dedicated computing modules such as graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and other programmable logic units as accelerators, and thereby building a heterogeneous multi-core system with higher computing performance, has therefore become a design trend. Following this trend, an acceleration system for matrix computation is proposed. It speeds up matrix operations through a self-developed dedicated instruction set, a purpose-built hardware accelerator array, and an optimized storage architecture. A mailbox mechanism handles communication with other systems after heterogeneous integration. A performance verification platform built with Python and the UVM verification methodology is used for register-transfer-level (RTL) performance verification. The results show that, at an operating frequency of 500 MHz, the subsystem reaches a peak performance of 32 GFLOPS, and that the computational efficiency of the general matrix multiplication (GEMM) operator improves by a factor of 12 compared with a coprocessor scheme that relies on a 2D systolic array alone.
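The headline figures in the abstract can be related by simple arithmetic. The sketch below is illustrative only: the 32 GFLOPS and 500 MHz values come from the abstract, while the function names, the convention that one multiply-accumulate counts as 2 FLOPs, and the 64x64x64 problem size are assumptions introduced here, not details of the authors' design.

```python
# Back-of-the-envelope check of the abstract's peak-performance figure.
# Only PEAK_GFLOPS and CLOCK_MHZ come from the abstract; everything else
# (function names, MAC convention, example sizes) is a hypothetical aid.

PEAK_GFLOPS = 32.0   # peak performance reported in the abstract
CLOCK_MHZ = 500.0    # operating frequency reported in the abstract


def flops_per_cycle(peak_gflops: float, clock_mhz: float) -> float:
    """Floating-point operations that must complete per clock cycle."""
    return (peak_gflops * 1e9) / (clock_mhz * 1e6)


def gemm_flops(m: int, n: int, k: int) -> int:
    """Conventional FLOP count for C = A @ B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def ideal_gemm_time_us(m: int, n: int, k: int, peak_gflops: float) -> float:
    """Lower bound on GEMM run time in microseconds at 100% utilization."""
    return gemm_flops(m, n, k) / (peak_gflops * 1e9) * 1e6


per_cycle = flops_per_cycle(PEAK_GFLOPS, CLOCK_MHZ)
# ASSUMPTION: one multiply-accumulate (MAC) is counted as 2 FLOPs.
macs_per_cycle = per_cycle / 2

print(f"FLOPs per cycle: {per_cycle}")
print(f"MACs per cycle (assuming 1 MAC = 2 FLOPs): {macs_per_cycle}")
print(f"Ideal 64x64x64 GEMM time: {ideal_gemm_time_us(64, 64, 64, PEAK_GFLOPS):.3f} us")
```

Under these assumptions, the stated peak corresponds to 64 floating-point operations per cycle. The ideal-time function gives only a lower bound; actual efficiency depends on how well the operator mapping keeps the accelerator array busy.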

Authors: SUN Changjiang (孙长江), LI Huang (李皇), WANG Wenqing (王文青) (Shenzhen Statemicro Electronics Co., Ltd., Shenzhen 518057, China)

Affiliation: Shenzhen Statemicro Electronics Co., Ltd. (深圳市国微电子有限公司)

Source: Electronics & Packaging (《电子与封装》), 2023, No. 4, pp. 51-59 (9 pages)

Keywords: matrix computation; heterogeneous; hardware accelerator; operator mapping

Classification: TP302.1 [Automation and Computer Technology: Computer Architecture]
