Design of a Tensor Core Based on the "Ventus" (承影) GPGPU


Abstract: To meet the growing demand of neural networks for computing power and generality, a tensor processor (tensor core) is designed on top of the open-source "Ventus" (承影) GPGPU project to accelerate convolution and general matrix multiplication (GEMM). First, existing tensor processor designs and their corresponding algorithms are analyzed and compared with direct convolution to characterize the performance differences. Next, a tensor processor based on a three-dimensional multiplication tree structure is proposed and deployed on a Xilinx VCU128 development board, where it operates at 222 MHz. An exponential operation unit is also developed to assist neural network computation; it operates at 159 MHz on the same board. Finally, the functional correctness of the tensor processor is verified with assembly programs. With the tensor processor in place, the expected execution time is significantly reduced. These findings contribute to the advancement of hardware acceleration for deep learning applications and provide a foundation for further research in this field.
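The abstract contrasts direct convolution with GEMM-based approaches but does not give the algorithms. As a rough illustration only, the sketch below shows one common way to lower a 2D convolution to a single matrix multiplication (im2col), with each dot product reduced through a pairwise adder tree. The function names, the single-channel stride-1 setup, and the tree reduction are assumptions made for this example; they are not taken from the paper and do not describe the actual "Ventus" hardware.

```python
from typing import List

Matrix = List[List[float]]


def tree_sum(values: List[float]) -> float:
    """Reduce a list pairwise, level by level, like a hardware adder tree."""
    if not values:
        return 0.0
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """C = A x B; each output element is k products fed into an adder tree."""
    m, k, n = len(a), len(b), len(b[0])
    return [[tree_sum([a[i][p] * b[p][j] for p in range(k)]) for j in range(n)]
            for i in range(m)]


def conv2d_direct(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding, stride 1) 2D cross-correlation, computed directly."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(ow)] for y in range(oh)]


def conv2d_via_gemm(image: Matrix, kernel: Matrix) -> Matrix:
    """The same convolution, lowered to one GEMM via im2col."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    # Each row of `patches` is one flattened kh*kw input window.
    patches = [[image[y + dy][x + dx] for dy in range(kh) for dx in range(kw)]
               for y in range(oh) for x in range(ow)]
    # The kernel becomes a (kh*kw) x 1 column vector.
    column = [[kernel[dy][dx]] for dy in range(kh) for dx in range(kw)]
    flat = gemm(patches, column)
    # Fold the (oh*ow) x 1 result back into an oh x ow output map.
    return [[flat[y * ow + x][0] for x in range(ow)] for y in range(oh)]
```

Both routines compute the same output; the GEMM form trades extra memory for the unrolled patches against a regular multiply-accumulate pattern, which is the kind of workload a matrix unit is built to accelerate.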

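Authors: SHI Yujie; YANG Kexiang; LIU Xudong; HE Hu (School of Integrated Circuits, Tsinghua University, Beijing 100084, China)

Affiliation: School of Integrated Circuits, Tsinghua University

Source: Microelectronics & Computer (《微电子学与计算机》), 2024, No. 5, pp. 109-116 (8 pages)

Keywords: general-purpose graphics processing unit (GPGPU); tensor core; convolution; general matrix multiplication; exponential operation

Classification number: TN47 [Electronics and Telecommunications: Microelectronics and Solid-State Electronics]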

