Flexible and Efficient Hardware Design for Neural Network Pooling Layer

Abstract: In recent years, neural network accelerators have become a research hotspot, and the pooling layer is an important component of such accelerators. Designing the pooling layer with dedicated hardware design methods is fast and easy to modify, but it has two problems: existing pooling designs lack upward compatibility and therefore cannot be adapted to the latest neural networks, and their low degree of data reuse leads to low pooling performance. To address this, a flexible and efficient hardware design for the neural network pooling layer is proposed. The design is implemented in the Verilog hardware description language and takes as many parameters of the pooling algorithm into account as possible so that it can adapt to the latest neural networks. Two-dimensional splitting and multi-data progressive processing give it high compatibility; a line buffer improves its performance; and ping-pong buffering, pseudo padding, and extension of specific pooling kernels further reduce resource usage. The pooling layers of several neural networks were verified experimentally. At an operating frequency of 200 MHz, compared with a CPU (AMD TR Pro 3995WX), the design achieves a speedup of up to 536 times for max pooling and up to 11,248 times for average pooling. When running the pooling layers of YOLOv5, it achieves a 27-fold speedup using 3.5 times the resources of a general scheme without data reuse. When running the pooling layers of GoogleNet, it achieves a 555-fold speedup over an HLS design with nearly the same resources.
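The abstract names two-dimensional splitting and data reuse but does not detail them. The Python sketch below is one plausible reading, not the authors' Verilog design: max pooling is split into a horizontal 1-D pass and a vertical 1-D pass, and the result is checked against a direct window-by-window scan. The function names `pool2d_direct` and `pool2d_split` and the parameters `k` (kernel size) and `s` (stride) are hypothetical, and padding is not modeled.

```python
def pool2d_direct(img, k, s):
    """Reference max pooling: every k x k window is scanned independently (no reuse)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h - k + 1, s):
        row = []
        for c in range(0, w - k + 1, s):
            row.append(max(img[r + i][c + j] for i in range(k) for j in range(k)))
        out.append(row)
    return out


def pool2d_split(img, k, s):
    """Max pooling split into a horizontal 1-D pass followed by a vertical 1-D pass."""
    h, w = len(img), len(img[0])
    # Horizontal pass: each input row is reduced exactly once.
    horiz = [[max(row[c:c + k]) for c in range(0, w - k + 1, s)] for row in img]
    out_w = len(horiz[0])
    # Vertical pass: combine the horizontal results of k consecutive rows.
    return [[max(horiz[r + i][c] for i in range(k)) for c in range(out_w)]
            for r in range(0, h - k + 1, s)]


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    print(pool2d_split(image, 2, 2))  # [[6, 8], [14, 16]]
    # Overlapping windows (stride 1) give the same result as the direct scan.
    print(pool2d_split(image, 3, 1) == pool2d_direct(image, 3, 1))  # True
```

When windows overlap (stride smaller than kernel size), each row's horizontal result is shared by several vertically adjacent outputs. In hardware, a line buffer could hold such intermediate rows so they are not recomputed.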

Authors: HE Zeng; ZHU Guoquan; YUE Keqiang (School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China; Intelligent Computing Hardware Research Center, Zhijiang Laboratory, Hangzhou 311100, China)

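Affiliations: School of Electronic Information, Hangzhou Dianzi University; Intelligent Computing Hardware Research Center, Zhijiang Laboratory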

Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2023, No. 22, pp. 315-321 (7 pages)

Funding: Key Research and Development Program of Zhejiang Province (2022C01048); Zhijiang Laboratory Exploratory Project (2022PF0AN01).

Keywords: flexible and efficient; pooling; hardware acceleration; Verilog HDL; data reuse

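Classification: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]; TP399 [Automation and Computer Technology: Computer Application Technology]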

