Abstract
Existing FPGA implementations of the convolutional neural network model ZynqNet suffer from low parallelism in the convolution units and a storage structure that relies heavily on off-chip memory. This paper proposes an optimized FPGA accelerator design for ZynqNet that is also easy to apply to other CNN models. A double-buffering structure keeps intermediate results on chip to reduce off-chip memory accesses; the data width is reduced from 32 bits to 16 bits; and a parallel structure with 64 convolution units is designed to increase computational parallelism. Experimental results show that, at the same ImageNet test accuracy, the proposed design runs at up to 200 MHz and reaches a peak throughput of 1.85 GMAC/s, which is 10 times that of the original ZynqNet implementation and a 20-fold speedup over an i5-5200U CPU. Its energy efficiency (performance per watt) is 5.4 times that of the NVIDIA GTX 970 GPU version.
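As a rough illustration of two of the techniques named in the abstract (16-bit data and double buffering of intermediate results), the following Python sketch models them in software. It is not the authors' hardware design: the Q8.8 fixed-point split, the saturation behaviour, and all names (`to_fixed16`, `from_fixed16`, `PingPongBuffer`) are assumptions made for this example only.

```python
# Illustrative sketch only. The Q8.8 format and every name here are
# assumptions for this example, not details taken from the paper.

FRAC_BITS = 8  # assumed split: 8 integer bits, 8 fractional bits


def to_fixed16(x, frac_bits=FRAC_BITS):
    """Quantize a float to a signed 16-bit fixed-point integer, saturating on overflow."""
    q = int(round(x * (1 << frac_bits)))  # Python rounds halves to even
    return max(-32768, min(32767, q))


def from_fixed16(q, frac_bits=FRAC_BITS):
    """Convert a 16-bit fixed-point integer back to a float."""
    return q / (1 << frac_bits)


class PingPongBuffer:
    """Two buffers standing in for on-chip memory: one is read as the
    layer input while the other receives the layer output, then they swap."""

    def __init__(self, initial):
        self.bufs = [list(initial), []]
        self.read_idx = 0

    @property
    def current(self):
        return self.bufs[self.read_idx]

    def run_layer(self, layer_fn):
        # Read from the active buffer, write into the idle one, then swap roles,
        # so intermediate results never leave the two "on-chip" buffers.
        src = self.bufs[self.read_idx]
        self.bufs[1 - self.read_idx] = layer_fn(src)
        self.read_idx = 1 - self.read_idx
```

For example, `to_fixed16(1.5)` yields `384` under the assumed Q8.8 format, and calling `run_layer` twice on a `PingPongBuffer` returns the read role to the first buffer.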
Authors
仇越
马文涛
柴志雷
QIU Yue; MA Wen-tao; CHAI Zhi-lei (School of Internet of Things, Jiangnan University, Wuxi 214122, China; State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125)
Source
《微电子学与计算机》
CSCD
Peking University Core Journals (北大核心)
2018, No. 8, pp. 68-72, 77 (6 pages in total)
Microelectronics & Computer
Funding
Open Fund of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2015A07)