Abstract
In recent years, convolutional neural networks (CNNs) have achieved remarkable results in object detection, semantic segmentation, and image classification. However, as accuracy requirements keep rising, model parameter counts and computational demands keep growing, which makes deploying low-latency CNN applications on resource-limited edge devices a serious challenge. Although GPUs can be used to verify accelerated CNN computation in principle, their customization cost and power consumption rule them out for practical low-power systems. In contrast, FPGAs offer both high-performance computing capability and hardware reconfigurability at low power, which makes them well suited to customized CNN acceleration. Existing customized-computing techniques built on FPGA reconfigurability can compose accelerators for varied CNN application scenarios and adapt the accelerator structure to the application to preserve power efficiency; however, the bottleneck of existing FPGA CNN accelerators lies in their poor adaptation to the CNN algorithm, which leads to large computation gaps, wasted latency, and low utilization of computing resources. In this paper, targeting the compute-intensive nature of CNNs that arises from local parameter sharing, we reorganize the dataflow structure to suit parallel computation. For resource-limited FPGA boards, we customize matrix-multiplication, convolution, pooling, and other units from the bottom up and compose them into the Ultra accelerator (UltraAcc). We also design an evaluation model for hyperparameter tuning: storage resources, computing resources, and latency are evaluated from the basic units, through the layer-level computing units, up to the whole computation chain, and are combined with the accuracy of network training, so that the entire application system is balanced and optimized on both the software and hardware sides. UltraAcc achieves an average throughput of 126.72 GOPs on the Ultra96 board, 5.47 times that of the first-place method of IEEE/ACM DAC-SDC'19 on the same platform. Using UltraAcc, our team entered the DAC-SDC'20 low-power object detection contest and won first place with an IoU of 0.65, a speed of 212.73 FPS, and an energy consumption of 1.64 kJ.
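The abstract describes reorganizing the dataflow so that convolution maps onto a customized matrix-multiplication unit. The paper's actual UltraAcc dataflow is not given on this page; the sketch below only illustrates one common reorganization of this kind (im2col, expressing convolution as a single GEMM), and all function names in it are hypothetical, introduced purely for exposition.

```python
# Minimal sketch, assuming an im2col-style dataflow reorganization;
# this is NOT the UltraAcc design, just a generic conv-as-GEMM example.
import numpy as np

def im2col(x, k, stride=1):
    """Unfold a (C, H, W) input into a (C*k*k, out_h*out_w) matrix."""
    C, H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, idx] = patch.reshape(-1)
            idx += 1
    return cols, out_h, out_w

def conv_as_matmul(x, w, stride=1):
    """Convolve a (C,H,W) input with (M,C,k,k) weights via one matrix multiply."""
    M, C, k, _ = w.shape
    cols, out_h, out_w = im2col(x, k, stride)
    y = w.reshape(M, -1) @ cols          # the only compute kernel required
    return y.reshape(M, out_h, out_w)

# Tiny usage example
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
print(conv_as_matmul(x, w).shape)        # (16, 6, 6)
```

Reorganizations of this form are attractive on FPGAs because a single well-tuned matrix-multiplication unit can then serve every convolutional layer, which is consistent with the bottom-up unit design the abstract describes.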
Authors
包振山
郭俊南
张文博
党鸿博
BAO Zhen-Shan; GUO Jun-Nan; ZHANG Wen-Bo; DANG Hong-Bo (Faculty of Information Technology, Beijing University of Technology, Beijing 100024)
Source
《计算机学报》
Indexed in: EI, CAS, CSCD, Peking University Core Journals
2023, No. 6, pp. 1139-1155 (17 pages)
Chinese Journal of Computers
Funding
Supported by the National Natural Science Foundation of China (62072016).