Abstract
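Convolutional neural networks (CNNs) are an indispensable component of today's mainstream vision algorithms. To speed up CNN inference, researchers have proposed many heterogeneous acceleration methods that meet the varied acceleration needs of different scenarios. However, accelerating CNNs stably and efficiently on in-orbit satellites, where resources and energy are constrained, remains highly challenging. To this end, this paper uses software-hardware co-design to optimize three key parts of accelerator design: microinstruction encoding, instruction-level parallelism, and operation-level parallelism. On the Xilinx VX690T FPGA, a chip commonly used on satellites, we design and implement a CNN accelerator whose dataflow is scheduled by microinstruction sequences. At the software level, we propose an extensible microinstruction encoding format and a corresponding compilation method; graph-level optimization is performed through convolution loop tiling and operator fusion, generating microinstruction sequences that the accelerator can execute. At the hardware level, we design and implement an RTL-level CNN accelerator composed of a microcontroller and a logic operator. The microcontroller executes the different types of instructions in parallel through a coarse-grained pipeline. The logic operator performs fine-grained parallel computation of convolution operators on a compute array built by cascading DSP48E1 resources. Experimental results show that the accelerator has a design power of 10.68 W; when accelerating the YOLOV3Tiny algorithm, its peak throughput (Runtime Max Throughput, RMT) reaches 378.63 GOP/s and its compute resource utilization efficiency (MAC Efficiency, ME) reaches 91.5%. Compared with a typical GPU acceleration method, the accelerator achieves a 14-fold improvement in energy efficiency; compared with similar FPGA accelerators, ME improves by more than 6.9%.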
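As a rough illustration of the convolution loop tiling mentioned above, the following Python sketch splits the channel loops of a convolution into tiles and checks that the result matches the untiled loop nest. The function names, the tile sizes `t_co` and `t_ci`, and the choice to tile only the channel dimensions are assumptions made for illustration; the abstract does not specify the tiling scheme that the compiler actually uses.

```python
def conv2d_direct(x, w):
    """Direct convolution, stride 1, no padding.

    x: input as nested lists  [C_in][H][W]
    w: weights as nested lists [C_out][C_in][K][K]
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    oh, ow = h - k + 1, wd - k + 1
    y = [[[0] * ow for _ in range(oh)] for _ in range(c_out)]
    for co in range(c_out):
        for ci in range(c_in):
            for i in range(oh):
                for j in range(ow):
                    for ki in range(k):
                        for kj in range(k):
                            y[co][i][j] += x[ci][i + ki][j + kj] * w[co][ci][ki][kj]
    return y


def conv2d_tiled(x, w, t_co=2, t_ci=2):
    """Same convolution with the channel loops split into tiles.

    Each (co0, ci0) pair selects a block of at most t_co x t_ci channel
    pairs. On an accelerator such a block could be mapped onto a fixed-size
    MAC array; here the tile sizes are hypothetical and only reorder the work.
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    oh, ow = h - k + 1, wd - k + 1
    y = [[[0] * ow for _ in range(oh)] for _ in range(c_out)]
    for co0 in range(0, c_out, t_co):            # outer loop over output-channel tiles
        for ci0 in range(0, c_in, t_ci):         # outer loop over input-channel tiles
            for co in range(co0, min(co0 + t_co, c_out)):
                for ci in range(ci0, min(ci0 + t_ci, c_in)):
                    for i in range(oh):
                        for j in range(ow):
                            for ki in range(k):
                                for kj in range(k):
                                    y[co][i][j] += x[ci][i + ki][j + kj] * w[co][ci][ki][kj]
    return y


# Small deterministic integer test data: 3 input channels, 5x5 image,
# 4 output channels, 3x3 kernels.
x = [[[(c * 7 + i * 3 + j) % 5 for j in range(5)] for i in range(5)] for c in range(3)]
w = [[[[(o + c + a * 2 + b) % 3 - 1 for b in range(3)] for a in range(3)]
      for c in range(3)] for o in range(4)]

y_direct = conv2d_direct(x, w)
y_tiled = conv2d_tiled(x, w, t_co=2, t_ci=2)
```

Because tiling only changes the order in which partial sums are accumulated, `y_tiled` equals `y_direct` exactly for integer data; tile sizes that do not divide the channel counts are handled by the `min(...)` bounds.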
Recently, with the evolution of space remote sensing technology, Earth observation has been gradually shifting from single satellites to constellations of light, small satellites. A constellation of several high-resolution satellites collects hundreds of terabytes (TB) of remote sensing image (RSI) data every day. The traditional satellite-to-ground data transmission mechanism can no longer keep up with the volume of remote sensing data to be processed, so in-orbit satellites need to improve their data processing capabilities to handle increasingly complex observation missions. Meanwhile, in the field of RSI processing, deep learning algorithms based on convolutional neural networks (CNNs) have become the mainstream method because of their excellent performance. However, CNNs are both computation-intensive and memory-intensive, which makes them challenging to deploy. Academia and industry have proposed many CNN-specific acceleration methods for various application scenarios, and numerous FPGA (Field Programmable Gate Array) and ASIC (Application Specific Integrated Circuit) accelerators have been designed to accelerate CNNs in edge and data center scenarios. Compared with ASICs, FPGAs offer higher flexibility and faster development iteration, which makes them well suited to spaceborne scenarios. In this paper, we propose a microinstruction-driven CNN accelerator for RSI processing on FPGA. The accelerator is designed jointly in software and hardware and mainly optimizes microinstruction encoding, instruction-level parallelism (coarse-grained parallelism), and operation-level parallelism (fine-grained parallelism) under the limited storage bandwidth and computing resources available on satellites. At the software level, we propose an extensible microinstruction encoding format and a corresponding compilation method (the micro assembler). A microinstruction code covers 14 instructions of 4 types, which schedule the dataflow between the different components of the accelerator. The micro assembler performs graph-level optimization on the CNN topology through convolution loop tiling and operator fusion, and then generates microinstruction sequences that the accelerator can execute. At the hardware level, we design and implement an RTL (Register Transfer Level) CNN accelerator composed mainly of a microcontroller and a logic operator. The microcontroller executes the different types of instructions in parallel through a 5-stage coarse-grained pipeline (Data Load, Data Fetch, Compute, Post Process, Write Back). The logic operator is a computing array of cascaded DSP48E1 hard-core resources that executes convolution operations in parallel through a 32-stage fine-grained pipeline. Once the pipeline is filled, the logic operator completes 32×32 MAC (multiply-accumulate) operations per clock cycle. We evaluate the accelerator on the Xilinx VX690T FPGA, a chip commonly found on satellites. The design power consumption is 10.68 W, the RMT (Runtime Max Throughput) reaches 378.63 GOP/s, and the ME (MAC Efficiency) reaches 91.5%. When the accelerator is used as a coprocessor to accelerate the CNN object detection algorithm YOLOV3Tiny, the average accuracy on the RSI data set reaches 0.9 and the detection speed reaches 102 frames/s. The evaluation results show that our accelerator is 14 times more energy-efficient than a typical GPU acceleration method and improves ME by more than 6.9% compared with other FPGA accelerators.
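The reported figures can be combined into a few back-of-envelope quantities. The Python sketch below is only an illustration: it assumes that the 10.68 W design power applies while the accelerator runs at the reported throughput and frame rate, and that one MAC counts as two operations. The clock frequency is not stated in the abstract, so `peak_gops` takes it as a hypothetical parameter rather than assuming a value.

```python
# Figures taken from the abstract.
POWER_W = 10.68            # design power, watts
RMT_GOPS = 378.63          # Runtime Max Throughput, GOP/s
FPS = 102                  # YOLOV3Tiny detection speed, frames/s
MACS_PER_CYCLE = 32 * 32   # MACs completed per clock cycle once the pipeline is filled

# Energy efficiency in GOP/s per watt (assumes the design power is drawn
# while the accelerator delivers the reported throughput).
energy_eff_gops_per_w = RMT_GOPS / POWER_W

# Energy spent per detected frame in joules (same power assumption).
energy_per_frame_j = POWER_W / FPS

# Operations per cycle, counting one MAC as a multiply plus an add (assumption).
ops_per_cycle = 2 * MACS_PER_CYCLE


def peak_gops(clock_mhz):
    """Theoretical peak throughput of the MAC array in GOP/s.

    clock_mhz is a hypothetical input: the abstract does not give the
    clock frequency of the design.
    """
    return ops_per_cycle * clock_mhz / 1000.0


print(f"energy efficiency: {energy_eff_gops_per_w:.2f} GOP/s/W")
print(f"energy per frame:  {energy_per_frame_j:.4f} J")
print(f"ops per cycle:     {ops_per_cycle}")
```

Under these assumptions the accelerator delivers roughly 35 GOP/s per watt and spends about 0.1 J per detected frame.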
Authors
郭子博
刘凯
胡航天
李奕铎
璩泽旭
GUO Zi-Bo; LIU Kai; HU Hang-Tian; LI Yi-Duo; QU Ze-Xu (School of Computer Science and Technology, Xidian University, Xi’an 710000; CAST-Xi’an Institute of Space Radio Technology, Xi’an 710000)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2022, No. 10, pp. 2047-2064 (18 pages)
Chinese Journal of Computers
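Funding
Supported by the National Natural Science Foundation of China (62171342, 61850410523)
and the Space TT&C and Communication Innovation Exploration Fund (201701B).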