Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A5A8026986); by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01304, Development of Self-learnable Mobile Recursive Neural Network Processor Technology); by the MSIT (Ministry of Science and ICT), Korea, under the Grand Information Technology Research Center support program (IITP-2023-2020-0-01462), supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1061314).
Abstract: This paper presents the architecture of a Convolutional Neural Network (CNN) accelerator based on a new processing element (PE) array called a diagonal cyclic array (DCA). As demonstrated, it significantly reduces the burden of repeated memory accesses for the feature data and weight parameters of CNN models, which maximizes the data reuse rate and improves the computation speed. Furthermore, an integrated computation architecture has been implemented for max-pooling and the activation function applied after the convolution calculation, reducing the hardware resources. To evaluate the effectiveness of the proposed architecture, a CNN accelerator has been implemented for You Only Look Once version 2 (YOLOv2)-Tiny, which consists of 9 layers. A methodology for optimizing the local buffer size with little sacrifice of inference speed is also presented. We implemented the proposed CNN accelerator on a Xilinx Zynq ZCU102 UltraScale+ Field Programmable Gate Array (FPGA) using the ISE Design Suite. The FPGA implementation uses 34,336 Look-Up Tables (LUTs), 576 Digital Signal Processing (DSP) blocks, and an on-chip memory of only 58 KB. It achieves accuracies of 57.92% and 56.42% mean Average Precision at an intersection-over-union threshold of 0.5 (mAP@0.5) using quantized 16-bit and 8-bit full-integer data manipulation, respectively, with an accuracy loss of only 0.68% for the 8-bit version, and per-image computation times of 137.9 ms and 69 ms, respectively, at a clock speed of 200 MHz. These speeds are expected to improve roughly fivefold at a clock speed of 1 GHz if the design is implemented in a silicon System on Chip (SoC) using a sub-micron process.
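The 16-bit and 8-bit "full-integer" results refer to integer-quantized weights and activations. As a minimal sketch of what such quantization involves, the Python below implements generic symmetric per-tensor quantization; the function names and the per-tensor scheme are illustrative assumptions, not the quantization method used in the paper.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, n_bits: int):
    """Symmetric per-tensor quantization of a float tensor to signed n-bit integers.

    Returns the integer tensor and the scale needed to dequantize it.
    """
    qmax = 2 ** (n_bits - 1) - 1              # 127 for 8-bit, 32767 for 16-bit
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integer tensor back to floating point."""
    return q.astype(np.float32) * scale

# Example: quantize a hypothetical 3x3 convolution kernel to 8 bits
# and measure the round-trip quantization error.
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3, 16, 32)).astype(np.float32)
q8, s8 = quantize_symmetric(w, 8)
err = np.max(np.abs(dequantize(q8, s8) - w))
print(f"scale={s8:.6f}, max round-trip error={err:.6f}")
```

With 8 bits the representable grid is coarser than with 16 bits, which is consistent with the small mAP@0.5 drop the abstract reports for the 8-bit version.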
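The projected fivefold speedup at 1 GHz follows from simple linear clock scaling, assuming the cycle count per image stays unchanged (which the abstract implies but does not state):

```latex
t_{1\,\mathrm{GHz}} = t_{200\,\mathrm{MHz}} \times \frac{200\,\mathrm{MHz}}{1000\,\mathrm{MHz}}
                    = \frac{t_{200\,\mathrm{MHz}}}{5},
\qquad
\frac{137.9\,\mathrm{ms}}{5} \approx 27.6\,\mathrm{ms},
\qquad
\frac{69\,\mathrm{ms}}{5} = 13.8\,\mathrm{ms}.
```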