This paper describes an efficient, low latency systolic array architecture for full searches in block matching motion estimation. Conventional one dimensional systolic array architecture is used to develop a nove...This paper describes an efficient, low latency systolic array architecture for full searches in block matching motion estimation. Conventional one dimensional systolic array architecture is used to develop a novel ring like systolic array architecture through operator rescheduling considering the symmetry of the data flow. High latency delay due to stuffing of the array pipeline in the conventional architecture was eliminated. The new architecture delivers a higher throughput rate, achieves higher processor utilization, and has low power consumption. In addition, the minimum memory bandwidth of the conventional architecture is preserved.展开更多
In order to make the typical Montgomery’s algorithm suitable for implementation on FPGA, a modified version is proposed and then a high-performance systolic linear array architecture is designed for RSA cryptosystem ...In order to make the typical Montgomery’s algorithm suitable for implementation on FPGA, a modified version is proposed and then a high-performance systolic linear array architecture is designed for RSA cryptosystem on the basis of the optimized algorithm. The proposed systolic array architecture has dis- tinctive features, i.e. not only the computation speed is significantly fast but also the hardware overhead is drastically decreased. As a major practical result, the paper shows that it is possible to implement public-key cryptosystem at secure bit lengths on a single commercially available FPGA.展开更多
This paper presents the architecture of a Convolution Neural Network(CNN)accelerator based on a newprocessing element(PE)array called a diagonal cyclic array(DCA).As demonstrated,it can significantly reduce the burden...This paper presents the architecture of a Convolution Neural Network(CNN)accelerator based on a newprocessing element(PE)array called a diagonal cyclic array(DCA).As demonstrated,it can significantly reduce the burden of repeated memory accesses for feature data and weight parameters of the CNN models,which maximizes the data reuse rate and improve the computation speed.Furthermore,an integrated computation architecture has been implemented for the activation function,max-pooling,and activation function after convolution calculation,reducing the hardware resource.To evaluate the effectiveness of the proposed architecture,a CNN accelerator has been implemented for You Only Look Once version 2(YOLOv2)-Tiny consisting of 9 layers.Furthermore,the methodology to optimize the local buffer size with little sacrifice of inference speed is presented in this work.We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 Ultrascale+Field Programmable Gate Array(FPGA)and ISE Design Suite.The FPGA implementation uses 34,336 Look Up Tables(LUTs),576 Digital Signal Processing(DSP)blocks,and an on-chip memory of only 58 KB,and it could achieve accuracies of 57.92% and 56.42% mean Average Precession@0.5 thresholds for intersection over union(mAP@0.5)using quantized 16-bit and 8-bit full integer data manipulation with only 0.68% as a loss for 8-bit version and computation time of 137.9 and 69 ms for each input image respectively using a clock speed of 200 MHz.These speeds are expected to be doubled five times using a clock speed of 1GHz if implemented in a silicon System on Chip(SoC)using a sub-micron process.展开更多
A modified extended binary Euclid' s algorithm which is more regularly iterative for computing an inversion in GF(2^m) is presented. Based on above modified algorithm, a serial-in serial-out architecture is propose...A modified extended binary Euclid' s algorithm which is more regularly iterative for computing an inversion in GF(2^m) is presented. Based on above modified algorithm, a serial-in serial-out architecture is proposed. It has area complexity of O(m), latency of 5m - 2, and throughput of 1/m. Compared with other serial systolic arehiteetures, the proposed one has the smallest area complexity, shorter latency. It is highly regular, modular, and thus well suited for high-speed VLSI design.展开更多
A systolic array architecture computer (FXCQ) has been designed for signal processing. R can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly...A systolic array architecture computer (FXCQ) has been designed for signal processing. R can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has the peak performance of 20 million floating-point operations per second (20MFLOPS). The machine therefore has a peak performance of 320 M FLOPS. It is integrated as an attached processor into a host system through VME bus interface. Programs for FXCQ are written in a high-level language -B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, B language and its compiler.展开更多
A novel DSP to ASIC (Application Specific Integrated Circuit) architecture design methodology is presented in this paper for reducing power/area consumption. Traditional methods always focus on optimizing hardware str...A novel DSP to ASIC (Application Specific Integrated Circuit) architecture design methodology is presented in this paper for reducing power/area consumption. Traditional methods always focus on optimizing hardware structure or algorithm separately. The authors propose a new method called PRF (Paralleling Reducing Folding) framework to combine hardware optimization with algorithm simplification. In the first step, paralleling, unfolding technology is applied to divide one data path into several channels and expose the redundancy of the algorithm. In the second step, reducing, decoupling theory is used to reduce computational complexity. In the last step, folding, time multiplexing method is used to merge similar components. As an exoteric methodology framework, many optimization methods can be integrated into the PRF framework. To optimize a 3N taps FIR (Fincte Impact Response) and obtain a content result, PRF methodology framework is applied.展开更多
文摘This paper describes an efficient, low latency systolic array architecture for full searches in block matching motion estimation. Conventional one dimensional systolic array architecture is used to develop a novel ring like systolic array architecture through operator rescheduling considering the symmetry of the data flow. High latency delay due to stuffing of the array pipeline in the conventional architecture was eliminated. The new architecture delivers a higher throughput rate, achieves higher processor utilization, and has low power consumption. In addition, the minimum memory bandwidth of the conventional architecture is preserved.
文摘In order to make the typical Montgomery’s algorithm suitable for implementation on FPGA, a modified version is proposed and then a high-performance systolic linear array architecture is designed for RSA cryptosystem on the basis of the optimized algorithm. The proposed systolic array architecture has dis- tinctive features, i.e. not only the computation speed is significantly fast but also the hardware overhead is drastically decreased. As a major practical result, the paper shows that it is possible to implement public-key cryptosystem at secure bit lengths on a single commercially available FPGA.
基金supported by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.2022R1A5A8026986)supported by the Institute of Information&communications Technology Planning&Evaluation(IITP)grant funded by the Korean government(MSIT)(No.2020-0-01304,Development of Self-learnable Mobile Recursive Neural Network Processor Technology)supported by the MSIT(Ministry of Science and ICT),Korea,under the Grand Information Technology Research Center support program(IITP-2023-2020-0-01462)'supervised by the IITP(Institute for Information&communications Technology Planning&Evaluation)and supported by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.2021R1F1A1061314).
文摘This paper presents the architecture of a Convolution Neural Network(CNN)accelerator based on a newprocessing element(PE)array called a diagonal cyclic array(DCA).As demonstrated,it can significantly reduce the burden of repeated memory accesses for feature data and weight parameters of the CNN models,which maximizes the data reuse rate and improve the computation speed.Furthermore,an integrated computation architecture has been implemented for the activation function,max-pooling,and activation function after convolution calculation,reducing the hardware resource.To evaluate the effectiveness of the proposed architecture,a CNN accelerator has been implemented for You Only Look Once version 2(YOLOv2)-Tiny consisting of 9 layers.Furthermore,the methodology to optimize the local buffer size with little sacrifice of inference speed is presented in this work.We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 Ultrascale+Field Programmable Gate Array(FPGA)and ISE Design Suite.The FPGA implementation uses 34,336 Look Up Tables(LUTs),576 Digital Signal Processing(DSP)blocks,and an on-chip memory of only 58 KB,and it could achieve accuracies of 57.92% and 56.42% mean Average Precession@0.5 thresholds for intersection over union(mAP@0.5)using quantized 16-bit and 8-bit full integer data manipulation with only 0.68% as a loss for 8-bit version and computation time of 137.9 and 69 ms for each input image respectively using a clock speed of 200 MHz.These speeds are expected to be doubled five times using a clock speed of 1GHz if implemented in a silicon System on Chip(SoC)using a sub-micron process.
文摘A modified extended binary Euclid' s algorithm which is more regularly iterative for computing an inversion in GF(2^m) is presented. Based on above modified algorithm, a serial-in serial-out architecture is proposed. It has area complexity of O(m), latency of 5m - 2, and throughput of 1/m. Compared with other serial systolic arehiteetures, the proposed one has the smallest area complexity, shorter latency. It is highly regular, modular, and thus well suited for high-speed VLSI design.
文摘A systolic array architecture computer (FXCQ) has been designed for signal processing. R can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has the peak performance of 20 million floating-point operations per second (20MFLOPS). The machine therefore has a peak performance of 320 M FLOPS. It is integrated as an attached processor into a host system through VME bus interface. Programs for FXCQ are written in a high-level language -B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, B language and its compiler.
文摘A novel DSP to ASIC (Application Specific Integrated Circuit) architecture design methodology is presented in this paper for reducing power/area consumption. Traditional methods always focus on optimizing hardware structure or algorithm separately. The authors propose a new method called PRF (Paralleling Reducing Folding) framework to combine hardware optimization with algorithm simplification. In the first step, paralleling, unfolding technology is applied to divide one data path into several channels and expose the redundancy of the algorithm. In the second step, reducing, decoupling theory is used to reduce computational complexity. In the last step, folding, time multiplexing method is used to merge similar components. As an exoteric methodology framework, many optimization methods can be integrated into the PRF framework. To optimize a 3N taps FIR (Fincte Impact Response) and obtain a content result, PRF methodology framework is applied.