With the continuous development of deep learning, the Deep Convolutional Neural Network (DCNN) has attracted wide attention in industry due to its high accuracy in image classification. Compared with other DCNN hardware deployment platforms, the Field Programmable Gate Array (FPGA) offers programmability, low power consumption, parallelism, and low cost. However, the enormous amount of computation in a DCNN and the limited logic capacity of the FPGA restrict the energy efficiency of DCNN accelerators. The traditional sequential sliding window method can improve the throughput of a DCNN accelerator through data multiplexing, but its data multiplexing rate is low because it repeatedly reads the data shared between rows. This paper proposes a fast data readout strategy based on a circular sliding window data reading method, which improves the multiplexing rate of data between rows by optimizing the memory access order of the input data. In addition, the multiplication bit width of the DCNN accelerator is much smaller than that of the Digital Signal Processing (DSP) blocks on the FPGA, so resources are wasted if each multiplication occupies a single DSP. A multiplier sharing strategy is therefore proposed: the multiplier of the accelerator is customized so that a single DSP block can complete multiple groups of 4-, 6-, and 8-bit signed multiplications in parallel. Finally, based on the two strategies above, an FPGA-optimized accelerator is proposed. The accelerator is implemented in Verilog and deployed on a Xilinx VCU118. When the accelerator recognizes the CIFAR-10 dataset, its energy efficiency is 39.98 GOPS/W, a 1.73× improvement over previous DCNN FPGA accelerators. When the accelerator recognizes the ImageNet dataset, its energy efficiency is 41.12 GOPS/W, which is 1.28× to 3.14× that of other accelerators.
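The multiplier sharing idea can be illustrated with a small sketch. It is not the paper's circuit: it assumes the common packing trick in which two narrow signed operands that share a multiplicand are placed in disjoint bit fields of one wide operand, so that a single wide multiplication yields both products. The names `to_signed` and `packed_multiply` are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def packed_multiply(a, b, c, width=8):
    """Return (a*c, b*c) using a single wide multiplication.

    Assumption: a, b, c are signed `width`-bit integers and the wide
    multiplier (a DSP block in hardware) can hold the packed operand.
    """
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    for v in (a, b, c):
        if not lo <= v <= hi:
            raise ValueError(f"{v} does not fit in {width} signed bits")

    shift = 2 * width                 # each product fits in 2*width signed bits
    packed = (a << shift) + b         # a in the upper field, b in the lower field
    wide = packed * c                 # the one shared multiplication

    low = to_signed(wide, shift)      # lower field, sign-corrected: b*c
    high = (wide - low) >> shift      # remove the lower product, then shift: a*c
    return high, low


print(packed_multiply(-3, 7, 5))      # (-15, 35)
```

The subtraction of `low` before shifting is the sign correction: a negative lower product borrows from the upper field, and ignoring that borrow would leave the upper result off by one.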
We design a reconfigurable pipelined multiplier embedded in an FPGA. The design is based on the modified Booth algorithm and performs 18 × 18 signed or 17 × 17 unsigned multiplication. We propose a novel circuit optimization method that reduces the number of partial products. A new layout floorplan for the multiplier block is reported that complies with the constraints imposed by the tile-based FPGA chip design. The multiplier can be configured as synchronous or asynchronous, and its operation can also be pipelined for high-frequency operation. The design can be easily extended to different input and output bit widths. We employ a novel carry look-ahead adder circuit to generate the final product. Transmission-gate logic is used for the low-level circuits throughout the multiplier for fast logic operations. The multiplier block is designed in SMIC 0.13 μm CMOS technology using a full-custom design methodology. The operation of the 18 × 18 multiplier takes 4.1 ns, and the two-stage pipelined operation cycle is 2.5 ns. This is 29.1% faster than the commercial multiplier and 17.5% faster than the multipliers reported in other academic designs. Compared with a distributed LUT-based multiplier, it demonstrates an area efficiency ratio of 33 : 1.
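For readers unfamiliar with the modified Booth algorithm, the following is a minimal software sketch of textbook radix-4 Booth recoding. It does not reproduce the paper's additional partial-product optimization, and the function names are hypothetical. Recoding an 18-bit signed multiplier yields 9 digits in {-2, -1, 0, 1, 2}, so 9 partial products are summed instead of 18.

```python
def booth_radix4_digits(multiplier, bits=18):
    """Recode a signed `bits`-bit multiplier into radix-4 Booth digits, LSB first."""
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier out of range")

    y = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern
    y <<= 1                              # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triplet = (y >> i) & 0b111       # overlapping bits y[i+1], y[i], y[i-1]
        y_prev = triplet & 1
        y_cur = (triplet >> 1) & 1
        y_next = (triplet >> 2) & 1
        digits.append(y_prev + y_cur - 2 * y_next)
    return digits


def booth_multiply(x, y, bits=18):
    """Multiply by summing one shifted partial product per Booth digit."""
    digits = booth_radix4_digits(y, bits)
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))


print(booth_radix4_digits(6, bits=4))    # [-2, 2], since -2 + 2*4 = 6
print(len(booth_radix4_digits(-12345)))  # 9 partial products for 18 bits
```

In hardware each digit selects 0, ±x, or ±2x, all of which are obtained by shifting and negating, so no true multiplication is needed to form a partial product.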
Improving the digital FIR filter is vital in the field of Digital Signal Processing in order to reduce area, delay, and power. The Multiplication and Accumulation (MAC) unit of a Finite Impulse Response (FIR) filter has been designed using efficient multiplier and adder circuits to optimize the APT (Area, Power, and Timing) product. This paper presents the design of a direct-form FIR filter with an efficient MAC unit. First, the full adder and half adder structures are reduced in size by lowering the number of gates. These compact full adder and half adder structures are then incorporated into a Wallace Multiplier and an Improved Carry-Save Adder. The proposed 16-bit Carry-Save Adder is improved by splitting it into four parallel phases, which reduces the delay of the enhanced Carry-Save Adder. The carry output is generated using a number of OR gates in a sequential manner. All of these enhanced architectures are incorporated into the digital FIR filter to reduce its area, delay, and power consumption.
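As a rough illustration of the two building blocks named here, the sketch below models a generic carry-save adder (a 3:2 compressor applied bitwise) and uses it to accumulate the tap products of a direct-form FIR filter. It assumes non-negative integer samples and coefficients, does not model the paper's reduced-gate adders or its four-phase split, and all function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """Compress three operands into a sum word and a carry word.

    No carry propagates between bit positions; a + b + c == sum + carry.
    """
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (b & c) | (a & c)) << 1
    return sum_bits, carry_bits


def csa_sum(operands):
    """Add many operands with carry-save stages and one final ordinary add."""
    operands = list(operands)
    while len(operands) > 2:
        s, c = carry_save_add(*operands[:3])
        operands = operands[3:] + [s, c]
    return sum(operands)               # single carry-propagating addition


def fir_direct_form(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k]."""
    taps = [0] * len(coeffs)           # delay line, newest sample first
    output = []
    for x in samples:
        taps = [x] + taps[:-1]
        output.append(csa_sum(h * t for h, t in zip(coeffs, taps)))
    return output


print(fir_direct_form([1, 2, 3, 4], [1, 1, 1]))   # [1, 3, 6, 9]
```

Deferring carry propagation to a single final addition is what makes carry-save accumulation attractive for MAC units: the delay of each intermediate stage is independent of the word length.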
Funding: Supported in part by the Major Program of the Ministry of Science and Technology of China under Grant 2019YFB2205102 and in part by the National Natural Science Foundation of China under Grants 61974164, 62074166, 61804181, 62004219, 62004220, and 62104256.