Funding: The Academic Colleges and Universities Innovation Program 2.0 (No. BP0719013).
Abstract: To address the hardware deployment problem caused by the high computational complexity of convolutional layers and the limited hardware resources available for network inference, a look-up table (LUT)-based convolution architecture is built on a field-programmable gate array (FPGA) using integer multipliers and adder trees. The Winograd algorithm is used to optimize the convolution and its multiplications, which reduces the computational complexity. The LUT-based operator is further optimized to construct a processing unit (PE). Optimized storage streams improve memory access efficiency and relieve bandwidth constraints, and the data toggle rate is reduced to lower power consumption. The experimental results show that building the basic processing units with the Winograd algorithm significantly reduces the number of multipliers and accelerates hardware deployment, while time-division multiplexing of the processing units improves resource utilization. Under the experimental conditions, compared with the traditional convolution method, the architecture improves the use of computing resources by a factor of 2.25 and the peak throughput by a factor of 19.3. The LUT-based Winograd accelerator can effectively solve the deployment problem caused by limited hardware resources.
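As an illustration of how the Winograd algorithm trades multiplications for additions, the following minimal sketch implements the standard one-dimensional F(2,3) case: two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The function names are hypothetical, and the abstract does not state which tile size the accelerator uses. Nesting F(2,3) in two dimensions gives F(2×2, 3×3) with 16 multiplications instead of 36, a ratio of 2.25, which is at least consistent with the reported figure, although the abstract does not confirm that this is its origin.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over a 4-sample tile d, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on g, so it can be precomputed once per filter.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # The only four data-dependent multiplications.
    m0 = (d0 - d2) * u0
    m1 = (d1 + d2) * u1
    m2 = (d2 - d1) * u2
    m3 = (d1 - d3) * u3
    # Output transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]


def direct_f23(d, g):
    """Reference result: the same two outputs with 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

Both functions return the same pair of values for any 4-sample tile and 3-tap filter, up to floating-point rounding in the filter transform.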
Abstract: This brief proposes an area- and speed-efficient implementation of a symmetric finite impulse response (FIR) digital filter using a reduced parallel look-up table (LUT) distributed arithmetic (DA) approach. The complexity of realizing an FIR filter is dominated by the multiplier structure. This complexity grows with filter order, which increases area and power and reduces the speed of operation. Multiplierless conventional DA and decomposed DA designs improve the speed of operation over the multiply-accumulate approach. Both structures require B clock cycles to produce the filter output for an input width of B, which limits the speed of the DA structure. This limitation is addressed with parallel LUTs, a structure called high-speed DA FIR, at the expense of additional hardware cost. With a large number of taps, the number of LUTs and their size also become large. In the proposed method, exploiting the coefficient symmetry property reduces the number of LUTs in the decomposed DA form by a factor of about 2. The proposed approach is applied to the high-speed DA-based FIR design to obtain an area- and speed-efficient structure. For a 16-tap symmetric FIR filter implemented on a Xilinx Virtex-5 FPGA (device XC5VSX95T-1FF1136), the proposed design offers around 40% less area and 53.98% less slice-delay product (SDP) than the high-throughput DA-based structure. On the same FPGA device, the proposed design supports an input sampling frequency of up to 607 MHz and offers 60.5% more speed and 67.71% less SDP than the systolic DA-based design.
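The idea of bit-serial DA with coefficient symmetry can be sketched in software as follows. This is a minimal model under stated assumptions, not the brief's hardware structure: it uses a single LUT rather than the decomposed or parallel form, integer coefficients, an even number of taps, and signed two's-complement inputs; all function names are hypothetical. Pre-adding the sample pairs that share a coefficient halves the number of LUT address bits, at the cost of one extra bit of word width.

```python
def build_da_lut(coeffs):
    """LUT[a] = sum of coeffs[k] for every bit k that is set in address a."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << n)]


def da_dot(lut, words, width):
    """Bit-serial dot product of the LUT's coefficients with signed `width`-bit words."""
    acc = 0
    for b in range(width):              # one LUT access per input bit position
        addr = 0
        for k, w in enumerate(words):
            # Python integers act as infinite two's complement, so this
            # extracts bit b correctly for negative words as well.
            addr |= ((w >> b) & 1) << k
        term = lut[addr] << b
        # The most significant (sign) bit carries negative weight.
        acc += -term if b == width - 1 else term
    return acc


def symmetric_da_fir(window, coeffs, width):
    """One output sample of a symmetric FIR filter: sum of coeffs[k] * window[k]."""
    n = len(coeffs)
    assert n % 2 == 0 and coeffs == coeffs[::-1], "expects even-length symmetric taps"
    half = n // 2
    # Fold the window: samples sharing a coefficient are added first.
    folded = [window[k] + window[n - 1 - k] for k in range(half)]
    lut = build_da_lut(coeffs[:half])   # 2**(n/2) entries instead of 2**n
    return da_dot(lut, folded, width + 1)   # the pre-add needs one extra bit
```

The loop in `da_dot` mirrors the B clock cycles mentioned in the abstract: one table access per bit of the input word, with no multiplier anywhere in the data path.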
Funding: Supported by the National Natural Science Foundation of China (No. 60972064).
Abstract: In this paper, a Radio Over Fiber (ROF) system with a Digital Pre-Distorter (DPD) for WCDMA signal transmission is investigated. A Look-Up Table (LUT)-based DPD and a Memory Polynomial (MP) DPD are applied in the ROF link to suppress the out-of-band spurious spectrum and improve the transmission performance. The experimental results show that the out-of-band emission caused by third-order Inter-Modulation Distortion (IMD3) is clearly suppressed by both DPDs. An Adjacent Channel Power Ratio (ACPR) improvement of 8 dB is obtained for single-carrier WCDMA signal transmission. The two DPDs linearize the ROF system equally well for three-carrier WCDMA signal transmission. No apparent memory effect exists in the ROF link.
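A memoryless LUT-based predistorter can be sketched as a gain table indexed by input amplitude. The example below is only an illustration under assumptions that are not taken from the paper: the cubic compression model `pa_amplitude`, the bin count, the uniform amplitude indexing, and all function names are hypothetical, and the table is built offline by inverting the model rather than adapted from measurements.

```python
def pa_amplitude(r):
    """Hypothetical memoryless link model: cubic amplitude compression."""
    return r - 0.1 * r ** 3


def build_dpd_lut(bins, r_max=1.0):
    """Gain table: entry i makes pa_amplitude(gain * r_i) equal r_i at bin centre r_i."""
    lut = []
    for i in range(bins):
        target = (i + 0.5) * r_max / bins
        lo, hi = 0.0, 1.5              # pa_amplitude is increasing on this interval
        for _ in range(60):            # bisection on the predistorted amplitude
            mid = (lo + hi) / 2
            if pa_amplitude(mid) < target:
                lo = mid
            else:
                hi = mid
        lut.append(lo / target)
    return lut


def predistort(x, lut, r_max=1.0):
    """Scale the complex sample x by the gain stored for its amplitude bin."""
    idx = min(int(abs(x) / r_max * len(lut)), len(lut) - 1)
    return x * lut[idx]
```

Because the table depends only on the current sample's amplitude, a predistorter of this kind cannot correct memory effects; that limitation is what a memory polynomial DPD is meant to address, and the abstract's observation that the link shows no apparent memory effect is consistent with the two approaches performing equally.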
Abstract: A real-time mixing module for high-definition television (HDTV) data conforming to SMPTE 274M and PC video data is designed. The hardware implementation, algorithm, and simulation of the mixing module are given. To improve the data processing capability, an anti-fuse FPGA chip and a pipelined, modular design are adopted. With 6 parallel LUTs and a fast algorithm, the module can mix 4:2:2 component signals in the luminance and chrominance spaces separately in real time. According to the simulation, the module can mix uncompressed HDTV data with PC video data in real time, which current ASIC chips cannot do. Furthermore, the ideas behind the design allow it to be extended to multi-stage mixing. The mixing module can be widely used in HDTV production systems.