6 articles found
Efficient and Low-Latency Systolic Array Architecture for Full Searches in Block-Matching Motion Estimation
1
Authors: 张武健, 邱晓海, 周润德, 陈弘毅. Tsinghua Science and Technology (SCIE, EI, CAS), 2001, No. 4, pp. 361-368, 8 pages.
This paper describes an efficient, low-latency systolic array architecture for full searches in block-matching motion estimation. A conventional one-dimensional systolic array architecture is used to develop a novel ring-like systolic array architecture through operator rescheduling that exploits the symmetry of the data flow. The high latency caused by filling the array pipeline in the conventional architecture is eliminated. The new architecture delivers a higher throughput rate, achieves higher processor utilization, and has low power consumption. In addition, the minimum memory bandwidth of the conventional architecture is preserved.
Keywords: motion estimation, full search, systolic array, low latency, low power
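For readers unfamiliar with the search that this architecture accelerates, a minimal software sketch of full-search block matching follows. It assumes the sum of absolute differences (SAD) as the matching cost, which the abstract does not name, and the function names `sad` and `full_search` are illustrative. It says nothing about the ring-like systolic array itself; it only shows the exhaustive computation that such hardware would parallelize.

```python
def sad(cur_block, ref_frame, y, x):
    """Sum of absolute differences between cur_block and the
    same-sized window of ref_frame whose top-left corner is (y, x)."""
    return sum(abs(pixel - ref_frame[y + i][x + j])
               for i, row in enumerate(cur_block)
               for j, pixel in enumerate(row))


def full_search(cur_block, ref_frame, top, left, search_range):
    """Try every displacement (dy, dx) within +/- search_range of
    (top, left) and return the one with the lowest SAD.

    Returns ((dy, dx), best_sad); candidates that fall outside the
    reference frame are skipped.
    """
    n, m = len(cur_block), len(cur_block[0])
    h, w = len(ref_frame), len(ref_frame[0])
    best_mv, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + m > w:
                continue  # window leaves the reference frame
            cost = sad(cur_block, ref_frame, y, x)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dy, dx), cost
    return best_mv, best_sad


# Toy data: an 8x8 reference frame with unique pixel values and a
# 2x2 block copied from position (3, 4) of that frame.
ref = [[10 * i + j for j in range(8)] for i in range(8)]
block = [row[4:6] for row in ref[3:5]]
motion_vector, cost = full_search(block, ref, top=2, left=2, search_range=2)
```

Every candidate position is evaluated independently of the others, which is the regularity a systolic array can exploit.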
FPGA IMPLEMENTATION OF RSA PUBLIC-KEY CRYPTOGRAPHIC COPROCESSOR BASED ON SYSTOLIC LINEAR ARRAY ARCHITECTURE (cited by: 2)
2
Authors: Wen Nuan, Dai Zibin, Zhang Yongfu. Journal of Electronics (China), 2006, No. 5, pp. 718-722, 5 pages.
To make the typical Montgomery algorithm suitable for implementation on an FPGA, a modified version is proposed, and a high-performance systolic linear array architecture is then designed for the RSA cryptosystem on the basis of the optimized algorithm. The proposed systolic array architecture has distinctive features: the computation speed is significantly higher and the hardware overhead is drastically lower. As a major practical result, the paper shows that it is possible to implement a public-key cryptosystem at secure bit lengths on a single commercially available FPGA.
Keywords: RSA, Montgomery's algorithm, systolic linear array, modular multiplication, modular exponentiation
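As background for the two operations named in the keywords, the sketch below shows the textbook radix-2 (bit-serial) Montgomery multiplication and a square-and-multiply exponentiation built on it. This is the standard algorithm, not the modified version the paper proposes, which the abstract does not spell out. The names `mont_mul` and `mont_pow` are illustrative.

```python
def mont_mul(a, b, n, k):
    """Return a * b * 2**(-k) mod n.

    Assumes n is odd and 0 <= a, b < n < 2**k. Each loop iteration
    consumes one bit of a, so the loop maps naturally onto a linear
    array of identical cells.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b when bit i of a is set
        if s & 1:                 # make s even by adding the odd modulus
            s += n
        s >>= 1                   # exact division by 2
    if s >= n:                    # s < 2n here, one subtraction suffices
        s -= n
    return s


def mont_pow(base, exp, n):
    """Return base**exp mod n for odd n > 1 using Montgomery products."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)              # R^2 mod n, with R = 2**k
    x = mont_mul(base % n, r2, n, k)   # base in Montgomery form
    acc = mont_mul(1, r2, n, k)        # 1 in Montgomery form (R mod n)
    for bit in bin(exp)[2:]:           # left-to-right square and multiply
        acc = mont_mul(acc, acc, n, k)
        if bit == "1":
            acc = mont_mul(acc, x, n, k)
    return mont_mul(acc, 1, n, k)      # leave Montgomery form


result = mont_pow(4, 13, 497)
```

The inner loop needs only additions and one-bit shifts, with no trial division by the modulus, which is why the method suits hardware implementation.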
CNN Accelerator Using Proposed Diagonal Cyclic Array for Minimizing Memory Accesses
3
Authors: Hyun-Wook Son, Ali A. Al-Hamid, Yong-Seok Na, Dong-Yeong Lee, Hyung-Won Kim. Computers, Materials & Continua (SCIE, EI), 2023, No. 8, pp. 1665-1687, 23 pages.
This paper presents the architecture of a Convolutional Neural Network (CNN) accelerator based on a new processing element (PE) array called a diagonal cyclic array (DCA). As demonstrated, it can significantly reduce the burden of repeated memory accesses for the feature data and weight parameters of CNN models, which maximizes the data reuse rate and improves the computation speed. Furthermore, an integrated computation architecture has been implemented for the activation function, max-pooling, and activation function after convolution calculation, reducing the hardware resources. To evaluate the effectiveness of the proposed architecture, a CNN accelerator has been implemented for You Only Look Once version 2 (YOLOv2)-Tiny, consisting of 9 layers. A methodology for optimizing the local buffer size with little sacrifice of inference speed is also presented. The proposed CNN accelerator was implemented using a Xilinx Zynq ZCU102 UltraScale+ Field Programmable Gate Array (FPGA) and ISE Design Suite. The FPGA implementation uses 34,336 Look Up Tables (LUTs), 576 Digital Signal Processing (DSP) blocks, and only 58 KB of on-chip memory. It achieves mean Average Precision at a 0.5 intersection-over-union threshold (mAP@0.5) of 57.92% and 56.42% using quantized 16-bit and 8-bit full-integer data manipulation, respectively, with a loss of only 0.68% for the 8-bit version, and computation times of 137.9 ms and 69 ms per input image, respectively, at a clock speed of 200 MHz. These speeds are expected to increase fivefold at a clock speed of 1 GHz if the design is implemented in a silicon System on Chip (SoC) using a sub-micron process.
Keywords: CNN, accelerator, systolic array, memory optimization, YOLOv2-tiny, mAP@0.5
Low-complexity systolic architecture for inversion
4
Authors: 袁丹寿, Rong Mengtian. High Technology Letters (EI, CAS), 2006, No. 4, pp. 413-416, 4 pages.
A modified extended binary Euclid's algorithm, which is more regularly iterative, is presented for computing an inversion in GF(2^m). Based on the modified algorithm, a serial-in serial-out architecture is proposed. It has an area complexity of O(m), a latency of 5m - 2, and a throughput of 1/m. Compared with other serial systolic architectures, the proposed one has the smallest area complexity and a shorter latency. It is highly regular and modular, and thus well suited for high-speed VLSI design.
Keywords: VLSI, inversion, systolic array, finite field
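To make the operation concrete, the following sketch implements the standard binary (extended Euclid style) inversion in GF(2^m), with field elements stored as Python integers whose bits are polynomial coefficients. It is the well-known baseline algorithm, not the modified variant proposed in the paper, whose details the abstract omits. The helper `gf2m_mul` and the choice of the AES polynomial x^8 + x^4 + x^3 + x + 1 for the check are assumptions made for illustration.

```python
def gf2m_mul(a, b, f, m):
    """Multiply a and b in GF(2^m) modulo the irreducible polynomial f
    (shift-and-add, reducing whenever the degree reaches m)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def gf2m_inv(a, f):
    """Return the inverse of nonzero a in GF(2^m) defined by f.

    Invariants: g1 * a == u (mod f) and g2 * a == v (mod f).
    Only shifts, XORs and degree comparisons are used.
    """
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    u, v = a, f
    g1, g2 = 1, 0
    while u != 1 and v != 1:
        while (u & 1) == 0:              # x divides u
            u >>= 1
            g1 = (g1 >> 1) if (g1 & 1) == 0 else ((g1 ^ f) >> 1)
        while (v & 1) == 0:              # x divides v
            v >>= 1
            g2 = (g2 >> 1) if (g2 & 1) == 0 else ((g2 ^ f) >> 1)
        if u.bit_length() > v.bit_length():
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2


F = 0x11B  # x^8 + x^4 + x^3 + x + 1, used here only as a test field
inverse = gf2m_inv(0x53, F)
product = gf2m_mul(0x53, inverse, F, 8)
```

The data-dependent inner loops in this baseline are the kind of irregularity that a "more regularly iterative" formulation would aim to remove for a systolic design.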
A High Speed Signal Processing Machine - Its Architecture, Language and Compiler
5
Authors: Wang Yufei and Yu Shiqi (Beijing Institute of Data Processing Technology, P.O. Box 3927, Beijing 100039, China). Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 1991, No. 1, pp. 119-128, 10 pages.
A systolic array architecture computer (FXCQ) has been designed for signal processing. It can handle floating-point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has a peak performance of 20 million floating-point operations per second (20 MFLOPS), so the machine has a peak performance of 16 × 20 = 320 MFLOPS. It is integrated as an attached processor into a host system through a VME bus interface. Programs for FXCQ are written in a high-level language, the B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, the B language, and its compiler.
Keywords: parallel processing, systolic array processor, parallel language, compiler
A Low Power/Area Digital FIR Filter Design Based on PRF Framework
6
Authors: 王栋, Wang Wei, Xu Xiaoming. High Technology Letters (EI, CAS), 2002, No. 3, pp. 57-61, 5 pages.
A novel DSP-to-ASIC (Application Specific Integrated Circuit) architecture design methodology is presented in this paper for reducing power and area consumption. Traditional methods focus on optimizing either the hardware structure or the algorithm separately. The authors propose a new method, called the PRF (Paralleling, Reducing, Folding) framework, that combines hardware optimization with algorithm simplification. In the first step, paralleling, unfolding is applied to divide one data path into several channels and expose the redundancy of the algorithm. In the second step, reducing, decoupling theory is used to reduce computational complexity. In the last step, folding, time multiplexing is used to merge similar components. Because it is an open methodology framework, many optimization methods can be integrated into the PRF framework. The PRF framework is applied to optimize a 3N-tap FIR (Finite Impulse Response) filter and obtains a satisfactory result.
Keywords: ASIC architecture, systolic array, paralleling, reducing, folding, power/area optimization
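The paralleling step described above can be illustrated with the standard two-channel (polyphase) split of an FIR filter. The sketch below is only that textbook decomposition, under the assumption of an even-length input and zero initial state. It does not reproduce the paper's reducing or folding steps, and the function names are illustrative.

```python
def fir(h, x):
    """Direct-form causal FIR filter: y[n] = sum_k h[k] * x[n - k],
    with samples before the start of x treated as zero."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_two_parallel(h, x):
    """Compute the same output as fir(h, x) with two channels.

    Even and odd samples are processed by four half-length subfilters:
        y[2k]   = (H0 * x0)[k] + (H1 * x1)[k - 1]
        y[2k+1] = (H0 * x1)[k] + (H1 * x0)[k]
    where H0, H1 are the even and odd taps and x0, x1 the even and odd
    input samples. Assumes len(x) is even.
    """
    h0, h1 = h[0::2], h[1::2]
    x0, x1 = x[0::2], x[1::2]
    h0x0, h1x1 = fir(h0, x0), fir(h1, x1)
    h0x1, h1x0 = fir(h0, x1), fir(h1, x0)
    y = []
    for k in range(len(x0)):
        delayed = h1x1[k - 1] if k > 0 else 0   # one-sample delay on H1*x1
        y.append(h0x0[k] + delayed)             # even output sample
        y.append(h0x1[k] + h1x0[k])             # odd output sample
    return y


taps = [1, 2, 3, 4, 5, 6]
signal = list(range(1, 11))
direct = fir(taps, signal)
parallel = fir_two_parallel(taps, signal)
```

The split form performs as many multiplications as the direct form. Its value is that the four subfilters expose shared structure, which a later reduction step could exploit.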