With the rapid development of deep learning algorithms,the computational complexity and functional diversity are increasing rapidly.However,the gap between high computational density and insufficient memory bandwidth ...With the rapid development of deep learning algorithms,the computational complexity and functional diversity are increasing rapidly.However,the gap between high computational density and insufficient memory bandwidth under the traditional von Neumann architecture is getting worse.Analyzing the algorithmic characteristics of convolutional neural network(CNN),it is found that the access characteristics of convolution(CONV)and fully connected(FC)operations are very different.Based on this feature,a dual-mode reronfigurable distributed memory architecture for CNN accelerator is designed.It can be configured in Bank mode or first input first output(FIFO)mode to accommodate the access needs of different operations.At the same time,a programmable memory control unit is designed,which can effectively control the dual-mode configurable distributed memory architecture by using customized special accessing instructions and reduce the data accessing delay.The proposed architecture is verified and tested by parallel implementation of some CNN algorithms.The experimental results show that the peak bandwidth can reach 13.44 GB·s^(-1)at an operating frequency of 120 MHz.This work can achieve 1.40,1.12,2.80 and 4.70 times the peak bandwidth compared with the existing work.展开更多
The reconfigurable chip,which integrates the advantages of high performance,high flexibility,high parallelism,low power consumption,and low cost,has achieved rapid development and wide application.Generally,the contro...The reconfigurable chip,which integrates the advantages of high performance,high flexibility,high parallelism,low power consumption,and low cost,has achieved rapid development and wide application.Generally,the control part and the computing part of algorithm is accelerated based on different reconfigurable architectures,but it is difficult to obtain overall performance improvement.For improving efficiency of reconfigurable structure both for the control part and the computing part,a hybrid of instruction-driven and data-driven self-reconfigurable cell array is proposed.On instruction-driven mode,processing element(PE)works like a reduced instruction set computer(RSIC)machine,which is mainly for the control part of algorithm.On data-driven mode,data is calculated by flowing between the preconfigured PEs,which is mainly for the computing of algorithm.For verifying the efficiency of architecture,some high-efficiency video coding(HEVC)video compression algorithms are implemented on the proposed architecture.The proposed architecture has been implemented on Xilinx FPGA Virtex UltraScale VU440 develop board.The same circuitry is able to run at75 MHz.Compared with the architecture that only supports instruction-driven,the proposed architecture has better calculation efficiency.展开更多
去块滤波算法是高效视频编码标准(high-efficiency video coding,HEVC)的重要组成部分,专用硬件实现的去块滤波电路结构难以满足不断革新的算法需求,可重构计算兼具计算高效性和编程灵活性成为研究热点。基于指令流与数据流混合驱动可...去块滤波算法是高效视频编码标准(high-efficiency video coding,HEVC)的重要组成部分,专用硬件实现的去块滤波电路结构难以满足不断革新的算法需求,可重构计算兼具计算高效性和编程灵活性成为研究热点。基于指令流与数据流混合驱动可重构视频阵列处理器(reconfigurable video array processor,RVAP),提出一种可重构的HEVC编码去块滤波电路的并行化实现方法,依据数据流图分析实现去块滤波算法的最大化并行,提高计算效率;通过强/弱滤波方式的灵活切换,提高计算资源利用率。实验结果表明,所提方法在满足算法灵活切换和计算速度要求的同时,硬件资源减少了47.6%,时钟频率达167 MHz。展开更多
Computer vision(CV)is widely expected to be the next big thing in emerging applications.So many heterogeneous architectures for computer vision emerge.However,plenty of data need to be transferred between different st...Computer vision(CV)is widely expected to be the next big thing in emerging applications.So many heterogeneous architectures for computer vision emerge.However,plenty of data need to be transferred between different structures for heterogeneous architecture.The long data transfer delay becomes the mainly problem to limit the processing speed for computer vision applications.For reducing data transfer delay and fasting computer vision applications,a clustered data-driven array processor is proposed.A three-level pipelining processing element is designed which supports two-buffer data flow interface and 8 bits,16 bits,32 bits subtext parallel computation.At the same time,for accelerating transcendental function computation,a four-way shared pipelining transcendental function accelerator is designed,which is based on Y-intercept adjusted piecewise linear segment algorithm.A distributed shared memory structure based on unified addressing is also employed.To verify efficiency of architecture,some image processing algorithms are implemented on proposed architecture.Simultaneously the proposed architecture has been implemented on Xilinx ZC 706 development board.The same circuitry has been synthesized using SMIC 130 nm CMOS technology.The circuitry is able to run at 100 MHz.Area is 26.58 mm2.展开更多
Deep learning algorithms have been widely used in computer vision,natural language processing and other fields.However,due to the ever-increasing scale of the deep learning model,the requirements for storage and compu...Deep learning algorithms have been widely used in computer vision,natural language processing and other fields.However,due to the ever-increasing scale of the deep learning model,the requirements for storage and computing performance are getting higher and higher,and the processors based on the von Neumann architecture have gradually exposed significant shortcomings such as consumption and long latency.In order to alleviate this problem,large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model.A near-memory computing array architecture based on the shared buffer is proposed in this paper to improve system performance,which supports instructions with the characteristics of store-calculation integration,reducing the data movement between the processor and main memory.Through data reuse,the processing speed of the algorithm is further improved.The proposed architecture is verified and tested through the parallel realization of the convolutional neural network(CNN)algorithm.The experimental results show that at the frequency of 110 MHz,the calculation speed of a single convolution operation is increased by 66.64%on average compared with the CNN architecture that performs parallel calculations on field programmable gate array(FPGA).The processing speed of the whole convolution layer is improved by 8.81%compared with the reconfigurable array processor that does not support near-memory computing.展开更多
基金Supported by the National Key R&D Program of China(No.2022ZD0119001)the National Natural Science Foundation of China(No.61834005,61802304)+1 种基金the Education Department of Shaanxi Province(No.22JY060)the Shaanxi Provincial Key Research and Devel-opment Plan(No.2024GX-YBXM-100)。
文摘With the rapid development of deep learning algorithms,the computational complexity and functional diversity are increasing rapidly.However,the gap between high computational density and insufficient memory bandwidth under the traditional von Neumann architecture is getting worse.Analyzing the algorithmic characteristics of convolutional neural network(CNN),it is found that the access characteristics of convolution(CONV)and fully connected(FC)operations are very different.Based on this feature,a dual-mode reronfigurable distributed memory architecture for CNN accelerator is designed.It can be configured in Bank mode or first input first output(FIFO)mode to accommodate the access needs of different operations.At the same time,a programmable memory control unit is designed,which can effectively control the dual-mode configurable distributed memory architecture by using customized special accessing instructions and reduce the data accessing delay.The proposed architecture is verified and tested by parallel implementation of some CNN algorithms.The experimental results show that the peak bandwidth can reach 13.44 GB·s^(-1)at an operating frequency of 120 MHz.This work can achieve 1.40,1.12,2.80 and 4.70 times the peak bandwidth compared with the existing work.
基金Supported by the National Natural Science Foundation of China(No.61802304,61834005,61772417,61634004)the Shaanxi Province Key R&D Plan(No.2021GY-029).
文摘The reconfigurable chip,which integrates the advantages of high performance,high flexibility,high parallelism,low power consumption,and low cost,has achieved rapid development and wide application.Generally,the control part and the computing part of algorithm is accelerated based on different reconfigurable architectures,but it is difficult to obtain overall performance improvement.For improving efficiency of reconfigurable structure both for the control part and the computing part,a hybrid of instruction-driven and data-driven self-reconfigurable cell array is proposed.On instruction-driven mode,processing element(PE)works like a reduced instruction set computer(RSIC)machine,which is mainly for the control part of algorithm.On data-driven mode,data is calculated by flowing between the preconfigured PEs,which is mainly for the computing of algorithm.For verifying the efficiency of architecture,some high-efficiency video coding(HEVC)video compression algorithms are implemented on the proposed architecture.The proposed architecture has been implemented on Xilinx FPGA Virtex UltraScale VU440 develop board.The same circuitry is able to run at75 MHz.Compared with the architecture that only supports instruction-driven,the proposed architecture has better calculation efficiency.
文摘去块滤波算法是高效视频编码标准(high-efficiency video coding,HEVC)的重要组成部分,专用硬件实现的去块滤波电路结构难以满足不断革新的算法需求,可重构计算兼具计算高效性和编程灵活性成为研究热点。基于指令流与数据流混合驱动可重构视频阵列处理器(reconfigurable video array processor,RVAP),提出一种可重构的HEVC编码去块滤波电路的并行化实现方法,依据数据流图分析实现去块滤波算法的最大化并行,提高计算效率;通过强/弱滤波方式的灵活切换,提高计算资源利用率。实验结果表明,所提方法在满足算法灵活切换和计算速度要求的同时,硬件资源减少了47.6%,时钟频率达167 MHz。
基金the National Natural Science Foundation of China(No.61802304,61834005,61772417,61634004,61602377)Shaanxi Provincial Co-ordination Innovation Project of Science and Technology(No.2016KTZDGY02-04-02)+1 种基金Shaanxi Provincial Key R&D Plan(No.2017GY-060)Shaanxi International Science and Technology Cooperation Program(No.2018KW-006).
文摘Computer vision(CV)is widely expected to be the next big thing in emerging applications.So many heterogeneous architectures for computer vision emerge.However,plenty of data need to be transferred between different structures for heterogeneous architecture.The long data transfer delay becomes the mainly problem to limit the processing speed for computer vision applications.For reducing data transfer delay and fasting computer vision applications,a clustered data-driven array processor is proposed.A three-level pipelining processing element is designed which supports two-buffer data flow interface and 8 bits,16 bits,32 bits subtext parallel computation.At the same time,for accelerating transcendental function computation,a four-way shared pipelining transcendental function accelerator is designed,which is based on Y-intercept adjusted piecewise linear segment algorithm.A distributed shared memory structure based on unified addressing is also employed.To verify efficiency of architecture,some image processing algorithms are implemented on proposed architecture.Simultaneously the proposed architecture has been implemented on Xilinx ZC 706 development board.The same circuitry has been synthesized using SMIC 130 nm CMOS technology.The circuitry is able to run at 100 MHz.Area is 26.58 mm2.
基金Supported by the National Natural Science Foundation of China(No.61802304,61834005,61772417,61602377)the Shaanxi Province KeyR&D Plan(No.2021GY-029)。
文摘Deep learning algorithms have been widely used in computer vision,natural language processing and other fields.However,due to the ever-increasing scale of the deep learning model,the requirements for storage and computing performance are getting higher and higher,and the processors based on the von Neumann architecture have gradually exposed significant shortcomings such as consumption and long latency.In order to alleviate this problem,large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model.A near-memory computing array architecture based on the shared buffer is proposed in this paper to improve system performance,which supports instructions with the characteristics of store-calculation integration,reducing the data movement between the processor and main memory.Through data reuse,the processing speed of the algorithm is further improved.The proposed architecture is verified and tested through the parallel realization of the convolutional neural network(CNN)algorithm.The experimental results show that at the frequency of 110 MHz,the calculation speed of a single convolution operation is increased by 66.64%on average compared with the CNN architecture that performs parallel calculations on field programmable gate array(FPGA).The processing speed of the whole convolution layer is improved by 8.81%compared with the reconfigurable array processor that does not support near-memory computing.