Convolutional neural networks (CNNs) are essential models for achieving high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the key issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse that exploits the inherent data locality. In this paper, we propose a novel CGRA (Coarse-Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. Multithreading on each processing element enables input data to be reused across multiple computation periods. This paper presents the design and performance analysis of the proposed accelerator architecture. We examine the structure of the memory subsystems, as well as the architecture of the computing array, to supply the required data with minimal performance overhead. We explore efficient architectural design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wide (in the earlier layers of many CNNs), while input data locality can be exploited maximally when the number of output channels is large (in the later layers).
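As a rough illustration of the reuse arithmetic behind this result (an assumed model, not the paper's implementation; the function, the threads_per_pe parameter, and the layer shapes are hypothetical), the sketch below counts external input fetches when each processing element time-multiplexes several output channels over one fetched input value:

```python
# Illustrative sketch: count external input-activation fetches for one
# convolution layer. `threads_per_pe` models time-domain multithreading,
# where one fetched input serves several output channels before eviction.
# All names and the reuse model itself are assumptions for illustration.

def external_fetches(h, w, c_in, c_out, threads_per_pe=1):
    total_uses = h * w * c_in * c_out        # every (input, output-channel) pair
    reuse = min(threads_per_pe, c_out)       # reuse is bounded by channel count
    return total_uses // reuse

# An early, wide layer versus a late, channel-heavy layer (shapes assumed):
early = external_fetches(h=112, w=112, c_in=64,  c_out=64,  threads_per_pe=8)
late  = external_fetches(h=7,   w=7,   c_in=512, c_out=512, threads_per_pe=8)
print(early, late)   # both drop 8x, but the late layer has far more reuse to mine
```

Because reuse is bounded by the number of output channels, later layers with hundreds of channels offer the most reuse headroom, consistent with the evaluation result above.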
To optimize the data-stream paths of the coarse-grained reconfigurable architecture REMUS-II (Reconfigurable Multimedia System 2) so that it can deliver high-performance media decoding, the on-chip and off-chip memory access modules of REMUS-II are optimized according to the data access characteristics of media algorithms. On-chip memory access is optimized through access modes such as two-dimensional data transfer and transpose, improving on-chip data transfer efficiency by averages of 69.6% and 15.1%, respectively. Off-chip memory access is optimized with a block-cache design for reference frame accesses, reducing external memory access time by 37% on average. With this hierarchical memory design, the REMUS-II data stream meets the computational requirements and achieves real-time, high-profile H.264 and MPEG-2 decoding at 1920x1080 HD resolution with a 200 MHz clock frequency.
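The two on-chip access modes mentioned above can be pictured as address generation over a flat buffer. The sketch below is an assumed model (the descriptor fields are illustrative, not REMUS-II's actual DMA interface) showing a strided 2D block transfer and a transpose-mode read:

```python
import numpy as np

# Illustrative address generation for the two access modes: a strided 2D
# block transfer and a transpose-mode read over a flat, row-major buffer.
# The descriptor fields (base, stride, rows, cols) are assumptions.

def dma_2d(src, base, stride, rows, cols, transpose=False):
    tile = np.empty((rows, cols), dtype=src.dtype)
    for r in range(rows):
        for c in range(cols):
            tile[r, c] = src[base + r * stride + c]   # one descriptor covers many rows
    return tile.T if transpose else tile              # transpose mode: no extra copy pass

mem = np.arange(64 * 64, dtype=np.int32)              # a 64x64 frame, row-major
block   = dma_2d(mem, base=0, stride=64, rows=8, cols=8)                  # 8x8 macroblock
block_t = dma_2d(mem, base=0, stride=64, rows=8, cols=8, transpose=True)  # column-major view
```

A single (base, stride, rows, cols) descriptor replaces many separate 1D transfers, and transpose mode delivers the tile in the consumer's preferred orientation without an extra on-chip copy, which is the kind of saving the reported efficiency gains reflect.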
To apply quasi-cyclic low-density parity-check (QC-LDPC) codes to different scenarios, a data-stream-driven pipelined macro instruction set and a reconfigurable processor architecture are proposed for typical QC-LDPC algorithms. Data-level parallelism is improved by instructions that dynamically configure the multi-core computing units. In addition, an intelligent adjustment strategy based on a programmable wake-up controller (WuC) is designed so that the computing mode, operating voltage, and frequency of the QC-LDPC algorithm can be adjusted, improving the computing efficiency of the processor. The QC-LDPC processors are verified on the Xilinx ZCU102 field-programmable gate array (FPGA) board and their computing efficiency is measured. The experimental results indicate that the QC-LDPC processor can support two encoding lengths of three typical QC-LDPC algorithms and 20 adaptive operating modes of operating voltage and frequency. The maximum efficiency reaches 12.18 Gbit/(s·W), and the processor is more flexible than existing state-of-the-art QC-LDPC processors.
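The quasi-cyclic structure that such processors exploit can be made concrete with a small example: a base matrix of shift values expands into the full parity-check matrix, where each entry becomes a Z x Z cyclically shifted identity block. The base matrix below is illustrative, not one of the standards targeted by the processor:

```python
import numpy as np

# Expand a QC-LDPC base matrix of shift values into the full parity-check
# matrix H: each entry s >= 0 becomes a Z x Z identity cyclically shifted
# by s; -1 marks an all-zero block. The base matrix is illustrative only.

def expand_qc(base, Z):
    rows, cols = len(base), len(base[0])
    H = np.zeros((rows * Z, cols * Z), dtype=np.uint8)
    I = np.eye(Z, dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            s = base[r][c]
            if s >= 0:
                H[r*Z:(r+1)*Z, c*Z:(c+1)*Z] = np.roll(I, s, axis=1)
    return H

base = [[0, 2, -1, 1],
        [3, -1, 0, 2]]
H = expand_qc(base, Z=4)   # an 8 x 16 parity-check matrix built from circulants
```

Each circulant block can be processed independently as a cyclic shift, which is why per-block work maps naturally onto dynamically configured multi-core computing units.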
The computational capability of a coarse-grained reconfigurable array (CGRA) can be significantly restrained by data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to resolve this problem. One method loads the context into the CGRA at run time; it occupies very little on-chip memory but induces very large latency, which leads to low computational efficiency. The other method adopts a multi-context structure: the context is loaded into on-chip context memory at the boot phase, and broadcasting the pointer of a set of contexts changes the hardware configuration on a cycle-by-cycle basis. The size of the context memory induces a large area overhead in multi-context structures, which severely restricts application complexity. This paper proposes a Predictable Context Cache (PCC) architecture that addresses these context issues by buffering the context inside the CGRA. In this architecture, context is transferred into the CGRA dynamically. Utilizing a PCC significantly reduces the on-chip context memory, and the complexity of the applications running on the CGRA is no longer restricted by its size. For the data bandwidth issue, data preloading is the most frequently used approach to hide input data latency and speed up data transmission. Rather than fundamentally reducing the amount of input data, it processes data transfers and computations in parallel. However, data preloading cannot work efficiently because data transmission becomes the critical path as the reconfigurable array scales up. This paper therefore also presents a Hierarchical Data Memory (HDM) architecture as a solution to this efficiency problem. HDM provides high internal bandwidth to buffer both reused input data and intermediate data, relieving the external memory of the data transfer burden and significantly improving performance. Using PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%-19.48% with a reasonable memory size. H.264 high-profile video decoding at 1080p@35.7 fps can be achieved on the PCC and HDM architecture at a 200 MHz working frequency. Furthermore, the size of the on-chip context memory no longer restricts complex applications, which execute efficiently on the PCC and HDM architecture.
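A simplified model of context buffering inside the array (an assumed sketch, not the paper's PCC design; the direct-mapped organization and all sizes are hypothetical) shows why caching decouples application size from on-chip context memory:

```python
# Assumed model of a context cache inside the array: a small direct-mapped
# cache serves per-cycle context pointers and refills from external memory
# only on a miss, so application size is no longer bounded by on-chip
# context memory. Organization and sizes are hypothetical.

class ContextCache:
    def __init__(self, num_lines, backing_store):
        self.num_lines = num_lines
        self.backing = backing_store   # full context image in external memory
        self.lines = {}                # cache index -> (tag, context word)
        self.misses = 0

    def fetch(self, ptr):
        idx, tag = ptr % self.num_lines, ptr // self.num_lines
        line = self.lines.get(idx)
        if line is None or line[0] != tag:     # miss: refill from external memory
            self.misses += 1
            self.lines[idx] = (tag, self.backing[ptr])
        return self.lines[idx][1]

contexts = [f"cfg_{i}" for i in range(4096)]   # far more contexts than cache lines
cache = ContextCache(num_lines=512, backing_store=contexts)
for cycle in range(10_000):                    # a looping kernel reuses 300 contexts
    cache.fetch(cycle % 300)
print(cache.misses)                            # ~300 compulsory misses, then all hits
```

Looping kernels revisit the same context pointers, so most per-cycle dispatches hit in the small on-chip buffer and only misses go to external memory, which is the effect a context cache exploits.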
Funding (QC-LDPC processor): the National Key Research and Development Program of China (2019YFB1803600), the Key Scientific Research Program of the Shaanxi Provincial Department of Education (22JY059), and the China Civil Aviation Airworthiness Center Open Foundation (SH2021111903).
Funding (PCC/HDM architecture): supported by the National High Technology Research and Development Program of China (Grant No. 2012AA012701).