In the case of massive data,matrix operations are very computationally intensive,and the memory limitation in standalone mode leads to the system inefficiencies.At the same time,it is difficult for matrix operations t...In the case of massive data,matrix operations are very computationally intensive,and the memory limitation in standalone mode leads to the system inefficiencies.At the same time,it is difficult for matrix operations to achieve flexible switching between different requirements when implemented in hardware.To address this problem,this paper proposes a matrix operation accelerator based on reconfigurable arrays in the context of the application of recommender systems(RS).Based on the reconfigurable array processor(APR-16)with reconfiguration,a parallelized design of matrix operations on processing element(PE)array is realized with flexibility.The experimental results show that,compared with the proposed central processing unit(CPU)and graphics processing unit(GPU)hybrid implementation matrix multiplication framework,the energy efficiency ratio of the accelerator proposed in this paper is improved by about 35×.Compared with blocked alternating least squares(BALS),its the energy efficiency ratio has been accelerated by about 1×,and the switching of matrix factorization(MF)schemes suitable for different sparsity can be realized.展开更多
Computer vision(CV)is widely expected to be the next big thing in emerging applications.So many heterogeneous architectures for computer vision emerge.However,plenty of data need to be transferred between different st...Computer vision(CV)is widely expected to be the next big thing in emerging applications.So many heterogeneous architectures for computer vision emerge.However,plenty of data need to be transferred between different structures for heterogeneous architecture.The long data transfer delay becomes the mainly problem to limit the processing speed for computer vision applications.For reducing data transfer delay and fasting computer vision applications,a clustered data-driven array processor is proposed.A three-level pipelining processing element is designed which supports two-buffer data flow interface and 8 bits,16 bits,32 bits subtext parallel computation.At the same time,for accelerating transcendental function computation,a four-way shared pipelining transcendental function accelerator is designed,which is based on Y-intercept adjusted piecewise linear segment algorithm.A distributed shared memory structure based on unified addressing is also employed.To verify efficiency of architecture,some image processing algorithms are implemented on proposed architecture.Simultaneously the proposed architecture has been implemented on Xilinx ZC 706 development board.The same circuitry has been synthesized using SMIC 130 nm CMOS technology.The circuitry is able to run at 100 MHz.Area is 26.58 mm2.展开更多
Deep learning algorithms have been widely used in computer vision,natural language processing and other fields.However,due to the ever-increasing scale of the deep learning model,the requirements for storage and compu...Deep learning algorithms have been widely used in computer vision,natural language processing and other fields.However,due to the ever-increasing scale of the deep learning model,the requirements for storage and computing performance are getting higher and higher,and the processors based on the von Neumann architecture have gradually exposed significant shortcomings such as consumption and long latency.In order to alleviate this problem,large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model.A near-memory computing array architecture based on the shared buffer is proposed in this paper to improve system performance,which supports instructions with the characteristics of store-calculation integration,reducing the data movement between the processor and main memory.Through data reuse,the processing speed of the algorithm is further improved.The proposed architecture is verified and tested through the parallel realization of the convolutional neural network(CNN)algorithm.The experimental results show that at the frequency of 110 MHz,the calculation speed of a single convolution operation is increased by 66.64%on average compared with the CNN architecture that performs parallel calculations on field programmable gate array(FPGA).The processing speed of the whole convolution layer is improved by 8.81%compared with the reconfigurable array processor that does not support near-memory computing.展开更多
Memory access fast switching structures in cluster are studied,and three kinds of fast switching structures( FS,LR2 SS,and LAPS) are proposed. A mixed simulation test bench is constructed and used for statistic of d...Memory access fast switching structures in cluster are studied,and three kinds of fast switching structures( FS,LR2 SS,and LAPS) are proposed. A mixed simulation test bench is constructed and used for statistic of data access delay among these three structures in various cases. Finally these structures are realized on Xilinx FPGA development board and DCT,FFT,SAD,IME,FME,and de-blocking filtering algorithms are mapped onto the structures. Compared with available architectures,our proposed structures have lower data access delay and lower area.展开更多
Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing.This paper optimizes the depth-branch-resorting algorithm(DBR),and proposes a branch-alternation-re...Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing.This paper optimizes the depth-branch-resorting algorithm(DBR),and proposes a branch-alternation-resorting algorithm(BAR).In order to make the algorithm run in parallel and improve the efficiency of algorithm operation,the BAR algorithm is mapped onto the reconfigurable array processor(APR-16)to achieve vertex reordering,effectively improving the locality of graph data.This paper validates the BAR algorithm on the GraphBIG framework,by utilizing the reordered dataset with BAR on breadth-first search(BFS),single source shortest paht(SSSP)and betweenness centrality(BC)algorithms for traversal.The results show that compared with DBR and Corder algorithms,BAR can reduce execution time by up to 33.00%,and 51.00%seperatively.In terms of data movement,the BAR algorithm has a maximum reduction of 39.00%compared with the DBR algorithm and 29.66%compared with Corder algorithm.In terms of computational complexity,the BAR algorithm has a maximum reduction of 32.56%compared with DBR algorithm and53.05%compared with Corder algorithm.展开更多
After the extension of depth modeling mode 4(DMM-4)in 3D high efficiency video coding(3D-HEVC),the computational complexity increases sharply,which causes the real-time performance of video coding to be impacted.To re...After the extension of depth modeling mode 4(DMM-4)in 3D high efficiency video coding(3D-HEVC),the computational complexity increases sharply,which causes the real-time performance of video coding to be impacted.To reduce the computational complexity of DMM-4,a simplified hardware-friendly contour prediction algorithm is proposed in this paper.Based on the similarity between texture and depth map,the proposed algorithm directly codes depth blocks to calculate edge regions to reduce the number of reference blocks.Through the verification of the test sequence on HTM16.1,the proposed algorithm coding time is reduced by 9.42%compared with the original algorithm.To avoid the time consuming of serial coding on HTM,a parallelization design of the proposed algorithm based on reconfigurable array processor(DPR-CODEC)is proposed.The parallelization design reduces the storage access time,configuration time and saves the storage cost.Verified with the Xilinx Virtex 6 FPGA,experimental results show that parallelization design is capable of processing HD 1080p at a speed above 30 frames per second.Compared with the related work,the scheme reduces the LUTs by 42.3%,the REG by 85.5%and the hardware resources by 66.7%.The data loading speedup ratio of parallel scheme can reach 3.4539.On average,the different sized templates serial/parallel speedup ratio of encoding time can reach 2.446.展开更多
A systolic array architecture computer (FXCQ) has been designed for signal processing. R can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly...A systolic array architecture computer (FXCQ) has been designed for signal processing. R can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has the peak performance of 20 million floating-point operations per second (20MFLOPS). The machine therefore has a peak performance of 320 M FLOPS. It is integrated as an attached processor into a host system through VME bus interface. Programs for FXCQ are written in a high-level language -B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, B language and its compiler.展开更多
Many reconfiguration schemes for fault-tolerant binary tree architectures have been proposed in the lite..t.re[1-6]. The VLSI layouts of most previous studies are based on the classical H-tree layout, resulting in low...Many reconfiguration schemes for fault-tolerant binary tree architectures have been proposed in the lite..t.re[1-6]. The VLSI layouts of most previous studies are based on the classical H-tree layout, resulting in low area utilization and likely an unnecessarily high manufacturing cost simply due to the waste of a significaot portion of silicon area. In this paper, we present an area-efficient approach to the reconfigurable binary tree architecture. Area utilization and interconnection complexity of our design compare favorably with the other known approaches. In the reliability analysis, we take ioto arcount the faCt that accepted chips (after fabrication) are with dmereot degrees of redundancy initially, so as to obtain results which better reflect real situations.展开更多
The statistical performance of AR high resolution array processor in presence of correlated sensor signal fluctuation is studied. Mean square inverse beam pattern and pointing error are examined. Special attention is ...The statistical performance of AR high resolution array processor in presence of correlated sensor signal fluctuation is studied. Mean square inverse beam pattern and pointing error are examined. Special attention is paid to the effects of reference sensor and correlation between sensors. It is shown that fluctuation causes broadening or even distortion of the mean square inverse beam pattern. Phase fluctuation causes pointing error. Its standard variance is proportional to that of fluctuation and is related to the number of sensors of the array. Correlation between sensors has important effects on pointing error.展开更多
The evolution of chip architecture is discussed in this paper. Then MPP SoC architectures according to three kinds of computing paradigms are analyzed. Based on these discussions and analyses, array processor architec...The evolution of chip architecture is discussed in this paper. Then MPP SoC architectures according to three kinds of computing paradigms are analyzed. Based on these discussions and analyses, array processor architecture for unified change is presented, which could implement the simplification, effectiveness and versatility of both data level and non-data level parallel algorithm's programming.展开更多
In order to take into account the computing efficiency and flexibility of calculating transcendental functions, this paper proposes one kind of reconfigurable transcendental function generator. The generator is of a r...In order to take into account the computing efficiency and flexibility of calculating transcendental functions, this paper proposes one kind of reconfigurable transcendental function generator. The generator is of a reconfigurable array structure composed of 30 processing elements (PEs). The coordinate rotational digital computer (CORDIC) algorithm is implemented on this structure. Different functions, such as sine, cosine, inverse tangent, logarithmic, etc., can be calculated based on the structure by reconfiguring the functions of PEs. The functional simulation and field programmable gate array (FPGA) verification show that the proposed method obtains great flexibility with acceptable performance.展开更多
High efficiency video coding (HEVC) transform algorithm for residual coding uses 2-dimensional (2D) 4 × 4 transforms with higher precision than H.264's 4 ×4 transforms, resulting in increased hardware c...High efficiency video coding (HEVC) transform algorithm for residual coding uses 2-dimensional (2D) 4 × 4 transforms with higher precision than H.264's 4 ×4 transforms, resulting in increased hardware complexity. In this paper, we present a shared architecture that can compute the 4 ~4 forward discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) of HEVC using a new mapping scheme in the video processor array structure. The architecture is implemented with only adders and shills to an area-efficient design. The proposed architecture is synthesized using ISE 14.7 and implemented using the BEE4 platform with the Virtex-6 FF1759 LX550T field programmable gate array (FPGA). The result shows that the video processor array structure achieves a maximum operation frequency of 165.2 MHz. The architecture and its implementation are presented in this paper to demonstrate its programmable and high performance.展开更多
基金the National Key R&D Program of China(No.2022ZD0119001)the National Natural Science Foundation of China(No.61834005)+3 种基金the Shaanxi Province Key R&D Plan(No.2022GY-027)the Key Scientific Research Project of Shaanxi Department of Education(No.22JY060)the Education Research Project of Xi'an University of Posts and Telecommunications(No.JGA202108)the Graduate Student Innovation Fund of Xi’an University of Posts and Telecommunications(No.CXJJYL2022035).
文摘In the case of massive data,matrix operations are very computationally intensive,and the memory limitation in standalone mode leads to the system inefficiencies.At the same time,it is difficult for matrix operations to achieve flexible switching between different requirements when implemented in hardware.To address this problem,this paper proposes a matrix operation accelerator based on reconfigurable arrays in the context of the application of recommender systems(RS).Based on the reconfigurable array processor(APR-16)with reconfiguration,a parallelized design of matrix operations on processing element(PE)array is realized with flexibility.The experimental results show that,compared with the proposed central processing unit(CPU)and graphics processing unit(GPU)hybrid implementation matrix multiplication framework,the energy efficiency ratio of the accelerator proposed in this paper is improved by about 35×.Compared with blocked alternating least squares(BALS),its the energy efficiency ratio has been accelerated by about 1×,and the switching of matrix factorization(MF)schemes suitable for different sparsity can be realized.
基金the National Natural Science Foundation of China(No.61802304,61834005,61772417,61634004,61602377)Shaanxi Provincial Co-ordination Innovation Project of Science and Technology(No.2016KTZDGY02-04-02)+1 种基金Shaanxi Provincial Key R&D Plan(No.2017GY-060)Shaanxi International Science and Technology Cooperation Program(No.2018KW-006).
文摘Computer vision(CV)is widely expected to be the next big thing in emerging applications.So many heterogeneous architectures for computer vision emerge.However,plenty of data need to be transferred between different structures for heterogeneous architecture.The long data transfer delay becomes the mainly problem to limit the processing speed for computer vision applications.For reducing data transfer delay and fasting computer vision applications,a clustered data-driven array processor is proposed.A three-level pipelining processing element is designed which supports two-buffer data flow interface and 8 bits,16 bits,32 bits subtext parallel computation.At the same time,for accelerating transcendental function computation,a four-way shared pipelining transcendental function accelerator is designed,which is based on Y-intercept adjusted piecewise linear segment algorithm.A distributed shared memory structure based on unified addressing is also employed.To verify efficiency of architecture,some image processing algorithms are implemented on proposed architecture.Simultaneously the proposed architecture has been implemented on Xilinx ZC 706 development board.The same circuitry has been synthesized using SMIC 130 nm CMOS technology.The circuitry is able to run at 100 MHz.Area is 26.58 mm2.
基金Supported by the National Natural Science Foundation of China(No.61802304,61834005,61772417,61602377)the Shaanxi Province KeyR&D Plan(No.2021GY-029)。
文摘Deep learning algorithms have been widely used in computer vision,natural language processing and other fields.However,due to the ever-increasing scale of the deep learning model,the requirements for storage and computing performance are getting higher and higher,and the processors based on the von Neumann architecture have gradually exposed significant shortcomings such as consumption and long latency.In order to alleviate this problem,large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model.A near-memory computing array architecture based on the shared buffer is proposed in this paper to improve system performance,which supports instructions with the characteristics of store-calculation integration,reducing the data movement between the processor and main memory.Through data reuse,the processing speed of the algorithm is further improved.The proposed architecture is verified and tested through the parallel realization of the convolutional neural network(CNN)algorithm.The experimental results show that at the frequency of 110 MHz,the calculation speed of a single convolution operation is increased by 66.64%on average compared with the CNN architecture that performs parallel calculations on field programmable gate array(FPGA).The processing speed of the whole convolution layer is improved by 8.81%compared with the reconfigurable array processor that does not support near-memory computing.
基金Supported by the National Natural Science Foundation of China(61272120,61634004,61602377)the Shaanxi Provincial Co-ordination Innovation Project of Science and Technology(2016KTZDGY02-04-02)Scientific Research Program Funded by Shannxi Provincial Education Department(17JK0689)
文摘Memory access fast switching structures in cluster are studied,and three kinds of fast switching structures( FS,LR2 SS,and LAPS) are proposed. A mixed simulation test bench is constructed and used for statistic of data access delay among these three structures in various cases. Finally these structures are realized on Xilinx FPGA development board and DCT,FFT,SAD,IME,FME,and de-blocking filtering algorithms are mapped onto the structures. Compared with available architectures,our proposed structures have lower data access delay and lower area.
基金the National Key R&D Program of China(No.2022ZD0119001)the National Natural Science Foundation of China(No.61834005)+3 种基金the Shaanxi Province Key R&D Plan(No.2022GY-027)the Key Scientific Research Project of Shaanxi Department of Education(No.22JY060)the Education Research Project of XUPT(No.JGA202108)the Graduate Student Innovation Fund of Xi'an University of Posts and Telecommunications(No.CXJJZL2022011)。
文摘Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing.This paper optimizes the depth-branch-resorting algorithm(DBR),and proposes a branch-alternation-resorting algorithm(BAR).In order to make the algorithm run in parallel and improve the efficiency of algorithm operation,the BAR algorithm is mapped onto the reconfigurable array processor(APR-16)to achieve vertex reordering,effectively improving the locality of graph data.This paper validates the BAR algorithm on the GraphBIG framework,by utilizing the reordered dataset with BAR on breadth-first search(BFS),single source shortest paht(SSSP)and betweenness centrality(BC)algorithms for traversal.The results show that compared with DBR and Corder algorithms,BAR can reduce execution time by up to 33.00%,and 51.00%seperatively.In terms of data movement,the BAR algorithm has a maximum reduction of 39.00%compared with the DBR algorithm and 29.66%compared with Corder algorithm.In terms of computational complexity,the BAR algorithm has a maximum reduction of 32.56%compared with DBR algorithm and53.05%compared with Corder algorithm.
基金Supported by the National Natural Science Foundation of China(No.61834005,61772417,61802304,61602377,61874087,61634004)the Shaanxi Province Key R&D Plan(No.2020JM-525,2021GY-029,2021KW-16)。
文摘After the extension of depth modeling mode 4(DMM-4)in 3D high efficiency video coding(3D-HEVC),the computational complexity increases sharply,which causes the real-time performance of video coding to be impacted.To reduce the computational complexity of DMM-4,a simplified hardware-friendly contour prediction algorithm is proposed in this paper.Based on the similarity between texture and depth map,the proposed algorithm directly codes depth blocks to calculate edge regions to reduce the number of reference blocks.Through the verification of the test sequence on HTM16.1,the proposed algorithm coding time is reduced by 9.42%compared with the original algorithm.To avoid the time consuming of serial coding on HTM,a parallelization design of the proposed algorithm based on reconfigurable array processor(DPR-CODEC)is proposed.The parallelization design reduces the storage access time,configuration time and saves the storage cost.Verified with the Xilinx Virtex 6 FPGA,experimental results show that parallelization design is capable of processing HD 1080p at a speed above 30 frames per second.Compared with the related work,the scheme reduces the LUTs by 42.3%,the REG by 85.5%and the hardware resources by 66.7%.The data loading speedup ratio of parallel scheme can reach 3.4539.On average,the different sized templates serial/parallel speedup ratio of encoding time can reach 2.446.
文摘A systolic array architecture computer (FXCQ) has been designed for signal processing. R can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has the peak performance of 20 million floating-point operations per second (20MFLOPS). The machine therefore has a peak performance of 320 M FLOPS. It is integrated as an attached processor into a host system through VME bus interface. Programs for FXCQ are written in a high-level language -B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, B language and its compiler.
文摘Many reconfiguration schemes for fault-tolerant binary tree architectures have been proposed in the lite..t.re[1-6]. The VLSI layouts of most previous studies are based on the classical H-tree layout, resulting in low area utilization and likely an unnecessarily high manufacturing cost simply due to the waste of a significaot portion of silicon area. In this paper, we present an area-efficient approach to the reconfigurable binary tree architecture. Area utilization and interconnection complexity of our design compare favorably with the other known approaches. In the reliability analysis, we take ioto arcount the faCt that accepted chips (after fabrication) are with dmereot degrees of redundancy initially, so as to obtain results which better reflect real situations.
文摘The statistical performance of AR high resolution array processor in presence of correlated sensor signal fluctuation is studied. Mean square inverse beam pattern and pointing error are examined. Special attention is paid to the effects of reference sensor and correlation between sensors. It is shown that fluctuation causes broadening or even distortion of the mean square inverse beam pattern. Phase fluctuation causes pointing error. Its standard variance is proportional to that of fluctuation and is related to the number of sensors of the array. Correlation between sensors has important effects on pointing error.
文摘The evolution of chip architecture is discussed in this paper. Then MPP SoC architectures according to three kinds of computing paradigms are analyzed. Based on these discussions and analyses, array processor architecture for unified change is presented, which could implement the simplification, effectiveness and versatility of both data level and non-data level parallel algorithm's programming.
基金supported by the National Natural Science Foundation of China(61272120,61602377,61634004)the Natural Science Foundation of Shaanxi Province of China(2015JM6326)+1 种基金Shaanxi Provincial Co-ordination Innovation Project of Science and Technology(2016KTZDGY02-04-02)the Project of Education Department of Shaanxi Provincial Government(15JK1683)
文摘In order to take into account the computing efficiency and flexibility of calculating transcendental functions, this paper proposes one kind of reconfigurable transcendental function generator. The generator is of a reconfigurable array structure composed of 30 processing elements (PEs). The coordinate rotational digital computer (CORDIC) algorithm is implemented on this structure. Different functions, such as sine, cosine, inverse tangent, logarithmic, etc., can be calculated based on the structure by reconfiguring the functions of PEs. The functional simulation and field programmable gate array (FPGA) verification show that the proposed method obtains great flexibility with acceptable performance.
基金supported by the National Natural Science Foundation of China (61272120,61602377,61634004)the Shaanxi Provincial Co-Ordination Innovation Project of Science and Technology (2016KTZDGY02-04-02)the National Science and Technology Major Project of China (2016ZX03001003-006)
文摘High efficiency video coding (HEVC) transform algorithm for residual coding uses 2-dimensional (2D) 4 × 4 transforms with higher precision than H.264's 4 ×4 transforms, resulting in increased hardware complexity. In this paper, we present a shared architecture that can compute the 4 ~4 forward discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) of HEVC using a new mapping scheme in the video processor array structure. The architecture is implemented with only adders and shills to an area-efficient design. The proposed architecture is synthesized using ISE 14.7 and implemented using the BEE4 platform with the Virtex-6 FF1759 LX550T field programmable gate array (FPGA). The result shows that the video processor array structure achieves a maximum operation frequency of 165.2 MHz. The architecture and its implementation are presented in this paper to demonstrate its programmable and high performance.