12 articles found
1. Design of a clustered data-driven array processor for computer vision (cited by: 2)
Authors: Shan Rui, Deng Junyong, Jiang Lin, Zhu Yun, Wu Haoyue, He Feilong. High Technology Letters (EI, CAS), 2020, No. 4, pp. 424-434 (11 pages)
Computer vision (CV) is widely expected to be the next big thing in emerging applications, and many heterogeneous architectures for computer vision have emerged. However, a heterogeneous architecture must transfer large amounts of data between its different structures, and the long data transfer delay becomes the main problem limiting the processing speed of computer vision applications. To reduce the data transfer delay and speed up computer vision applications, a clustered data-driven array processor is proposed. A three-level pipelined processing element is designed that supports a two-buffer data flow interface and 8-bit, 16-bit, and 32-bit subword parallel computation. To accelerate transcendental function computation, a four-way shared pipelined transcendental function accelerator is also designed, based on a Y-intercept adjusted piecewise linear segment algorithm. A distributed shared memory structure based on unified addressing is employed as well. To verify the efficiency of the architecture, several image processing algorithms are implemented on it. The proposed architecture has been implemented on a Xilinx ZC706 development board, and the same circuitry has been synthesized using SMIC 130 nm CMOS technology. The circuitry runs at 100 MHz and occupies an area of 26.58 mm².
Keywords: array processor, data-driven, adjacent interconnection, distributed memory, computer vision (CV)
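As a rough illustration of the piecewise linear idea named in the abstract above, the following Python sketch approximates a function with one chord per segment and then shifts each segment's intercept so that the error is balanced around zero. This is only one plausible reading of "Y-intercept adjusted"; the listing does not describe the paper's actual algorithm, and the names `build_segments`, `evaluate`, and `max_abs_error` are hypothetical.

```python
import math


def build_segments(f, lo, hi, n, adjust=True, samples=64):
    """Return one (slope, intercept) pair per segment of [lo, hi].

    Each segment starts as the chord through its endpoints. With
    adjust=True the intercept is shifted by the midpoint of the sampled
    signed error, which balances the error around zero (an assumed
    interpretation of "Y-intercept adjusted").
    """
    width = (hi - lo) / n
    segments = []
    for i in range(n):
        a = lo + i * width
        b = a + width
        slope = (f(b) - f(a)) / width
        intercept = f(a) - slope * a
        if adjust:
            xs = [a + (b - a) * k / samples for k in range(samples + 1)]
            errors = [f(x) - (slope * x + intercept) for x in xs]
            intercept += (max(errors) + min(errors)) / 2
        segments.append((slope, intercept))
    return segments


def evaluate(segments, lo, hi, x):
    """Evaluate the piecewise linear approximation at x in [lo, hi]."""
    n = len(segments)
    index = min(int((x - lo) / (hi - lo) * n), n - 1)
    slope, intercept = segments[index]
    return slope * x + intercept


def max_abs_error(f, segments, lo, hi, points=2000):
    """Largest absolute error over a uniform grid of test points."""
    grid = (lo + (hi - lo) * k / points for k in range(points + 1))
    return max(abs(f(x) - evaluate(segments, lo, hi, x)) for x in grid)
```

For a function that is concave or convex on each segment, the chord error is one-sided, so the intercept shift roughly halves the worst-case error without adding segments. A hardware version would store the slope and intercept pairs in a small table and use fixed-point arithmetic, which this sketch does not model.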
2. Design and Implementation of Memory Access Fast Switching Structure in Cluster-Based Reconfigurable Array Processor
Authors: Rui Shan, Lin Jiang, Junyong Deng, Xueting Li, Xubang Shen. Journal of Beijing Institute of Technology (EI, CAS), 2017, No. 4, pp. 494-504 (11 pages)
Memory access fast switching structures in a cluster are studied, and three kinds of fast switching structures (FS, LR2SS, and LAPS) are proposed. A mixed simulation test bench is constructed and used to gather statistics on data access delay for these three structures in various cases. Finally, the structures are realized on a Xilinx FPGA development board, and DCT, FFT, SAD, IME, FME, and de-blocking filtering algorithms are mapped onto them. Compared with available architectures, the proposed structures have lower data access delay and smaller area.
Keywords: array processor, distributed memory, memory access, switching structure
3. Research and design of matrix operation accelerator based on reconfigurable array
Authors: DENG Junyong (邓军勇), ZHANG Pan, JIANG Lin, XIE Xiaoyan, DENG Jingwen. High Technology Letters (EI, CAS), 2024, No. 2, pp. 128-137 (10 pages)
With massive data, matrix operations are very computationally intensive, and the memory limitation of standalone mode makes the system inefficient. At the same time, it is difficult for matrix operations implemented in hardware to switch flexibly between different requirements. To address these problems, this paper proposes a matrix operation accelerator based on reconfigurable arrays in the context of recommender systems (RS). Based on the reconfigurable array processor (APR-16), a flexible parallelized design of matrix operations on the processing element (PE) array is realized. The experimental results show that, compared with a previously proposed hybrid central processing unit (CPU) and graphics processing unit (GPU) matrix multiplication framework, the energy efficiency ratio of the proposed accelerator is improved by about 35×. Compared with blocked alternating least squares (BALS), its energy efficiency ratio is improved by about 1×, and switching between matrix factorization (MF) schemes suited to different sparsity levels can be realized.
Keywords: matrix factorization (MF), recommender system (RS), array processor, reconfigurable, matrix multiplication
4. THE EFFECTS OF CORRELATED SENSOR SIGNAL FLUCTUATION ON THE STATISTICAL PERFORMANCE OF AN AR HIGH RESOLUTION ARRAY PROCESSOR
Chinese Journal of Acoustics, 1989, No. 3, pp. 209-218 (10 pages)
The statistical performance of an AR high resolution array processor in the presence of correlated sensor signal fluctuation is studied. The mean square inverse beam pattern and the pointing error are examined. Special attention is paid to the effects of the reference sensor and of the correlation between sensors. It is shown that fluctuation causes broadening or even distortion of the mean square inverse beam pattern. Phase fluctuation causes pointing error, whose standard variance is proportional to that of the fluctuation and is related to the number of sensors in the array. Correlation between sensors has important effects on the pointing error.
Keywords: AR high resolution array processor, correlated sensor signal fluctuation, statistical performance
5. BAR: a branch-alternation-resorting algorithm for locality exploration in graph processing
Authors: DENG Junyong (邓军勇), WANG Junjie, JIANG Lin, XIE Xiaoyan, ZHOU Kai. High Technology Letters (EI, CAS), 2024, No. 1, pp. 31-42 (12 pages)
Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing. This paper optimizes the depth-branch-resorting (DBR) algorithm and proposes a branch-alternation-resorting (BAR) algorithm. To run the algorithm in parallel and improve its efficiency, the BAR algorithm is mapped onto the reconfigurable array processor (APR-16) to perform vertex reordering, effectively improving the locality of graph data. The BAR algorithm is validated on the GraphBIG framework by running breadth-first search (BFS), single source shortest path (SSSP), and betweenness centrality (BC) traversals on datasets reordered with BAR. The results show that, compared with the DBR and Corder algorithms, BAR reduces execution time by up to 33.00% and 51.00%, respectively. In terms of data movement, BAR achieves a maximum reduction of 39.00% compared with DBR and 29.66% compared with Corder. In terms of computational complexity, BAR achieves a maximum reduction of 32.56% compared with DBR and 53.05% compared with Corder.
Keywords: graph processing, vertex reordering, branch-alternation-resorting algorithm (BAR), reconfigurable array processor
6. Design and implementation of near-memory computing array architecture based on shared buffer (cited by: 1)
Authors: SHAN Rui, GAO Xu, FENG Yani, HUI Chao, CUI Xinyue, CHAI Miaomiao. High Technology Letters (EI, CAS), 2022, No. 4, pp. 345-353 (9 pages)
Deep learning algorithms have been widely used in computer vision, natural language processing, and other fields. However, as deep learning models keep growing in scale, the requirements for storage and computing performance keep rising, and processors based on the von Neumann architecture have gradually exposed significant shortcomings such as high power consumption and long latency. To alleviate this problem, large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model. This paper proposes a near-memory computing array architecture based on a shared buffer to improve system performance. It supports instructions that integrate storage and calculation, reducing data movement between the processor and main memory. Data reuse further improves the processing speed of the algorithm. The proposed architecture is verified and tested through a parallel implementation of the convolutional neural network (CNN) algorithm. The experimental results show that, at a frequency of 110 MHz, the calculation speed of a single convolution operation is increased by 66.64% on average compared with a CNN architecture that performs parallel calculations on a field programmable gate array (FPGA). The processing speed of the whole convolution layer is improved by 8.81% compared with a reconfigurable array processor that does not support near-memory computing.
Keywords: near-memory computing, shared buffer, reconfigurable array processor, convolutional neural network (CNN)
7. A High Speed Signal Processing Machine - Its Architecture, Language and Compiler
Authors: Wang Yufei and Yu Shiqi (Beijing Institute of Data Processing Technology, P.O. Box 3927, Beijing 100039, China). Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 1991, No. 1, pp. 119-128 (10 pages)
A systolic array architecture computer (FXCQ) has been designed for signal processing. It can handle floating-point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has a peak performance of 20 million floating-point operations per second (20 MFLOPS), so the machine has a peak performance of 16 × 20 = 320 MFLOPS. It is integrated as an attached processor into a host system through a VME bus interface. Programs for FXCQ are written in a high-level language, the B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, the B language, and its compiler.
Keywords: parallel processing, systolic array processor, parallel language, compiler
8. A simplified hardware-friendly contour prediction algorithm in 3D-HEVC and parallelization design
Authors: JIANG Lin, DUAN Xueyao, XIE Xiaoyan. High Technology Letters (EI, CAS), 2022, No. 4, pp. 392-400 (9 pages)
After the extension of depth modeling mode 4 (DMM-4) in 3D high efficiency video coding (3D-HEVC), the computational complexity increases sharply, which harms the real-time performance of video coding. To reduce the computational complexity of DMM-4, this paper proposes a simplified hardware-friendly contour prediction algorithm. Based on the similarity between the texture and the depth map, the proposed algorithm directly codes depth blocks to calculate edge regions, reducing the number of reference blocks. Verification with the test sequences on HTM16.1 shows that the coding time of the proposed algorithm is reduced by 9.42% compared with the original algorithm. To avoid the time cost of serial coding on HTM, a parallelization design of the proposed algorithm based on a reconfigurable array processor (DPR-CODEC) is proposed. The parallelization design reduces the storage access time and configuration time and saves storage cost. Verified on a Xilinx Virtex 6 FPGA, experimental results show that the parallelization design can process HD 1080p at a speed above 30 frames per second. Compared with related work, the scheme reduces the LUTs by 42.3%, the REGs by 85.5%, and the hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme reaches 3.4539. On average, the serial/parallel speedup ratio of encoding time across templates of different sizes reaches 2.446.
Keywords: depth modeling mode 4 (DMM-4), contour prediction, 3D high efficiency video coding (3D-HEVC), parallelization, reconfigurable array processor
9. Embedding Binary Tree in VLSI/WSI Processor Array
Author: 陈宗汉 (Chen Zonghan). Journal of Computer Science & Technology (SCIE, EI, CSCD), 1996, No. 3, pp. 326-336 (11 pages)
Many reconfiguration schemes for fault-tolerant binary tree architectures have been proposed in the literature [1-6]. The VLSI layouts of most previous studies are based on the classical H-tree layout, resulting in low area utilization and likely an unnecessarily high manufacturing cost simply due to the waste of a significant portion of silicon area. In this paper, we present an area-efficient approach to the reconfigurable binary tree architecture. The area utilization and interconnection complexity of our design compare favorably with the other known approaches. In the reliability analysis, we take into account the fact that accepted chips (after fabrication) initially have different degrees of redundancy, so as to obtain results that better reflect real situations.
Keywords: binary tree embedding, VLSI/WSI processor array
10. Evolution of MPP SoC architecture techniques (cited by: 7)
Author: SHEN XuBang. Science in China (Series F), 2008, No. 6, pp. 756-764 (9 pages)
The evolution of chip architecture is discussed in this paper. MPP SoC architectures are then analyzed according to three kinds of computing paradigms. Based on these discussions and analyses, an array processor architecture for unified change is presented, which could make the programming of both data-level and non-data-level parallel algorithms simple, effective, and versatile.
Keywords: MPP, SoC, array processor, architecture
11. Design of a reconfigurable transcendental function generator
Authors: Jiang Lin, Lü Qing, Xie Xiaoyan, Shan Rui, Deng Junyong. The Journal of China Universities of Posts and Telecommunications (EI, CSCD), 2017, No. 1, pp. 96-102 (7 pages)
To balance computing efficiency and flexibility in calculating transcendental functions, this paper proposes a reconfigurable transcendental function generator. The generator has a reconfigurable array structure composed of 30 processing elements (PEs). The coordinate rotational digital computer (CORDIC) algorithm is implemented on this structure. Different functions, such as sine, cosine, inverse tangent, and logarithm, can be calculated on the structure by reconfiguring the functions of the PEs. Functional simulation and field programmable gate array (FPGA) verification show that the proposed method obtains great flexibility with acceptable performance.
Keywords: reconfigurable computing, reconfigurable transcendental function generator, CORDIC, array processor
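For readers unfamiliar with CORDIC, the following Python sketch shows the standard rotation-mode iteration for sine and cosine. It is a textbook version using floating-point arithmetic, not the paper's 30-PE design; hardware implementations replace the multiplications by powers of two with shifts on fixed-point values. The function name `cordic_sin_cos` and the iteration count are illustrative assumptions.

```python
import math


def cordic_sin_cos(theta, iterations=24):
    """Rotation-mode CORDIC for theta in [-pi/2, pi/2].

    Returns (sin(theta), cos(theta)) to roughly 2**-iterations accuracy.
    """
    # Elementary rotation angles atan(2^-i); hardware keeps these in a table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i), so the
    # start vector is pre-scaled by the inverse of the total gain.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0      # rotate toward zero residual angle
        factor = 2.0 ** -i               # a right shift by i in fixed point
        x, y = x - d * y * factor, y + d * x * factor
        z -= d * angles[i]
    return y, x
```

The same iteration, run in vectoring mode or with hyperbolic rotation angles, yields inverse tangent and logarithm, which is presumably what makes the algorithm attractive for a reconfigurable generator.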
12. High performance architecture for unified forward and inverse transform of HEVC
Authors: Jiang Lin, Wang Xingjun, Wu Xin, Deng Junyong, Huang Xingjie. The Journal of China Universities of Posts and Telecommunications (EI, CSCD), 2017, No. 3, pp. 16-23 (8 pages)
The high efficiency video coding (HEVC) transform algorithm for residual coding uses 2-dimensional (2D) 4 × 4 transforms with higher precision than H.264's 4 × 4 transforms, resulting in increased hardware complexity. In this paper, we present a shared architecture that can compute the 4 × 4 forward discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) of HEVC using a new mapping scheme in the video processor array structure. The architecture is implemented with only adders and shifts, yielding an area-efficient design. The proposed architecture is synthesized using ISE 14.7 and implemented on the BEE4 platform with the Virtex-6 FF1759 LX550T field programmable gate array (FPGA). The results show that the video processor array structure achieves a maximum operating frequency of 165.2 MHz. The architecture and its implementation are presented in this paper to demonstrate its programmability and high performance.
Keywords: HEVC, forward and inverse transform, reconfigurable architecture, video processor array structure
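To make the "only adders and shifts" remark concrete, the sketch below computes a one-dimensional 4-point HEVC forward core transform with a butterfly in which every multiplication by 64, 83, or 36 is decomposed into shifts and additions. The coefficient matrix is the HEVC 4 × 4 core transform matrix; the particular decomposition, the function names, and the omission of the scaling and rounding stages are assumptions of this illustration, not details taken from the paper.

```python
# HEVC 4x4 forward core transform matrix (integer DCT approximation).
HEVC_DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def mul64(v):
    return v << 6                              # 64 = 2^6


def mul83(v):
    return (v << 6) + (v << 4) + (v << 1) + v  # 83 = 64 + 16 + 2 + 1


def mul36(v):
    return (v << 5) + (v << 2)                 # 36 = 32 + 4


def forward_dct4_shift_add(src):
    """1D 4-point forward transform using only additions and shifts."""
    e0, e1 = src[0] + src[3], src[1] + src[2]  # even (symmetric) part
    o0, o1 = src[0] - src[3], src[1] - src[2]  # odd (antisymmetric) part
    return [
        mul64(e0) + mul64(e1),
        mul83(o0) + mul36(o1),
        mul64(e0) - mul64(e1),
        mul36(o0) - mul83(o1),
    ]


def forward_dct4_matrix(src):
    """Reference result by direct matrix-vector multiplication."""
    return [sum(c * s for c, s in zip(row, src)) for row in HEVC_DCT4]
```

A 2D 4 × 4 transform applies this 1D stage to the rows and then to the columns, with an intermediate right shift for scaling that is left out here. Because the inverse transform uses the transposed matrix with the same three constants, the shift-and-add multipliers can be shared between the forward and inverse paths.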