9 articles found
Design and implementation of dual-mode configurable memory architecture for CNN accelerator
1
Authors: 山蕊, LI Xiaoshuo, GAO Xu, HUO Ziqing. High Technology Letters (EI, CAS), 2024, No. 2, pp. 211-220, 10 pages
With the rapid development of deep learning algorithms, computational complexity and functional diversity are increasing rapidly. However, the gap between high computational density and insufficient memory bandwidth under the traditional von Neumann architecture is widening. An analysis of the algorithmic characteristics of convolutional neural networks (CNN) shows that the access characteristics of convolution (CONV) and fully connected (FC) operations are very different. Based on this feature, a dual-mode reconfigurable distributed memory architecture for a CNN accelerator is designed. It can be configured in Bank mode or first input first output (FIFO) mode to accommodate the access needs of different operations. In addition, a programmable memory control unit is designed, which effectively controls the dual-mode configurable distributed memory architecture through customized special access instructions and reduces the data access delay. The proposed architecture is verified and tested by parallel implementation of several CNN algorithms. The experimental results show that the peak bandwidth can reach 13.44 GB·s^(-1) at an operating frequency of 120 MHz. This work achieves 1.40, 1.12, 2.80 and 4.70 times the peak bandwidth of existing works.
Keywords: distributed memory structure; neural network accelerator; reconfigurable array processor; configurable memory structure
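As a quick sanity check on the figures quoted in this abstract, the sketch below derives the number of bytes moved per clock cycle at peak bandwidth and the peak bandwidths implied for the four compared designs. It assumes that GB means 10^9 bytes and that the quoted ratios are exact; the abstract states neither, and the function and variable names are illustrative only.

```python
# Figures quoted in the abstract above.
PEAK_BANDWIDTH_GB_S = 13.44                    # peak bandwidth, GB/s
CLOCK_MHZ = 120.0                              # operating frequency, MHz
RATIOS_VS_EXISTING = [1.40, 1.12, 2.80, 4.70]  # this work vs. each existing work


def bytes_per_cycle(bandwidth_gb_s: float, clock_mhz: float) -> float:
    """Bytes moved per clock cycle, assuming 1 GB = 1e9 bytes (an assumption)."""
    return (bandwidth_gb_s * 1e9) / (clock_mhz * 1e6)


def implied_baselines(bandwidth_gb_s: float, ratios: list[float]) -> list[float]:
    """Peak bandwidth (GB/s) each compared design would have for the quoted ratio."""
    return [bandwidth_gb_s / r for r in ratios]


if __name__ == "__main__":
    per_cycle = bytes_per_cycle(PEAK_BANDWIDTH_GB_S, CLOCK_MHZ)
    print(f"{per_cycle:.1f} bytes per cycle")
    baselines = implied_baselines(PEAK_BANDWIDTH_GB_S, RATIOS_VS_EXISTING)
    for ratio, base in zip(RATIOS_VS_EXISTING, baselines):
        print(f"{ratio:.2f}x -> {base:.2f} GB/s")
```

Under these assumptions the architecture moves 112 bytes (896 bits) per cycle at peak.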
Design and Implementation of Memory Access Fast Switching Structure in Cluster-Based Reconfigurable Array Processor
2
Authors: Rui Shan, Lin Jiang, Junyong Deng, Xueting Li, Xubang Shen. Journal of Beijing Institute of Technology (EI, CAS), 2017, No. 4, pp. 494-504, 11 pages
Memory access fast switching structures in a cluster are studied, and three kinds of fast switching structures (FS, LR2SS, and LAPS) are proposed. A mixed simulation test bench is constructed and used to collect statistics on data access delay for these three structures in various cases. Finally, the structures are implemented on a Xilinx FPGA development board, and DCT, FFT, SAD, IME, FME, and de-blocking filtering algorithms are mapped onto them. Compared with available architectures, the proposed structures have lower data access delay and smaller area.
Keywords: array processor; distributed memory; memory access; switching structure
Parallel Computation of Fourier Transform on Distributed Memory Computer System
3
Authors: Yihui Yan, Qingfeng Hu, Xinfang He. Wuhan University Journal of Natural Sciences (CAS), 1996, No. Z1, pp. 557-560, 4 pages
Multicomputer systems (distributed memory computer systems) are becoming more and more popular and will be widely used in scientific research. In this paper, we present a parallel algorithm for the Fourier Transform of a vector of complex numbers on a multicomputer system, and give its computing times and speedup in a parallel environment supported by the EXPRESS system on a multicomputer system consisting of four SGI workstations. Our analysis shows that the results are ideal and that this scheme is suitable for multicomputer systems.
Keywords: Fourier Transform; Distributed Memory Computer System; Parallel Computing
Design of a clustered data-driven array processor for computer vision (Cited by: 2)
4
Authors: 山蕊, Deng Junyong, Jiang Lin, Zhu Yun, Wu Haoyue, He Feilong. High Technology Letters (EI, CAS), 2020, No. 4, pp. 424-434, 11 pages
Computer vision (CV) is widely expected to be the next big thing in emerging applications, and many heterogeneous architectures for computer vision have emerged. However, a large amount of data needs to be transferred between the different structures of a heterogeneous architecture, and the long data transfer delay becomes the main problem limiting the processing speed of computer vision applications. To reduce data transfer delay and accelerate computer vision applications, a clustered data-driven array processor is proposed. A three-level pipelined processing element is designed, which supports a two-buffer data flow interface and 8-bit, 16-bit and 32-bit subtext parallel computation. In addition, to accelerate transcendental function computation, a four-way shared pipelined transcendental function accelerator is designed, based on a Y-intercept adjusted piecewise linear segment algorithm. A distributed shared memory structure based on unified addressing is also employed. To verify the efficiency of the architecture, several image processing algorithms are implemented on it. The proposed architecture has also been implemented on a Xilinx ZC706 development board, and the same circuitry has been synthesized using SMIC 130 nm CMOS technology. The circuitry is able to run at 100 MHz, and its area is 26.58 mm^2.
Keywords: array processor; data-driven; adjacent interconnection; distributed memory; computer vision (CV)
Optimized Parallel Execution of Declarative Programs on Distributed Memory Multiprocessors
5
Authors: 沈美明, 田新民, 王鼎兴, 郑纬民, 温冬婵. Journal of Computer Science & Technology (SCIE, EI, CSCD), 1993, No. 3, pp. 233-242, 10 pages
In this paper, we focus on the compiled implementation of the parallel logic language PARLOG and the functional language ML on distributed memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is first presented. Then, based on HPGREM, a parallel abstract machine, PAM/TGR, is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.
Keywords: declarative language; parallel graph rewriting execution model; optimized parallel compiler; distributed memory multiprocessors; parallel abstract machine
Resource abstraction and data placement for distributed hybrid memory pool
6
Authors: Tingting CHEN, Haikun LIU, Xiaofei LIAO, Hai JIN. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, No. 3, pp. 47-57, 11 pages
Emerging byte-addressable non-volatile memory (NVM) technologies offer higher density and lower cost than DRAM, at the expense of lower performance and limited write endurance. There have been many studies on hybrid NVM/DRAM memory management in a single physical server. However, how to manage hybrid memories efficiently in a distributed environment remains an open problem. This paper proposes Alloy, a memory resource abstraction and data placement strategy for an RDMA-enabled distributed hybrid memory pool (DHMP). Alloy provides simple APIs for applications to utilize DRAM or NVM resources in the DHMP without being aware of its hardware details. We propose a hotness-aware data placement scheme, which combines hot data migration, data replication and write merging to improve application performance and reduce the cost of DRAM. We evaluate Alloy with several micro-benchmark workloads and public benchmark workloads. Experimental results show that Alloy can reduce DRAM usage in the DHMP by up to 95%, while reducing the total memory access time by up to 57% compared with state-of-the-art approaches.
Keywords: load balance; distributed hybrid memory; clouds
Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers (Cited by: 1)
7
Authors: Benjamín Sahelices, Agustín de Dios, Pablo Ibáñez, Víctor Viñals-Yúfera, José María Llabería. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, No. 1, pp. 75-91, 17 pages
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During lock hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively.
We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
Keywords: distributed shared memory multiprocessors synchronization buffer coherence controller request bypass
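The idea of request bypassing described in this abstract can be illustrated with a toy queue model. The sketch below is not the hardware mechanism from the paper, whose details are not given here; it only assumes that the coherence controller serves buffered requests to the lock line one at a time, and it compares how many requests are served ahead of the winning processor's request with and without bypassing. All names are hypothetical.

```python
from collections import deque


def service_order(buffered: list[str], winner: str, bypass: bool) -> list[str]:
    """Order in which a toy controller serves requests to the lock line.

    `buffered` holds requests already waiting in the controller;
    `winner` is a new request from the processor that holds the lock.
    """
    queue = deque(buffered)
    if bypass:
        queue.appendleft(winner)  # winner's request jumps ahead of buffered ones
    else:
        queue.append(winner)      # plain FIFO: winner waits behind everyone
    return list(queue)


def winner_wait(buffered: list[str], winner: str, bypass: bool) -> int:
    """Number of requests served before the winner's request."""
    return service_order(buffered, winner, bypass).index(winner)


if __name__ == "__main__":
    waiting = ["P1", "P2", "P3"]  # losers' requests buffered during hand-off
    print("FIFO wait:  ", winner_wait(waiting, "P0", bypass=False))
    print("Bypass wait:", winner_wait(waiting, "P0", bypass=True))
```

In this model the winner's wait drops from the length of the buffer to zero, which is the intuition behind the reported reduction in lock stall time; the real gains depend on timing and protocol details that the model ignores.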
Evaluation of Remote-I/O Support for a DSM-Based Computation Offloading Scheme
8
Authors: Yuhun Jun, Jaemin Lee, Euiseong Seo. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2017, No. 5, pp. 957-973, 17 pages
Computation offloading enables mobile devices to execute rich applications by using the abundant computing resources of powerful server systems. The distributed shared memory based (DSM-based) computation offloading approach is expected to be especially popular in the near future because it can dynamically migrate running threads to computing nodes and does not require any modifications of existing applications to do so. The current DSM-based computation offloading scheme, however, has focused on efficiently offloading computationally intensive applications and has not considered the significant performance degradation caused by processing the I/O requests issued by offloaded threads. Because most mobile applications are interactive and thus yield frequent I/O requests, efficient handling of I/O operations is critically important. In this paper, we quantitatively analyze the performance degradation caused by I/O processing in DSM-based computation offloading schemes using representative commodity applications. To remedy the performance degradation, we apply a remote I/O scheme based on remote device support to computation offloading. The proposed approach improves the execution time by up to 43.6% and saves up to 17.7% of energy consumption in comparison with the existing offloading schemes. Selective compression of the remote I/O scheme reduces the network traffic by up to 53.5%.
Keywords: computation offloading; mobile-cloud computing; distributed shared memory (DSM); mobile computing
NONH:A New Cache-Based Coherence Protocol for Linked List Structure DSM System and Its Performance Evaluation
9
Authors: 房至一, 鞠九滨. Journal of Computer Science & Technology (SCIE, EI, CSCD), 1996, No. 4, pp. 405-415, 11 pages
The management of memory coherence is an important problem in distributed shared memory (DSM) systems. In a cache-based coherent DSM system using a linked list structure, the key to maintaining coherence and improving system performance is how to manage the owner in the linked list. This paper presents the design of a new management protocol, NONH (New-Owner New-Head), and its performance evaluation. The analysis results show that this protocol can improve the scalability and performance of a coherent DSM system using a linked list. It is also suitable for managing cache coherency in tree-like hierarchical architectures.
Keywords: linked list; cache coherence; distributed shared memory