Deep learning algorithms have been widely used in computer vision, natural language processing, and other fields. However, as deep learning models keep growing in scale, their storage and computing requirements keep rising, and processors based on the von Neumann architecture have gradually exposed significant shortcomings such as high power consumption and long latency. To alleviate this problem, large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model. This paper proposes a near-memory computing array architecture based on a shared buffer to improve system performance. The architecture supports instructions that integrate storage and computation, which reduces data movement between the processor and main memory, and data reuse further improves the processing speed of the algorithm. The proposed architecture is verified and tested through a parallel implementation of the convolutional neural network (CNN) algorithm. The experimental results show that, at a frequency of 110 MHz, the calculation speed of a single convolution operation increases by 66.64% on average compared with a CNN architecture that performs parallel calculations on a field programmable gate array (FPGA). The processing speed of the whole convolution layer improves by 8.81% compared with a reconfigurable array processor that does not support near-memory computing.
As the number of cores in a multicore system increases, the communication pressure on the interconnection network also increases. The network-on-chip (NoC) architecture is expected to take on the ever-expanding communication demands triggered by the ever-increasing number of cores. The communication behavior of the NoC architecture exhibits significant spatial–temporal variation, posing a considerable challenge for NoC reconfiguration. In this paper, we propose a traffic-oriented reconfigurable NoC with augmented inter-port buffer sharing that adapts to varying traffic flows with high flexibility. First, a modified input port is introduced to support buffer sharing between adjacent ports; this input port can be dynamically reconfigured to react to on-demand traffic. Second, we find that centralized output-oriented buffer management works well with the reconfigurable input ports. Finally, the reconfiguration method can be implemented with a low-overhead hardware design that does not impose a great burden on the system implementation. The experimental results show that, compared with other proposals, the proposed NoC architecture greatly reduces packet latency and improves saturation throughput without incurring significant area and power overhead.
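The idea of buffer sharing between adjacent input ports can be illustrated with a small software model. The sketch below is only an assumption-laden illustration, not the hardware design from the abstract: the class name `SharedPortBuffers`, the ring arrangement of ports, and the "own slots first, then neighbours" policy are all hypothetical.

```python
from collections import deque


class SharedPortBuffers:
    """Hypothetical model: each input port owns a few buffer slots and may
    borrow free slots from its two neighbouring ports (ports form a ring)."""

    def __init__(self, num_ports=4, slots_per_port=2):
        self.num_ports = num_ports
        self.slots_per_port = slots_per_port
        # Logical queue per port; each entry remembers whose slot it occupies.
        self.queues = [deque() for _ in range(num_ports)]
        # Number of occupied physical slots belonging to each port.
        self.used = [0] * num_ports

    def _neighbours(self, port):
        return [(port - 1) % self.num_ports, (port + 1) % self.num_ports]

    def enqueue(self, port, flit):
        """Store a flit for `port`, using its own slots first and then any
        free slot of an adjacent port. Returns False if nothing is free."""
        for owner in [port] + self._neighbours(port):
            if self.used[owner] < self.slots_per_port:
                self.used[owner] += 1
                self.queues[port].append((flit, owner))
                return True
        return False

    def dequeue(self, port):
        """Remove the oldest flit of `port` and free the slot it occupied."""
        if not self.queues[port]:
            return None
        flit, owner = self.queues[port].popleft()
        self.used[owner] -= 1
        return flit


if __name__ == "__main__":
    buffers = SharedPortBuffers(num_ports=4, slots_per_port=2)
    # A burst on port 0 fills its own two slots, then borrows from ports 3 and 1.
    accepted = [buffers.enqueue(0, f"flit{i}") for i in range(7)]
    print(accepted)
```

In this toy model a bursty port can hold up to three ports' worth of flits, which is the kind of flexibility the abstract attributes to inter-port sharing. The cost is that a neighbour may find its own slots occupied.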
In this paper, we analyze the queueing behaviour of a wavelength division multiplexing (WDM) Internet router employing the partial buffer sharing (PBS) mechanism with self-similar traffic input. In view of WDM technology in networking, each output port of the router is modelled as a multi-server queueing system. PBS is a promising mechanism for guaranteeing quality of service (QoS) in the broadband integrated services digital network (B-ISDN). As the Markov modulated Poisson process (MMPP) emulates self-similar Internet traffic, we use an MMPP as the input process of the queueing system to investigate the queueing behaviour of the router. Because network traffic is in general asynchronous (unslotted) and of variable packet length, service times (packet lengths) are assumed to follow an Erlang-k distribution, which is relatively general compared with the deterministic and exponential distributions. Hence, a specific output port of the router is modelled as an MMPP/Ek/s/C queueing system. The long-term performance measures, namely the high-priority and low-priority packet loss probabilities, and the short-term performance measures, namely the mean lengths of critical and non-critical periods, are computed against the system and traffic parameters by means of matrix-geometric methods and an approximate Markovian model. This kind of analysis is useful in dimensioning a router that employs the PBS mechanism under self-similar traffic input to provide differentiated services (DiffServ) and QoS guarantees.
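As a rough illustration of the PBS admission rule, the sketch below uses the usual threshold formulation: low-priority packets are accepted only while occupancy is below a threshold, and high-priority packets are accepted until the buffer is full. The function names, the event encoding, and the single occupancy counter are assumptions made for illustration. The sketch does not reproduce the MMPP/Ek/s/C model or its matrix-geometric analysis.

```python
def pbs_admit(occupancy, capacity, threshold, high_priority):
    """Partial buffer sharing: high-priority packets may use the whole
    buffer, low-priority packets only the part below `threshold`."""
    limit = capacity if high_priority else threshold
    return occupancy < limit


def simulate(events, capacity, threshold):
    """Replay a hypothetical event trace of "high", "low" (arrivals) and
    "depart" (one packet leaves). Returns final occupancy and loss counts."""
    occupancy = 0
    lost = {"high": 0, "low": 0}
    for event in events:
        if event == "depart":
            occupancy = max(0, occupancy - 1)
        elif pbs_admit(occupancy, capacity, threshold, event == "high"):
            occupancy += 1
        else:
            lost[event] += 1
    return occupancy, lost


if __name__ == "__main__":
    trace = ["low", "low", "low", "high", "high", "high", "depart", "low"]
    print(simulate(trace, capacity=4, threshold=2))
```

With `capacity=4` and `threshold=2`, the third low-priority arrival is dropped while high-priority packets are still admitted. This is the differentiated loss behaviour that the loss-probability measures in the abstract quantify.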
This paper proposes a shared buffer memory ATM access switch. Such switches have significant benefits over crossbar or bus-based switches because their output buffer memory is shared by all the switch output ports and is allotted to a particular output port as the occasion demands. The buffer allocation scheme used in the ATM switch is partial sharing, which is a trade-off between complete sharing and dedicated allocation. In addition, the queuing structures used in the shared memory are independent of both the data path through the switch and the cell scheduling mechanism. The method for queue management is simple and effective.
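One common way to realize partial sharing is to give every output port a small reserved quota and let all ports compete for a common pool. The sketch below assumes that form. The abstract does not specify the switch's actual allocation rule, so the class `PartialSharingBuffer` and its parameters are hypothetical.

```python
class PartialSharingBuffer:
    """Hypothetical partial-sharing allocator for a shared cell memory.

    reserved_per_port == 0  -> behaves like complete sharing
    shared_cells == 0       -> behaves like dedicated allocation
    """

    def __init__(self, num_ports, reserved_per_port, shared_cells):
        self.reserved_per_port = reserved_per_port
        self.shared_free = shared_cells
        self.reserved_used = [0] * num_ports
        self.shared_used = [0] * num_ports

    def store_cell(self, port):
        """Try to buffer one cell for `port`; returns False if it is dropped."""
        if self.reserved_used[port] < self.reserved_per_port:
            self.reserved_used[port] += 1
            return True
        if self.shared_free > 0:
            self.shared_free -= 1
            self.shared_used[port] += 1
            return True
        return False

    def release_cell(self, port):
        """Free one cell of `port`, returning shared cells to the pool first."""
        if self.shared_used[port] > 0:
            self.shared_used[port] -= 1
            self.shared_free += 1
        elif self.reserved_used[port] > 0:
            self.reserved_used[port] -= 1
        else:
            raise ValueError(f"port {port} holds no cells")


if __name__ == "__main__":
    buf = PartialSharingBuffer(num_ports=2, reserved_per_port=1, shared_cells=2)
    print([buf.store_cell(0) for _ in range(4)])  # port 0 takes its quota plus the pool
    print(buf.store_cell(1))                      # port 1 still has its reserved cell
```

The reserved quota keeps one busy port from starving the others, while the pool absorbs bursts. That is the trade-off between the two extremes named in the abstract.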
OpenCL programming provides full code portability between different hardware platforms and is a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between devices and managing data transfer between them. These tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid (MSI) cache coherency protocol is implemented in DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy that minimizes communication cost between devices by launching each necessary data transfer as early as possible, which enables data transfer to overlap with kernel execution. Our experimental results show that our method for buffer access range analysis is widely applicable and that DSOM is highly efficient.
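The modified/shared/invalid bookkeeping for one shared buffer can be sketched as a small state machine. This is a textbook-style MSI model under stated assumptions, not DSOM's implementation. The class `SharedBufferDirectory`, the device names, and the choice to track whole buffers rather than analysed access ranges are all hypothetical. The only detail taken from the abstract is that the home copy lives in system ("host") memory.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class SharedBufferDirectory:
    """Tracks the MSI state of one shared buffer on every device and logs
    the copies that would be needed (as (source, destination) pairs)."""

    def __init__(self, devices):
        self.state = {dev: State.INVALID for dev in devices}
        self.transfers = []

    def _write_back_owner(self, requester):
        """If another device holds a modified copy, copy it back to the host."""
        for dev, st in self.state.items():
            if dev != requester and st is State.MODIFIED:
                self.transfers.append((dev, "host"))
                return dev
        return None

    def read(self, device):
        """A kernel on `device` is about to read the buffer."""
        if self.state[device] is not State.INVALID:
            return  # cached copy is valid, no transfer needed
        owner = self._write_back_owner(device)
        if owner is not None:
            self.state[owner] = State.SHARED
        self.transfers.append(("host", device))
        self.state[device] = State.SHARED

    def write(self, device):
        """A kernel on `device` is about to write the buffer."""
        if self.state[device] is State.MODIFIED:
            return
        self._write_back_owner(device)
        if self.state[device] is State.INVALID:
            # Assumes a partial write, so the current contents are fetched first.
            self.transfers.append(("host", device))
        for dev in self.state:
            if dev != device:
                self.state[dev] = State.INVALID
        self.state[device] = State.MODIFIED


if __name__ == "__main__":
    directory = SharedBufferDirectory(["gpu0", "gpu1"])
    directory.write("gpu0")
    directory.read("gpu1")
    directory.write("gpu1")
    print(directory.transfers)
    print({dev: st.value for dev, st in directory.state.items()})
```

Because each transfer is known as soon as the next kernel's reads and writes are known, a runtime built on such a table could issue the copies early and overlap them with kernel execution, in line with the early-launch strategy described in the abstract.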
Funding (near-memory computing array architecture): Supported by the National Natural Science Foundation of China (Nos. 61802304, 61834005, 61772417, 61602377) and the Shaanxi Province Key R&D Plan (No. 2021GY-029).
Funding (traffic-oriented reconfigurable NoC): Project supported by the National Natural Science Foundation of China (No. 62374049).
Funding (distributed shared OpenCL memory): Project supported by the National Natural Science Foundation of China (Nos. 61033008, 61272145, 60903041, and 61103080), the Research Fund for the Doctoral Program of Higher Education of China (No. 20104307110002), the Hunan Provincial Innovation Foundation for Postgraduate (No. CX2010B028), and the Fund of Innovation in Graduate School of NUDT (Nos. B100603 and B120605), China.