Abstract: A novel PCI Express (Peripheral Component Interconnect Express) direct memory access (DMA) transaction method using the bridge chip PEX 8311 is proposed. Furthermore, a new method for optimizing PCI Express DMA transactions by improving both bus efficiency and DMA efficiency is presented. A finite state machine (FSM) responsible for data and address cycles on the PCI Express bus is introduced, and a continuous data burst is realized, which greatly improves bus efficiency. On the software side, a driver framework based on the Windows Driver Model (WDM) and three DMA optimization options for the proposed PCI Express interface are presented to improve DMA efficiency. Experiments show that both the read and write hardware transaction speeds exceed the theoretical maximum speed of PCI (133 MB/s).
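For illustration only, the following C-style sketch mimics the kind of address/data-cycle state machine the abstract describes: one address phase followed by back-to-back data phases, so the burst is never broken up by re-arbitration. The real design would live in HDL on the PEX 8311 local-bus side; every state and function name here is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical states of a burst-DMA controller: idle, one address
 * cycle, then consecutive data cycles until the burst completes. */
typedef enum { FSM_IDLE, FSM_ADDRESS, FSM_DATA_BURST, FSM_DONE } fsm_state_t;

size_t run_burst(const uint32_t *src, uint32_t *dst, size_t words) {
  fsm_state_t state = FSM_IDLE;
  size_t transferred = 0;

  while (state != FSM_DONE) {
    switch (state) {
    case FSM_IDLE:            /* wait for a DMA request */
      state = words ? FSM_ADDRESS : FSM_DONE;
      break;
    case FSM_ADDRESS:         /* a single address cycle per burst */
      state = FSM_DATA_BURST;
      break;
    case FSM_DATA_BURST:      /* back-to-back data cycles, no re-arbitration */
      dst[transferred] = src[transferred];
      if (++transferred == words)
        state = FSM_DONE;
      break;
    default:
      state = FSM_DONE;
      break;
    }
  }
  return transferred;
}
```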
Funding: Supported by the NSFC (National Natural Science Foundation of China), the 863 Program (2006AA1332), ERIPKU, and the Program for New Century Excellent Talents in University.
Abstract: The speed of communication between nodes in a parallel processor has become the major bottleneck of the processor's performance. RDMA (Remote Direct Memory Access) technology has drawn increasing attention recently due to its ability to transfer large amounts of data with high speed and reliability. A 4DSP (4 Digital Signal Processing) module composed of Tiger-SHARC201 chips is connected by LVDS (Low Voltage Differential Signal) circuits. This paper proposes a general and reconfigurable RDMA platform and its corresponding communication protocol, with all routes linked on a zero-copy basis. The protocol transfers DSP messages through DMA interrupts and is applied to a massive remote-image workload, which reduces memory requirements and the working burden on the CPU. The experimental results show that this platform is efficient, flexible, and expandable enough to be integrated at a larger scale in the next development stages.
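As a rough illustration of the zero-copy idea mentioned above, the sketch below shows a DMA-completion interrupt handler that only publishes the address and length of a buffer the DMA engine has already filled, so no copy happens on the data path. The structures and names are hypothetical, not the paper's actual API.

```c
#include <stdint.h>
#include <stddef.h>

/* Descriptor handed from the interrupt handler to the consumer task:
 * only a pointer and a length, never the data itself. */
typedef struct {
  void   *buf;    /* buffer the DMA engine already filled */
  size_t  len;    /* number of bytes received */
} rx_desc_t;

#define RING_SIZE 64
static rx_desc_t ring[RING_SIZE];
static volatile unsigned head, tail;

/* Called from the DMA-completion interrupt. */
void dma_rx_isr(void *filled_buf, size_t len) {
  ring[head % RING_SIZE].buf = filled_buf;
  ring[head % RING_SIZE].len = len;
  head++;                        /* hand off by reference only */
}

/* Called by the consumer task; returns NULL if nothing is pending. */
rx_desc_t *next_message(void) {
  if (tail == head)
    return NULL;
  return &ring[tail++ % RING_SIZE];
}
```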
Abstract: High-speed data communication between the digital signal processor and the host is required to meet the demands of most real-time systems. PCI bus technology is a solution to this problem. The principle of data communication based on PCI is explained. Meanwhile, data transfer between synchronous dynamic RAM (SDRAM) and a mapping space of on-chip memory (L2) via expansion direct memory access (EDMA) has also been realized.
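A minimal, hedged sketch of the channel programming such an EDMA transfer involves is shown below: set the source (SDRAM) and destination (L2 mapping space) addresses and the word count, start the channel, and wait for completion. The register layout and names are placeholders, not the DSP's actual EDMA register map.

```c
#include <stdint.h>

/* Placeholder layout for one DMA channel's control registers. */
typedef struct {
  volatile uint32_t src;    /* source address (external SDRAM) */
  volatile uint32_t dst;    /* destination address (L2 mapping space) */
  volatile uint32_t count;  /* number of 32-bit words to move */
  volatile uint32_t ctrl;   /* bit 0: start, bit 1: done */
} dma_chan_t;

#define DMA_START (1u << 0)
#define DMA_DONE  (1u << 1)

void sdram_to_l2(dma_chan_t *ch, uint32_t sdram_addr,
                 uint32_t l2_addr, uint32_t words) {
  ch->src   = sdram_addr;
  ch->dst   = l2_addr;
  ch->count = words;
  ch->ctrl  = DMA_START;              /* kick off the transfer */
  while (!(ch->ctrl & DMA_DONE)) {
    /* busy-poll; a real driver would block on the completion interrupt */
  }
}
```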
Funding: Supported in part by the U.S. National Science Foundation under Grant No. CCF-2132049, a Google Research Award, and a Meta Faculty Research Award; and by the Expanse cluster at SDSC (San Diego Supercomputer Center) through allocation CIS210053 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the U.S. National Science Foundation under Grant Nos. 2138259, 2138286, 2138307, 2137603, and 2138296.
Abstract: Machine learning techniques have become ubiquitous in both industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process, as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in industry deep learning workloads, compare their performance using industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key takeaways and interesting observations. We believe our survey sheds light on potential research directions for future xCCL designs.
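As a concrete example of what these libraries provide, the sketch below performs a single-process, multi-GPU allreduce with NCCL, the collective used to sum gradients across GPUs after each backward pass. Error checking is omitted, buffers are left uninitialized for brevity, and the device count and buffer size are arbitrary.

```c
#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
  int ndev = 2;
  int devs[2] = {0, 1};
  ncclComm_t comms[2];
  cudaStream_t streams[2];
  float *sendbuf[2], *recvbuf[2];
  size_t count = 1 << 20;                 /* elements per GPU */

  /* One communicator per GPU within a single process. */
  ncclCommInitAll(comms, ndev, devs);

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Sum-reduce the (here uninitialized) gradient buffers across GPUs. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```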
Funding: This work was supported by the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2020B0101390001, the National Natural Science Foundation of China under Grant Nos. 61772265 and 62072228, the Fundamental Research Funds for the Central Universities of China, the Collaborative Innovation Center of Novel Software Technology and Industrialization of Jiangsu Province of China, and the Jiangsu Innovation and Entrepreneurship (Shuangchuang) Program of China.
Abstract: Remote direct memory access (RDMA) has become one of the state-of-the-art high-performance network technologies in datacenters. The reliable transport of RDMA is designed on the assumption of a lossless underlying network and cannot endure a high packet loss rate. However, besides switch buffer overflow, there is another kind of packet loss in RDMA networks, i.e., packet corruption, which has not been discussed in depth. Packet corruption incurs long application tail latency by causing timeout retransmissions. The challenges in addressing packet corruption in RDMA networks are: 1) packet corruption is inevitable even with remedial mechanisms, and 2) RDMA hardware is not programmable. This paper proposes designs that can guarantee the expected tail latency of applications in the presence of packet corruption. The key idea is to control the probability of timeout events caused by packet corruption by transforming timeout retransmissions into out-of-order retransmissions. We build a probabilistic model to estimate the occurrence probabilities and real effects of the corruption patterns. We implement the two mechanisms with the help of programmable switches and the zero-byte message RDMA feature. We build an ns-3 simulation and implement the optimization mechanisms on our testbed. The simulation and testbed experiments show that the optimizations can decrease the flow completion time by several orders of magnitude with less than 3% bandwidth cost at different packet corruption rates.
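To make the role of such a probabilistic model concrete, here is a toy calculation under assumptions of ours (not necessarily the paper's model): packets are corrupted independently with probability p, and a timeout retransmission occurs only when one of the last k packets of a message is corrupted, because no later packet arrives to expose the gap as an out-of-order event.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative only: P(timeout) = 1 - (1 - p)^k under the simplifying
 * assumptions described in the text above. */
static double timeout_probability(double p, int k) {
  return 1.0 - pow(1.0 - p, k);
}

int main(void) {
  double corruption_rates[] = {1e-6, 1e-5, 1e-4};
  for (int i = 0; i < 3; ++i) {
    double p = corruption_rates[i];
    printf("p=%.0e  P(timeout per message, k=4 tail packets)=%.3e\n",
           p, timeout_probability(p, 4));
  }
  return 0;
}
```

The toy numbers show why converting tail-packet losses into out-of-order retransmissions matters: the rare timeout events, not the average corruption rate, dominate tail latency.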
Funding: Project supported by the Second Brain Korea 21 Project and Samsung Electronics.
Abstract: In this paper, we propose a fast and simple system emulator, called a system performance emulator (SPE), to evaluate long read operations. The SPE estimates how much system-wide performance is enhanced by using a faster solid-state disk (SSD). By suspending the CPU for a certain time during direct memory access (DMA) transfers and subtracting this suspended time from the total DMA time, the SPE estimates the improvement in system performance expected from an enhanced SSD prior to its manufacture. We also examine the relation between storage performance and system performance using the SPE.
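One plausible reading of that arithmetic, with made-up numbers purely for illustration, is sketched below: the emulator hides part of the measured DMA time to model a faster, not-yet-built SSD, and the ratio of per-request times gives the projected system-level speedup.

```c
#include <stdio.h>

/* Back-of-the-envelope illustration of the SPE idea described above.
 * All values are invented; this is our interpretation of the abstract,
 * not the paper's actual emulator. */
int main(void) {
  double measured_dma_us = 400.0;  /* DMA time on the current SSD */
  double suspend_us      = 250.0;  /* time the emulator "hides" */
  double cpu_us          = 150.0;  /* per-request CPU work */

  double current_req_us  = cpu_us + measured_dma_us;
  double emulated_req_us = cpu_us + (measured_dma_us - suspend_us);

  printf("projected system-level speedup: %.2fx\n",
         current_req_us / emulated_req_us);
  return 0;
}
```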