xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning
1
Authors: Adam Weingram, 李雨珂, 戚昊, Darren Ng, 代柳瑶, 鲁小亿. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 1, pp. 166-195 (30 pages)
Machine learning techniques have become ubiquitous both in industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in the industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key takeaways and interesting observations. We believe our survey sheds light on potential research directions of future designs for xCCLs.
Keywords: collective; deep learning; distributed training; GPUDirect; RDMA (remote direct memory access)
Analyzing and Optimizing Packet Corruption in RDMA Network
2
Authors: 高翼枭, 田臣, 陈伟, 李多星, 闫健, 龚媛媛, 王炳权, 吴涛, 韩磊, 齐法制, 曾珊, 窦万春, 陈贵海. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2022, Issue 4, pp. 743-762 (20 pages)
Remote direct memory access (RDMA) has become one of the state-of-the-art high-performance network technologies in datacenters. The reliable transport of RDMA is designed based on a lossless underlying network and cannot endure a high packet loss rate. However, besides switch buffer overflow, there is another kind of packet loss in the RDMA network, i.e., packet corruption, which has not been discussed in depth. Packet corruption incurs long application tail latency by causing timeout retransmissions. The challenges to solving packet corruption in the RDMA network include: 1) packet corruption is inevitable with any remedial mechanisms and 2) RDMA hardware is not programmable. This paper proposes some designs which can guarantee the expected tail latency of applications in the presence of packet corruption. The key idea is controlling the occurrence probabilities of timeout events caused by packet corruption through transforming timeout retransmissions into out-of-order retransmissions. We build a probabilistic model to estimate the occurrence probabilities and real effects of the corruption patterns. We implement these two mechanisms with the help of programmable switches and the zero-byte message RDMA feature. We build an ns-3 simulation and implement optimization mechanisms on our testbed. The simulation and testbed experiments show that the optimizations can decrease the flow completion time by several orders of magnitude with less than 3% bandwidth cost at different packet corruption rates.
Keywords: datacenter network; packet corruption; programmable switch; remote direct memory access (RDMA)