Abstract: The server-network unified engine (UNE) is the basic building block of a cloud server network. A UNE needs to provide multiple PCIe interfaces to connect compute units and multiple Ethernet ports for interconnection or for reaching external networks; it needs to support attaching multiple hosts and PCIe devices (such as GPUs and SSDs); it needs to support Ethernet switching for interconnection and hardware offload of virtual networks; it needs to support RoCE or iWARP for low-latency services while keeping its own forwarding latency low; and it needs to support NVMe over Fabrics and GPUDirect so that state-of-the-art storage and high-performance computing can be moved to the cloud more effectively. Multiple UNEs can be directly interconnected in whatever topology is required (first-level ring, second-level ring, or mesh) to form a forwarding array; the array's topology management and path management can be built autonomously or controlled by an SDN Controller, and the array can provide external bandwidth on demand. In short, a local server network is interconnected through the Ethernet switches on the UNEs, and larger networks are interconnected through the data center's Ethernet switches.
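To make the forwarding-array idea concrete, here is a minimal, hypothetical Python sketch (not from the paper) that builds ring and mesh topologies over a few UNE nodes and selects a forwarding path by hop count; the node names and the hop-count metric are illustrative assumptions only.

```python
# Toy model of a UNE forwarding array: hypothetical node names, hop-count
# path metric. Shows how ring or mesh interconnects among UNEs could be
# built and how a path manager might pick forwarding routes.
from collections import deque
from itertools import combinations


def ring_topology(nodes):
    """Connect each UNE to its next neighbour, closing the loop (single ring)."""
    return {(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))}


def mesh_topology(nodes):
    """Connect every pair of UNEs directly (full mesh)."""
    return set(combinations(nodes, 2))


def shortest_path(links, src, dst):
    """Hop-count shortest path, a stand-in for the array's path management."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    unes = ["une0", "une1", "une2", "une3"]
    print("ring:", shortest_path(ring_topology(unes), "une0", "une2"))
    print("mesh:", shortest_path(mesh_topology(unes), "une0", "une2"))
```

In an actual deployment this role would be played by the array's built-in topology/path management or by the SDN Controller described above.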
Funding: Supported in part by the U.S. National Science Foundation under Grant No. CCF-2132049, a Google Research Award, and a Meta Faculty Research Award; and by the Expanse cluster at SDSC (San Diego Supercomputer Center) through allocation CIS210053 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the U.S. National Science Foundation under Grant Nos. 2138259, 2138286, 2138307, 2137603, and 2138296.
Abstract: Machine learning techniques have become ubiquitous in both industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process, as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key takeaways and interesting observations. We believe our survey sheds light on potential research directions for future xCCL designs.
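As a concrete illustration of the collective pattern these libraries provide, the following sketch uses PyTorch's torch.distributed front end to all-reduce a toy gradient tensor. It runs as a single Gloo-backed process so it needs no GPUs; a real multi-node training job would use backend="nccl" (or another xCCL) with one rank per GPU. The tensor values and the port choice are illustrative assumptions, not taken from the paper.

```python
# Minimal all-reduce sketch: the same collective step that shares gradients
# between processing nodes during distributed training.
import os

import torch
import torch.distributed as dist

# Point the default rendezvous at the local machine (assumed free port).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Single-process group for demonstration; on a cluster, backend="nccl"
# and world_size equals the total number of ranks.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Pretend these are locally computed gradients; all_reduce sums them
# across all ranks so every rank ends up with the same aggregated values.
grads = torch.tensor([0.1, 0.2, 0.3])
dist.all_reduce(grads, op=dist.ReduceOp.SUM)

print("reduced gradients:", grads)
dist.destroy_process_group()
```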