Abstract
With the rapid development of large models, demand for intelligent computing is growing far faster than chip performance is improving. Computing cluster solutions and the concept of "DC as a Computer" have emerged in response, making the data center network particularly important. During both training and inference of large models, clusters place extremely high demands on network stability. Based on the characteristics of large-model workloads and drawing on mainstream cluster networking technologies, this paper studies new-generation intelligent computing center network technologies for training scenarios, targeting ultra-large-scale networking, ultra-high throughput, and ultra-high stability. For inference scenarios, it studies how to build a high-quality compute-access network using SDN+SRv6 programmable, computing-network-integrated intelligent scheduling and slicing technologies. It also examines the technical difficulties of collaborative training across DCs and the corresponding solutions.
Authors
Chen Bin (陈斌)
Pei Pei (裴培)
Xu Peng (许鹏)
(Intelligent Network & Innovation Center of China Unicom, Beijing 100048, China; China Information Technology Designing & Consulting Institute Co., Ltd., Beijing 100048, China)
Source
Designing Techniques of Posts and Telecommunications (《邮电设计技术》)
2024, No. 9, pp. 1-6 (6 pages)
Keywords
WAN
Intelligent computing center network
Bandwidth pooling
Cross-DC collaborative training