Abstract
When training large-scale models, distributed training is an effective approach to handling model parameters and datasets that are too large for a single GPU or a single node. By distributing the training task across multiple nodes, distributed model training exploits computational resources in parallel, thereby improving training efficiency. However, as model scale grows rapidly, communication becomes a bottleneck that limits the performance of distributed training. In recent years, many researchers have studied the communication issues in distributed training in depth. This paper provides a comprehensive review of this research, analyzing the communication problems in distributed training from five different perspectives and summarizing the corresponding optimization methods, including but not limited to communication topology optimization, gradient compression techniques, synchronous and asynchronous algorithms, overlapping communication with computation, and optimization of communication libraries and hardware. Finally, the paper analyzes and outlines future research directions.
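To give a concrete flavor of one of the optimization directions named in the abstract (this sketch is not taken from the paper itself), a minimal top-k gradient sparsification routine in PyTorch might look as follows; the function names, the 1% keep ratio, and the round-trip demo are illustrative assumptions.

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    # Keep only the largest-magnitude fraction of gradient entries, so each
    # worker can send (values, indices) instead of the full dense gradient.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values: torch.Tensor, indices: torch.Tensor,
                    shape: torch.Size) -> torch.Tensor:
    # Rebuild a dense gradient, with zeros at the positions that were dropped.
    flat = torch.zeros(shape.numel(), dtype=values.dtype)
    flat[indices] = values
    return flat.view(shape)

if __name__ == "__main__":
    g = torch.randn(1024, 1024)
    values, indices, shape = topk_compress(g, ratio=0.01)
    g_hat = topk_decompress(values, indices, shape)
    # Roughly 1% of the entries survive; communication volume shrinks accordingly.
    print(values.numel(), "of", g.numel(), "entries transmitted")
```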
Authors
ZHAO Haiyan, YI Qingao, TANG Jinghua, QIAN Shiyou, CAO Jian
(Shanghai Key Lab of Modern Optical System, Engineering Research Center of Optical Instrument and System, Ministry of Education, University of Shanghai for Science and Technology, Shanghai 200093, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2024, No. 12, pp. 2964-2978 (15 pages)
Funding
Supported by the Shanghai Science and Technology Commission Science and Technology Innovation Plan (21511104700).
Keywords
large-scale model
distributed training
parallel
communication optimization