Abstract
When training large-scale models, distributed training is an effective approach to handling model parameters and datasets that are too large for a single GPU or a single node. By distributing the training task across multiple nodes, distributed model training exploits computational resources in parallel, thereby improving training efficiency. However, as model scale grows rapidly, communication becomes a bottleneck that limits the performance of distributed training. In recent years, many researchers have studied the communication issues in distributed training in depth. This paper provides a comprehensive review of this research, analyzing the communication problems in distributed training from five different perspectives and summarizing the corresponding optimization methods, including but not limited to communication topology optimization, gradient compression techniques, synchronous and asynchronous algorithms, overlapping communication with computation, and optimization of communication libraries and hardware. Finally, the paper analyzes and outlines future research directions.
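To give a concrete flavor of one of the optimization directions named in the abstract (this sketch is not taken from the paper itself), a minimal top-k gradient sparsification routine in PyTorch might look as follows; the function names, the 1% keep ratio, and the round-trip demo are illustrative assumptions.

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    # Keep only the largest-magnitude fraction of gradient entries, so each
    # worker can send (values, indices) instead of the full dense gradient.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values: torch.Tensor, indices: torch.Tensor,
                    shape: torch.Size) -> torch.Tensor:
    # Rebuild a dense gradient, with zeros at the positions that were dropped.
    flat = torch.zeros(shape.numel(), dtype=values.dtype)
    flat[indices] = values
    return flat.view(shape)

if __name__ == "__main__":
    g = torch.randn(1024, 1024)
    values, indices, shape = topk_compress(g, ratio=0.01)
    g_hat = topk_decompress(values, indices, shape)
    # Roughly 1% of the entries survive; communication volume shrinks accordingly.
    print(values.numel(), "of", g.numel(), "entries transmitted")
```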
Authors
ZHAO Haiyan, YI Qingao, TANG Jinghua, QIAN Shiyou, CAO Jian
(Shanghai Key Lab of Modern Optical System, Engineering Research Center of Optical Instrument and System, Ministry of Education, University of Shanghai for Science and Technology, Shanghai 200093, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2024, No. 12, pp. 2964-2978 (15 pages)
Funding
Supported by the Shanghai Science and Technology Commission Science and Technology Innovation Plan (21511104700).
Keywords
large-scale model
distributed training
parallel
communication optimization