
Padding Load: A Workload for Reducing Cluster Resource Waste and Deep Learning Training Costs
Abstract: In recent years, large-scale models have achieved remarkable success in domains such as bioinformatics, natural language processing, and computer vision. However, these models require substantial computational resources during training and inference, resulting in high computational costs. At the same time, computing clusters suffer from a supply-demand imbalance, manifested as low resource utilization and difficult task scheduling. To address this problem, the concept of Padding Load is introduced: a workload that uses a cluster's idle resources for computation. The resources allocated to a Padding Load may be preempted by other workloads at any time, but because the Padding Load runs at a lower resource priority, its resource cost is correspondingly low. PaddingTorch, a distributed deep learning training framework tailored to Padding Load, is designed for this purpose. Based on trace data from the Alibaba PAI cluster, job scheduling is simulated on four GPUs over the four GPU time segments with the most frequent task switching, and PaddingTorch is used to train a protein complex prediction program as a Padding Load. Training takes 2.8 times as long as with exclusive resources, but the training cost is reduced by 84%, and GPU resource utilization rises by 25.8% during the periods filled by the Padding Load.
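The abstract does not specify PaddingTorch's interface, so the sketch below only illustrates the general pattern that a Padding Load implies: a low-priority PyTorch training job that checkpoints and yields when the scheduler reclaims its GPU, then resumes from the checkpoint on restart. The SIGTERM preemption notice, checkpoint path, model, and data are all assumptions made for illustration, not the paper's actual API.

# Minimal sketch of preemption-tolerant training (not the PaddingTorch API).
# Assumption: the cluster sends SIGTERM before reclaiming the GPU; the loop
# then checkpoints at a step boundary and exits, and a later restart resumes.
import os
import signal
import torch
import torch.nn as nn

CKPT_PATH = "padding_ckpt.pt"   # hypothetical checkpoint location
preempted = False

def on_preempt(signum, frame):
    # Record that the scheduler wants the GPU back; the loop checks this
    # flag between steps so the saved state is always consistent.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, on_preempt)

model = nn.Linear(128, 1)                       # stand-in for the real model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
start_step = 0

# Resume if an earlier run of this job was preempted mid-training.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 10_000):
    x = torch.randn(32, 128)                    # placeholder batch
    loss = model(x).pow(2).mean()               # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()

    if preempted:
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT_PATH)
        break                                   # yield the GPU to the preemptor

Checkpointing only at step boundaries keeps model and optimizer state consistent; a framework covering the distributed case, as PaddingTorch is described to do, would additionally need to rebuild the process group whenever the set of available GPUs changes.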
Authors: 杜昱 (DU Yu), 俞子舒 (YU Zishu), 彭晓晖 (PENG Xiaohui), 徐志伟 (XU Zhiwei) (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2024, Issue 9, pp. 71-79 (9 pages)
Funding: Beijing Natural Science Foundation (4212027); National Natural Science Foundation of China (62072434).
Keywords: Deep learning; Distributed training; Resource utilization; Computing cluster; Programming framework