Cloudless-Training: a framework to improve efficiency of geo-distributed ML training based on serverless
Abstract: Geo-distributed machine learning (ML) training combines cloud resources from multiple regions and can serve many emerging ML scenarios (e.g., large model training, federated learning). However, its efficiency is limited by two challenges. First, multi-regional cloud resources usually lack effective elastic scheduling, which hurts resource utilization and training performance. Second, synchronizing models across regions requires frequent communication over the wide area network (WAN), whose low bandwidth and high fluctuation make communication the main overhead. This paper proposes Cloudless-Training, a framework that realizes efficient geo-distributed ML training in three aspects. First, it is built on the serverless computing model and uses a two-layer architecture, with a control plane and a physical training plane, to support elastic scheduling and communication across multi-regional clouds. Second, it provides an elastic scheduling strategy that deploys training workflows adaptively according to the heterogeneity of available cloud resources and the distribution of pre-existing training datasets. Third, it provides two synchronization strategies for training partitions among clouds: asynchronous stochastic gradient descent with gradient accumulation (ASGD-GA) and inter-parameter server (PS) model averaging (MA). Cloudless-Training is implemented with OpenFaaS and evaluated on Tencent Cloud. Experimental results show that it supports general ML training in a geo-distributed way and substantially improves resource utilization (training cost reduced by 9.2% to 24.0%) and synchronization efficiency (training up to 1.7 times faster than the baseline) while preserving the converged accuracy of the model.
Authors: TAN Wenting; LV Cunchi; SHI Xiao; ZHAO Xiaofang (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; Nanjing Institute of Information SuperBahn, Nanjing 211135; Institute of Intelligent Computing Technology, Suzhou, Chinese Academy of Sciences, Suzhou 215028)
Source: Chinese High Technology Letters (《高技术通讯》; indexed in CAS and the Peking University Core Journals list), 2024, No. 3, pp. 219-232 (14 pages)
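Funding: Supported by the National Key Research and Development Program of China (2021YFF0703800) and the Guanghe Fund (光合基金), Category B (202302028357).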
Keywords: geo-distributed machine learning (ML) training; cross-cloud ML training; distributed training framework; serverless; cross-cloud model synchronization
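The abstract names two cross-cloud synchronization ideas, gradient accumulation and model averaging between parameter servers, without giving their details. The following is only a generic sketch of those two ideas in plain Python, not the paper's ASGD-GA or MA algorithms: all function names are hypothetical, parameters are flat lists of floats, and the asynchronous scheduling itself is not modeled.

```python
from typing import List

Vector = List[float]


def accumulate_gradients(grads: List[Vector]) -> Vector:
    """Sum a window of local gradients into a single update.

    Sending one accumulated update per window, instead of one per step,
    is what reduces the number of cross-cloud (WAN) transfers.
    """
    total = [0.0] * len(grads[0])
    for g in grads:
        for i, v in enumerate(g):
            total[i] += v
    return total


def apply_update(params: Vector, grad: Vector, lr: float) -> Vector:
    """Plain SGD step: params - lr * grad."""
    return [p - lr * g for p, g in zip(params, grad)]


def average_models(models: List[Vector]) -> Vector:
    """Element-wise mean of the parameter vectors held by each cloud's PS."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]


def wan_messages(local_steps: int, window: int) -> int:
    """Cross-cloud sends needed when one send covers a whole window."""
    return -(-local_steps // window)  # ceiling division


if __name__ == "__main__":
    acc = accumulate_gradients([[1.0, 2.0], [3.0, 4.0]])
    params = apply_update([10.0, 10.0], acc, lr=0.5)
    print(acc, params)
    print(average_models([params, [4.0, 5.0]]))
    print(wan_messages(local_steps=10, window=4))
```

In this toy setting, a window of 4 turns 10 local steps into 3 cross-cloud sends rather than 10, which illustrates why accumulation helps on a low-bandwidth WAN; how the paper chooses the window or the averaging interval is not stated in the abstract.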