Abstract
Running deep learning jobs and applications on Kubernetes clusters is now widely adopted in industry, including by many internet companies, and is deployed at large scale. However, distributed deep learning training on Kubernetes clusters exposes shortcomings of the default Kubernetes scheduler and leaves considerable room for optimization. This paper optimizes the default Kubernetes task scheduling algorithm with respect to the cluster's network topology and the characteristics of distributed deep learning training. Distributed deep learning training jobs are scheduled on a Kubernetes cluster using Gang-scheduling together with the network-topology-optimized scheduling algorithm. Verification tests show that, compared with the default Kubernetes scheduling algorithm, the optimized algorithm makes effective use of the network topology, achieves the expected performance, and noticeably increases the speed of distributed training.
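The abstract names two ideas, Gang-scheduling and network-topology-aware placement, without giving the algorithm itself. The following Python sketch only illustrates those two ideas in general terms and is not the authors' method. The function name `gang_schedule`, the slot-based resource model, the node-to-switch map, and the greedy "fullest switch first" heuristic are all assumptions made for illustration.

```python
from collections import defaultdict


def gang_schedule(num_workers, free_slots, node_switch):
    """Place all workers of a job or none of them, preferring few switches.

    free_slots:  dict mapping node name -> free worker slots (hypothetical model)
    node_switch: dict mapping node name -> switch id (hypothetical topology label)
    Returns a list with one node name per worker, or None if the job does not fit.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")

    # Gang constraint: refuse the whole job unless every worker can be placed.
    if sum(free_slots.values()) < num_workers:
        return None

    # Group usable nodes by the switch they hang off.
    by_switch = defaultdict(list)
    for node, slots in free_slots.items():
        if slots > 0:
            by_switch[node_switch[node]].append(node)

    def capacity(switch):
        return sum(free_slots[n] for n in by_switch[switch])

    # Greedy heuristic (an assumption): fill the switch with the most free
    # slots first, so the job tends to span as few switches as possible.
    placement = []
    for switch in sorted(by_switch, key=lambda s: (-capacity(s), str(s))):
        for node in sorted(by_switch[switch], key=lambda n: (-free_slots[n], n)):
            take = min(free_slots[node], num_workers - len(placement))
            placement.extend([node] * take)
            if len(placement) == num_workers:
                return placement
    return None  # safeguard; not reached once the capacity check has passed
```

In this sketch a job is never partially placed: either every worker receives a node or the function returns `None`, which is the all-or-nothing property that Gang-scheduling refers to. A production scheduler would also account for GPU, CPU, and memory requests and for measured link bandwidth, none of which are modeled here.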
Authors
陈培
王超
王德奎
张东
房体盈
CHEN Pei;WANG Chao;WANG De-kui;ZHANG Dong;FANG Ti-ying
Source
《信息技术与信息化》
2019, Issue 9, pp. 109-113 (5 pages)
Information Technology and Informatization
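Funding
2017 Quancheng Industry Leading Talent (Innovation Team) program: Research and Development of a Networked Operating System for Cloud Computing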