Abstract
This paper presents the design and implementation of a distributed TensorFlow platform based on Kubernetes. Distributed TensorFlow suffers from complex environment configuration, uneven distribution of the underlying physical resources, low training efficiency, and long model development cycles. To address these problems, the paper proposes a method for containerizing TensorFlow and uses a Kubernetes-based container PaaS platform to schedule and manage the TensorFlow containers in a unified way. The approach combines the strengths of both systems: Kubernetes provides a stable and reliable computing environment, which lets TensorFlow fully exploit its support for heterogeneous computing and greatly reduces the difficulty of large-scale use. The paper also establishes an agile management platform that provides fast resource allocation, one-click deployment, startup within seconds, dynamic scaling, and efficient training for distributed TensorFlow.
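As an illustration of what containerizing distributed TensorFlow on Kubernetes involves, the sketch below builds the `TF_CONFIG` value that each parameter-server or worker container would need in order to find its peers. It is not taken from the paper. The naming scheme (one headless Service per role, StatefulSet Pods named `<service>-<index>`), the `cluster.local` domain, port 2222, and the helper names `build_cluster_spec` and `build_tf_config` are all assumptions made for the example.

```python
import json


def build_cluster_spec(job_name, num_ps, num_workers,
                       namespace="default", port=2222):
    """Map every TensorFlow task to an assumed stable Pod DNS name.

    Hypothetical layout: one headless Service per role, named
    "<job_name>-ps" and "<job_name>-worker", each fronting a
    StatefulSet of the same name whose Pods are "<service>-<index>".
    """
    def hosts(role, count):
        service = f"{job_name}-{role}"
        return [
            f"{service}-{i}.{service}.{namespace}.svc.cluster.local:{port}"
            for i in range(count)
        ]

    return {"ps": hosts("ps", num_ps), "worker": hosts("worker", num_workers)}


def build_tf_config(cluster, task_type, task_index):
    """Return the JSON string to place in one container's TF_CONFIG variable."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index out of range: {task_index}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}},
        sort_keys=True,
    )


if __name__ == "__main__":
    # Example: 1 parameter server and 2 workers for a job called "mnist".
    spec = build_cluster_spec("mnist", num_ps=1, num_workers=2)
    print(build_tf_config(spec, "worker", 0))
```

Under these assumptions, scaling the job up or down only means regenerating the spec with new replica counts and injecting the resulting string into each container's environment.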
Authors
YU Chang-fa (余昌发); CHEN Xue-lin (程学林); YANG Xiao-hu (杨小虎)
School of Software Technology, Zhejiang University, Hangzhou 310027, China
Source
Computer Science (《计算机科学》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2018, Issue B11, pp. 527-531 (5 pages)
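Funding
Fundamental Research Funds for the Central Universities
National Key Technology R&D Program of China: Research and Demonstration of Technologies for Building and Evaluating the Performance of Public Cultural Science and Technology Service Capability (2015BAK26B00)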