Abstract
Deep learning work in large enterprises suffers from scattered management and a large amount of duplicated development effort. To support whole-process management of large-scale deep learning and efficient reuse of trained models, a deep learning platform is proposed against the background of the two-level, multi-center deployment architecture of State Grid Corporation of China. The system distributes the management of training, inference, data, and models across different centers, which cooperate to close the deep learning loop. A private cloud based on Kubernetes supports the parallel computation of large batches of deep learning applications. The front-end interface uses operator-based flow orchestration to provide visual modeling and functional extensibility. Experimental results show that the system can run multiple deep learning tasks in parallel and that the additional performance overhead is acceptable.
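As an illustration of the operator-based flow orchestration mentioned in the abstract, the following is a minimal sketch only; it is not the platform's implementation. It assumes a flow can be modeled as a directed acyclic graph of named operators, each receiving the outputs of its upstream operators. The operator names (`load_data`, `normalize`, `train`), the `run_flow` helper, and the toy data are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List

# Hypothetical operators: each takes a dict of upstream results keyed by
# operator name and returns its own result.
def load_data(inputs: Dict[str, Any]) -> List[float]:
    return [1.0, 2.0, 3.0, 4.0]  # toy data standing in for a real dataset

def normalize(inputs: Dict[str, Any]) -> List[float]:
    data = inputs["load_data"]
    peak = max(data)
    return [x / peak for x in data]

def train(inputs: Dict[str, Any]) -> Dict[str, float]:
    data = inputs["normalize"]
    # Stand-in for model training: just summarize the data.
    return {"mean": sum(data) / len(data)}

def run_flow(
    operators: Dict[str, Callable[[Dict[str, Any]], Any]],
    dependencies: Dict[str, List[str]],
) -> Dict[str, Any]:
    """Run every operator after all of its upstream operators have finished."""
    order = TopologicalSorter(dependencies).static_order()
    results: Dict[str, Any] = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, [])}
        results[name] = operators[name](upstream)
    return results

# Assumed flow description: operator name -> names of upstream operators.
OPERATORS = {"load_data": load_data, "normalize": normalize, "train": train}
FLOW = {"load_data": [], "normalize": ["load_data"], "train": ["normalize"]}

results = run_flow(OPERATORS, FLOW)
print(results["train"])
```

In this sketch, adding a new capability amounts to registering another operator and wiring it into the flow description, which is one plausible reading of how an operator-based design can keep a platform extensible.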
Author
Cheng Zhonghan (程仲汉)
Department of Computer and Information Security Management, Fujian Police College, Fuzhou 350007, Fujian, China
Source
Computer Applications and Software (《计算机应用与软件》)
Indexed in the Peking University Core Journals list (北大核心)
2024, No. 3, pp. 16-21, 48 (7 pages in total)
Funding
Fujian Province Education and Research Project for Young and Middle-aged Teachers (JAT200379).