Angel^+: A Large-Scale Machine Learning Platform on Angel (Chinese title: Angel^+:基于Angel的分布式机器学习平台)

Abstract: [Objective] In the big data era, real-world data has become high-dimensional and sparse, and machine learning (ML) models have become correspondingly complex and high-dimensional, which poses many challenges for distributed ML systems. Although researchers have developed many high-performance ML systems, such as TensorFlow, PyTorch and XGBoost, these systems suffer from two problems: (1) they do not integrate well with existing big data systems; (2) they are not general enough, since each is usually designed for a specific class of ML algorithms. [Methods] To address these two challenges, this paper introduces Angel^+, a distributed ML platform based on the parameter server architecture. [Results] Angel^+ efficiently supports both existing big data systems and existing ML systems. Relying on the parameter server's ability to handle high-dimensional models, Angel^+ gives big data systems (Apache Spark, for instance) the ability to train very large ML models efficiently in a non-intrusive way, and it also runs existing distributed ML systems (PyTorch, for instance) efficiently. In addition, to deal with the high communication cost and the straggler problem in distributed ML, Angel^+ provides techniques such as model averaging, gradient compression and heterogeneity-aware stochastic gradient descent. [Conclusions] The authors have implemented many efficient, easy-to-use ML models on Angel^+ and have verified the platform's efficiency through experiments.
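As a rough illustration of two ideas the abstract names, the sketch below shows a toy in-process parameter server with sparse pull/push and a coordinate-wise model-averaging helper. It is not Angel^+'s API: `ToyParameterServer`, `worker_step` and `average_models` are hypothetical names introduced here, the model is assumed to be a sparse linear model trained with squared loss, and all communication is simulated inside a single process.

```python
class ToyParameterServer:
    """Hypothetical in-process stand-in for a parameter server (not Angel^+'s API)."""

    def __init__(self):
        # Only updated coordinates are stored, so memory grows with the
        # number of non-zero weights, not with the model dimension.
        self.weights = {}

    def pull(self, keys):
        """Return the current values of the requested coordinates only."""
        return {k: self.weights.get(k, 0.0) for k in keys}

    def push(self, grads, lr):
        """Apply a sparse gradient update sent by one worker."""
        for k, g in grads.items():
            self.weights[k] = self.weights.get(k, 0.0) - lr * g


def worker_step(server, example, lr=0.1):
    """One SGD step on squared loss for a sparse linear model (an assumption)."""
    features, label = example
    w = server.pull(features.keys())          # fetch only the needed weights
    pred = sum(w[k] * x for k, x in features.items())
    err = pred - label
    grads = {k: err * x for k, x in features.items()}
    server.push(grads, lr)                    # send back a sparse gradient
    return 0.5 * err * err                    # loss before the update


def average_models(models):
    """Model averaging: coordinate-wise mean of the workers' local models."""
    keys = set().union(*models)
    return {k: sum(m.get(k, 0.0) for m in models) / len(models) for k in keys}


if __name__ == "__main__":
    server = ToyParameterServer()
    example = ({3: 1.0, 1_000_000: 2.0}, 1.0)  # two non-zero features
    print(worker_step(server, example), worker_step(server, example))
    print(average_models([{1: 1.0, 2: 2.0}, {1: 3.0}]))
```

Each worker touches only the coordinates present in its example, which is why a very high-dimensional sparse model can be trained without any worker holding the full weight vector. Averaging local models periodically, instead of pushing every gradient, is one common way to trade update freshness for lower communication.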

Authors: Zhang Zhipeng; Jiang Jiawei; Yu Lele; Cui Bin (Department of Computer Science & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing 100871, China; Tencent, Beijing 100193, China)

Affiliations: Peking University; Tencent

Source: Frontiers of Data & Computing (数据与计算发展前沿), 2019, Issue 1, pp. 63-72 (10 pages)

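Funding: National Key Research and Development Program of China, Key Special Project (2018YFB1004403); National Natural Science Foundation of China (61832001).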

Keywords: machine learning platform; parameter servers; big data systems; distributed machine learning systems

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
