
An Automatic Model Splitting Strategy Generation Method for Model Parallel Training
Abstract As training datasets grow and models become more complex, the cost of training deep neural networks keeps rising and places heavier compute demands on the platform; parallelizing model training has become an urgent need for keeping applications timely. In recent years, AI accelerators for distributed training (such as FPGAs, TPUs, and AI chips) have appeared in rapid succession, providing the hardware foundation for parallel training of deep neural networks. To make full use of these hardware resources, researchers need to run model-parallel training on computing platforms that combine AI accelerators with different compute capabilities and hardware architectures. How to use the computing resources of the various accelerators efficiently, and how to balance the training load across them, has therefore been a persistent focus of research. This paper proposes a method that automatically generates a model splitting strategy for model-parallel training. Starting from a static network model, the method generates the splitting strategy and maps it onto model training, thereby assigning network layers to different AI accelerators. The allocation strategy generated this way can efficiently use all computing resources on a single computing platform and keeps the training load balanced across devices. Compared with the manual splitting strategies currently in use, it is faster, reducing the time to generate a splitting strategy by a factor of more than 100, and it reduces the uncertainty introduced by human factors.
Authors WANG Li (王丽); GUO Zhen-hua (郭振华); CAO Fang (曹芳); GAO Kai (高开); ZHAO Ya-qian (赵雅倩); ZHAO Kun (赵坤) (State Key Laboratory of High-End&Storage Technology, Inspur Electronic Information Industry Co. Ltd., Jinan 250000; Guangdong Inspur Big Data Research Co. Ltd., Guangzhou 510000, China)
Source Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal (北大核心), 2020, No. 9, pp. 1529-1537 (9 pages)
Keywords model parallelism; model training; model splitting; load balancing
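
The abstract does not spell out the paper's algorithm, so the following is only a rough illustration of the general idea it describes: assigning consecutive network layers to accelerators of unequal compute capability so that no device becomes a bottleneck. It is a generic dynamic-programming partition, not the authors' method. The function name `split_layers`, the per-layer cost estimates, and the relative device speeds are all hypothetical inputs chosen for the sketch.

```python
from itertools import accumulate


def split_layers(layer_costs, device_speeds):
    """Assign contiguous layers to devices, minimizing the slowest device's time.

    layer_costs:   estimated cost of each layer (e.g. FLOPs), in network order.
    device_speeds: relative throughput of each device, in placement order.
    Returns (ranges, bottleneck), where ranges holds one half-open
    (start, end) layer range per device. A device may receive an empty range.
    """
    n, m = len(layer_costs), len(device_speeds)
    prefix = [0.0] + list(accumulate(layer_costs))
    inf = float("inf")

    # best[d][i]: smallest achievable bottleneck time when the first i layers
    # are placed on the first d devices. cut[d][i] records where device d-1
    # starts in that best solution.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0

    for d in range(1, m + 1):
        for i in range(n + 1):
            for j in range(i + 1):  # device d-1 takes layers j .. i-1
                if best[d - 1][j] == inf:
                    continue
                device_time = (prefix[i] - prefix[j]) / device_speeds[d - 1]
                candidate = max(best[d - 1][j], device_time)
                if candidate < best[d][i]:
                    best[d][i] = candidate
                    cut[d][i] = j

    # Walk the recorded cut points backwards to recover each device's range.
    ranges, i = [], n
    for d in range(m, 0, -1):
        j = cut[d][i]
        ranges.append((j, i))
        i = j
    ranges.reverse()
    return ranges, best[m][n]


# Six equally expensive layers, one device twice as fast as the other.
ranges, bottleneck = split_layers([2, 2, 2, 2, 2, 2], [2.0, 1.0])
print(ranges, bottleneck)
```

In this toy case the faster device receives four layers and the slower one two, so both finish in the same time. A real strategy generator would also need to account for memory limits and inter-device communication, which this sketch ignores.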