Abstract
Existing gradient sparsification compression techniques incur large time overhead in practical applications. To address this problem, a low-complexity, fast approach for selecting the top-k sparse gradient communication set was proposed, based on the residual gradient compression algorithm in distributed training. First, the Wasserstein distance was used to verify that the gradient distribution conforms to a Laplacian distribution. Second, key points were determined from the area relationship of the Laplacian distribution curve, and the distribution parameters were simplified by maximum likelihood estimation. Finally, the top-k threshold of the sparse gradients was estimated and corrected by binary search. The proposed approach avoids the instability of random-sampling methods and complex operations such as data sorting. To evaluate its effectiveness, deep neural networks for image classification were trained on a GPU platform with the CIFAR-10 and CIFAR-100 datasets. Results show that, at the same training accuracy, the proposed approach achieved speedups of up to 1.62 and 1.30 times over the radixSelect and hierarchical selection methods, respectively.
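The threshold-estimation idea summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: assuming the gradient values follow a zero-mean Laplacian distribution with scale b, the maximum-likelihood estimate is b = mean(|g|); the tail probability P(|x| > t) = exp(-t/b) then gives an initial top-k threshold t = b·ln(n/k), which a binary search refines so that roughly k elements exceed it. The function name and parameters are illustrative.

```python
import numpy as np

def estimate_topk_threshold(grad, k, max_iter=20):
    """Estimate the top-k magnitude threshold of a gradient tensor,
    assuming its values follow a zero-mean Laplacian distribution."""
    g = np.abs(np.asarray(grad).ravel())
    n = g.size
    # Maximum-likelihood estimate of the Laplacian scale parameter b.
    b = g.mean()
    # For Laplacian(0, b), P(|x| > t) = exp(-t / b); choose t so that
    # the expected number of elements in the tail equals k.
    t = b * np.log(n / k)
    # Correct t by binary search until about k elements exceed it.
    lo, hi = 0.0, g.max()
    for _ in range(max_iter):
        count = int((g > t).sum())
        if count > k:      # too many selected: raise the threshold
            lo = t
        elif count < k:    # too few selected: lower the threshold
            hi = t
        else:
            break
        t = 0.5 * (lo + hi)
    return t
```

Because the analytic estimate already lands close to the true top-k threshold, the binary search typically needs only a few iterations, which is what makes the method cheaper than full sorting or radix selection.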
Authors
CHEN Shi-da, LIU Qiang, HAN Liang
(School of Microelectronics, Tianjin University, Tianjin 300072, China; Tianjin Key Laboratory of Imaging and Sensing Microelectronic Technology, Tianjin 300072, China; Alibaba Group, Sunnyvale 94085, USA)
Source
《浙江大学学报(工学版)》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2021, No. 2, pp. 386-394 (9 pages)
Journal of Zhejiang University:Engineering Science
Funding
National Natural Science Foundation of China (61974102)
Alibaba Innovative Research Program
Keywords
deep neural network
distributed training
residual gradient compression
top-k threshold
distribution estimation
binary search