
Evaluation and Optimization for Huawei Ascend Neural Network Accelerator
(华为昇腾神经网络加速器性能评测与优化; cited by: 3)
Abstract: Huawei Ascend is a new neural network accelerator. Unlike GPUs, Ascend is designed specifically for neural network computation: it provides dedicated compute units, concentrates its core computing power at low precision, and ships with a software stack that differs from the GPU ecosystem. Most existing studies analyze and optimize deep learning workloads on GPUs; since the Ascend platform is relatively new and has novel architectural features, its practical performance remains to be explored. To understand Ascend's performance and optimization methods in depth, this paper presents a systematic evaluation and analysis, including: (1) a comparison of performance and power consumption between Ascend and GPU on four end-to-end neural networks (ResNet, Transformer, DeepFM, and LSTM) with standard datasets; (2) a study of optimization strategies on Ascend for deep learning frameworks, operators, and mixed-precision training; (3) measurements of floating-point throughput, hardware utilization, and memory-access performance for three compute-intensive operators (fully connected, convolution, and RNN). The results show that the Huawei Ascend accelerator is well suited to dense neural network workloads and consumes less power than the GPU, and that training on Ascend requires quantizing models from 32-bit to 16-bit precision. Based on Ascend's architecture and compiler software stack, we propose the following optimization strategies: deep learning frameworks should compile the whole computation graph and fuse operators; operator developers should choose tile sizes carefully and implement operators in low precision where possible; model training should use carefully chosen mixed-precision settings.

The great success achieved by deep neural networks (DNNs) mainly relies on the computation ability provided by modern chips. Nvidia's high-performance, general-purpose Graphics Processing Units (GPUs) are widely used to build deep learning tools and software. There is an industry-wide trend towards domain-specific neural network accelerators that extend deep learning performance. For example, Google has released the Tensor Processing Unit (TPU) and deployed TPUs in its data centers; MIT proposed an energy-efficient reconfigurable accelerator for deep convolutional neural networks. In addition to these accelerators, Huawei has developed the Ascend accelerator, including Ascend 910 for training and Ascend 310 for inference. Ascend accelerators feature high computing power, high integration, and fast network bandwidth. Taking Ascend 910 as an example, it delivers 256 TFLOPS at half precision, 32 GB of memory with 1200 GB/s bandwidth, and a 100G RoCE v2 network adapter. Compared with GPUs, Ascend targets neural networks specifically. The differences between Ascend and GPU are: (1) Ascend uses task-specific processing units designed mainly for neural networks; (2) its computing power is concentrated at lower precision; (3) its compiler software stack differs from the GPU's.

The main goal of deep learning is to train a statistical model on a training dataset such that the fitted model makes high-quality predictions on unseen data, which is referred to as generalization. From the perspective of hardware design, task-specific processing units can greatly speed up particular workloads, and lower precision enables faster training for a single iteration. However, task-specific processing units may not meet the needs of a wide variety of deep learning models, and lower-precision hardware requires special software-level optimization methods. Previous benchmarks and analyses focused on deep learning on GPU platforms; Ascend has its own novel features, and its potential remains unknown. To thoroughly understand its performance and optimization methods, we conduct a systematic evaluation of Huawei Ascend and analyze optimization methods for faster training. Our contributions include: (1) we compare the performance of Ascend and GPU on four end-to-end neural networks (ResNet, Transformer, DeepFM, and LSTM) on well-known public datasets; (2) we analyze optimization methods on Ascend, including deep learning framework development, operator tiling strategy, and mixed-precision training; (3) we measure hardware utilization and memory-access patterns on three compute-intensive operators (fully connected, convolution, and RNN). To the best of our knowledge, we are the first to conduct a comprehensive analysis of Ascend.

Ascend is suitable for dense neural network workloads, its power consumption during training is lower than the GPU's, and neural networks should be quantized from 32-bit to 16-bit precision. Based on the characteristics of the architecture and compiler software stack, we propose the following optimization strategies to achieve better performance: when developing deep learning frameworks, compile the whole computation graph of the model so that operators can be fused; when developing operators, configure the tile size carefully and use lower precision; when training models, adopt mixed-precision settings within a reasonable range. Ascend is not suitable for sparse workloads, and internal errors can occur when allocating extremely large blocks of memory.
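The abstract's finding that models must be quantized from 32-bit to 16-bit precision, with mixed-precision parameters chosen carefully, comes down to FP16's narrow dynamic range. A minimal NumPy sketch of the underlying issue (the gradient magnitude and loss-scale constant below are illustrative assumptions, not measured Ascend values) shows why loss scaling is part of a reasonable mixed-precision configuration:

```python
import numpy as np

# FP16 cannot represent magnitudes below ~6e-8 (its smallest subnormal),
# so tiny gradients vanish when tensors are cast from 32- to 16-bit.
tiny_grad = np.float32(1e-8)
print(np.float16(tiny_grad))            # underflows to 0.0

# Loss scaling: multiply the loss (and hence all gradients) by a constant
# before the fp16 backward pass, then divide it back out in fp32.
scale = np.float32(1024.0)
scaled = np.float16(tiny_grad * scale)  # now representable in fp16
recovered = np.float32(scaled) / scale
print(recovered)                        # ~1e-8, gradient preserved
```

If the scale is set too high, large gradients overflow to infinity instead; this trade-off is why the abstract recommends keeping mixed-precision parameters "within a reasonable range".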
Authors: 鲁蔚征, 张峰, 贺寅烜, 陈跃国, 翟季冬, 杜小勇 (LU Wei-Zheng; ZHANG Feng; HE Yin-Xuan; CHEN Yue-Guo; ZHAI Ji-Dong; DU Xiao-Yong). Affiliations: Office of Research Infrastructure, Renmin University of China, Beijing 100872; Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872; School of Information, Renmin University of China, Beijing 100872; Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Source: Chinese Journal of Computers (《计算机学报》; indexed in EI, CAS, CSCD, Peking University Core), 2022, Issue 8, pp. 1618-1637 (20 pages)
Funding: National Key R&D Program of China (2018YFB1004401); National Natural Science Foundation of China (U1711261, 62172419); Ministry of Education Industry-University Collaborative Education Program (Huawei Ascend)
Keywords: deep learning; neural network accelerator; Huawei Ascend; high-performance computing (HPC); benchmark
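The framework-level strategy the abstract recommends, compiling the whole computation graph so adjacent operators can be fused, can be illustrated in plain NumPy. This is a conceptual sketch only; the function names and shapes are invented for illustration and are not part of Ascend's actual software stack:

```python
import numpy as np

def unfused_forward(x, w, b):
    # Three separate operators: each intermediate tensor is materialized
    # in memory before the next kernel reads it back.
    t1 = x @ w                   # matmul
    t2 = t1 + b                  # bias add
    return np.maximum(t2, 0.0)   # ReLU

def fused_forward(x, w, b):
    # A fused kernel computes the same result in one pass, avoiding the
    # memory round trips for t1 and t2. Whole-graph compilation is what
    # lets the compiler discover and apply such fusions automatically.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w = rng.normal(size=(16, 4))
b = rng.normal(size=4)
assert np.allclose(unfused_forward(x, w, b), fused_forward(x, w, b))
```

The two paths are numerically identical; the benefit of fusion is reduced memory traffic and kernel-launch overhead, which matters on bandwidth-sensitive accelerators.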