
Evaluation and Optimization for Huawei Ascend Neural Network Accelerator
(华为昇腾神经网络加速器性能评测与优化; cited by: 3)
Abstract: Huawei Ascend is a new neural network accelerator. Unlike GPUs, Ascend is designed specifically for neural network computation: it provides dedicated compute units, concentrates its core computing power at low precision, and ships with a software stack that differs from the GPU ecosystem. Most existing studies analyze and optimize deep learning workloads on GPUs; since the Ascend platform is relatively new and has novel architectural features, its practical performance remains to be explored. To understand Ascend's performance and optimization methods in depth, this paper presents a systematic evaluation and analysis, including: (1) a comparison of performance and power consumption between Ascend and GPU on four end-to-end neural networks (ResNet, Transformer, DeepFM, and LSTM) with standard datasets; (2) a study of optimization strategies on Ascend for deep learning frameworks, operators, and mixed-precision training; (3) measurements of floating-point throughput, hardware utilization, and memory-access performance for three compute-intensive operators (fully connected, convolution, and RNN). The results show that the Huawei Ascend accelerator is well suited to dense neural network workloads and consumes less power than the GPU, and that training on Ascend requires quantizing models from 32-bit to 16-bit precision. Based on Ascend's architecture and compiler software stack, we propose the following optimization strategies: deep learning frameworks should compile the whole computation graph and fuse operators; operator developers should choose tile sizes carefully and implement operators in low precision where possible; model training should use carefully chosen mixed-precision settings.

The great success achieved by deep neural networks (DNNs) mainly relies on the computation ability provided by modern chips. Nvidia's high-performance, general-purpose Graphics Processing Units (GPUs) are widely used to build deep learning tools and software. There is an industry-wide trend towards domain-specific neural network accelerators that extend deep learning performance. For example, Google has released the Tensor Processing Unit (TPU) and deployed TPUs in its data centers; MIT proposed an energy-efficient reconfigurable accelerator for deep convolutional neural networks. In addition to these accelerators, Huawei has developed the Ascend accelerator, including Ascend 910 for training and Ascend 310 for inference. Ascend accelerators feature high computing power, high integration, and fast network bandwidth. Taking Ascend 910 as an example, it delivers 256 TFLOPS at half precision, 32 GB of memory with 1200 GB/s bandwidth, and a 100G RoCE v2 network adapter. Compared with GPUs, Ascend targets neural networks specifically. The differences between Ascend and GPU are: (1) Ascend uses task-specific processing units designed mainly for neural networks; (2) its computing power is concentrated at lower precision; (3) its compiler software stack differs from the GPU's.

The main goal of deep learning is to train a statistical model on a training dataset such that the fitted model makes high-quality predictions on unseen data, which is referred to as generalization. From the perspective of hardware design, task-specific processing units can greatly speed up particular workloads, and lower precision enables faster training for a single iteration. However, task-specific processing units may not meet the needs of a wide variety of deep learning models, and lower-precision hardware requires special software-level optimization methods. Previous benchmarks and analyses focused on deep learning on GPU platforms; Ascend has its own novel features, and its potential remains unknown. To thoroughly understand its performance and optimization methods, we conduct a systematic evaluation of Huawei Ascend and analyze optimization methods for faster training. Our contributions include: (1) we compare the performance of Ascend and GPU on four end-to-end neural networks (ResNet, Transformer, DeepFM, and LSTM) on well-known public datasets; (2) we analyze optimization methods on Ascend, including deep learning framework development, operator tiling strategy, and mixed-precision training; (3) we measure hardware utilization and memory-access patterns on three compute-intensive operators (fully connected, convolution, and RNN). To the best of our knowledge, we are the first to conduct a comprehensive analysis of Ascend.

Ascend is suitable for dense neural network workloads, its power consumption during training is lower than the GPU's, and neural networks should be quantized from 32-bit to 16-bit precision. Based on the characteristics of the architecture and compiler software stack, we propose the following optimization strategies to achieve better performance: when developing deep learning frameworks, compile the whole computation graph of the model so that operators can be fused; when developing operators, configure the tile size carefully and use lower precision; when training models, adopt mixed-precision settings within a reasonable range. Ascend is not suitable for sparse workloads, and internal errors can occur when allocating extremely large blocks of memory.
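The abstract's finding that models must be quantized from 32-bit to 16-bit precision, with mixed-precision parameters chosen carefully, comes down to FP16's narrow dynamic range. A minimal NumPy sketch of the underlying issue (the gradient magnitude and loss-scale constant below are illustrative assumptions, not measured Ascend values) shows why loss scaling is part of a reasonable mixed-precision configuration:

```python
import numpy as np

# FP16 cannot represent magnitudes below ~6e-8 (its smallest subnormal),
# so tiny gradients vanish when tensors are cast from 32- to 16-bit.
tiny_grad = np.float32(1e-8)
print(np.float16(tiny_grad))            # underflows to 0.0

# Loss scaling: multiply the loss (and hence all gradients) by a constant
# before the fp16 backward pass, then divide it back out in fp32.
scale = np.float32(1024.0)
scaled = np.float16(tiny_grad * scale)  # now representable in fp16
recovered = np.float32(scaled) / scale
print(recovered)                        # ~1e-8, gradient preserved
```

If the scale is set too high, large gradients overflow to infinity instead; this trade-off is why the abstract recommends keeping mixed-precision parameters "within a reasonable range".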
Authors: 鲁蔚征, 张峰, 贺寅烜, 陈跃国, 翟季冬, 杜小勇 (LU Wei-Zheng; ZHANG Feng; HE Yin-Xuan; CHEN Yue-Guo; ZHAI Ji-Dong; DU Xiao-Yong). Affiliations: Office of Research Infrastructure, Renmin University of China, Beijing 100872; Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872; School of Information, Renmin University of China, Beijing 100872; Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Source: Chinese Journal of Computers (《计算机学报》; indexed in EI, CAS, CSCD, Peking University Core), 2022, Issue 8, pp. 1618-1637 (20 pages)
Funding: National Key R&D Program of China (2018YFB1004401); National Natural Science Foundation of China (U1711261, 62172419); Ministry of Education Industry-University Collaborative Education Program (Huawei Ascend)
Keywords: deep learning; neural network accelerator; Huawei Ascend; high-performance computing (HPC); benchmark
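The framework-level strategy the abstract recommends, compiling the whole computation graph so adjacent operators can be fused, can be illustrated in plain NumPy. This is a conceptual sketch only; the function names and shapes are invented for illustration and are not part of Ascend's actual software stack:

```python
import numpy as np

def unfused_forward(x, w, b):
    # Three separate operators: each intermediate tensor is materialized
    # in memory before the next kernel reads it back.
    t1 = x @ w                   # matmul
    t2 = t1 + b                  # bias add
    return np.maximum(t2, 0.0)   # ReLU

def fused_forward(x, w, b):
    # A fused kernel computes the same result in one pass, avoiding the
    # memory round trips for t1 and t2. Whole-graph compilation is what
    # lets the compiler discover and apply such fusions automatically.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w = rng.normal(size=(16, 4))
b = rng.normal(size=4)
assert np.allclose(unfused_forward(x, w, b), fused_forward(x, w, b))
```

The two paths are numerically identical; the benefit of fusion is reduced memory traffic and kernel-launch overhead, which matters on bandwidth-sensitive accelerators.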