Abstract
As deep learning (DL) plays an increasingly important role in many fields, designing high-performance, low-power, low-latency DL hardware accelerators has become a research focus in computer architecture. Based on the structure and optimization methods of DL models, this paper analyzes the difficulties and challenges of implementing DL in hardware, compares the advantages and shortcomings of current mainstream DL hardware acceleration platforms, and proposes a DL hardware acceleration scheme based on the 飞腾–迈创 (FT-Matrix) general-purpose vector DSP, describing its acceleration techniques such as vector broadcasting and matrix conversion. Finally, addressing the shortcomings of current general-purpose vector DSP acceleration, it discusses in depth optimization techniques such as reconfigurable computing arrays that serve both general-purpose vector computation and dedicated DL computation.
Authors
Huili WANG (王慧丽)
Yang GUO (郭阳)
Wanxia QU (屈婉霞)
School of Computer, National University of Defense Technology, Changsha 410073, China
Source
Scientia Sinica (Informationis) (《中国科学:信息科学》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2019, No. 3, pp. 256-276 (21 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 61832018, 61572025)