Abstract
Tensor compilers compile an operator's tensor expression and schedule into code for target hardware. To accelerate tensor computation, domain-specific deep learning processors adopt proprietary architectures with special instructions, supporting multi-core parallelism, multi-level dedicated memory hierarchies, and tensor computation; on top of the hardware sits a tensor instruction set tightly coupled to the hardware's characteristics. On such complex architectures, the use of tensor instructions is subject to many constraints and limitations, which raises the following problems and challenges. First, the conditional branches introduced by loop segmentation, such as computation task division or data tiling, increase the difficulty of pattern matching. Second, tensor instructions carry hardware constraints such as alignment and data layout. To address these challenges, this paper proposes an optimized tensor instruction generation algorithm that incorporates loop partitioning. The algorithm partitions loop intervals to eliminate the conditional branches introduced by task division or data tiling; it resolves instruction and hardware constraints by zero padding, equivalent instruction substitution, and adding extra computation; and it generates tensor instructions by pattern matching. Building on a study and extension of the open-source deep learning compiler TVM, version 0.7, a prototype compiler supporting tensor instruction generation for a DianNao-architecture machine learning accelerator is implemented. To evaluate the effectiveness of the algorithm, the performance and development efficiency of three classes of operators, namely element-wise binary tensor operators, in-place unary tensor operators, and convolution operators, are tested on a DianNao-architecture machine learning accelerator platform. Experimental results show that the three operator classes achieve an average speedup of 125.00% and a maximum speedup of 194.00%, with development efficiency improved by up to 7 times.
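To make the abstract's two key ideas concrete, the sketch below illustrates them with the te/tir APIs of TVM 0.7, the compiler the paper extends. It is a minimal illustration under stated assumptions, not the paper's implementation: the extern call vadd16 is a hypothetical stand-in for a DianNao-style 16-lane vector-add instruction, and the tir.LoopPartition pass config is assumed to be available as in stock TVM 0.7. The first part shows how a non-divisible split introduces the tile-boundary conditional that blocks pattern matching, and how loop partitioning removes it; the second shows tensorize mapping a branch-free inner loop onto a declared tensor intrinsic.

```python
import tvm
from tvm import te

N = 100          # not a multiple of the tile factor, so tiling yields a tail
FACTOR = 16

A = te.placeholder((N,), name="A", dtype="float32")
B = te.placeholder((N,), name="B", dtype="float32")
C = te.compute((N,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=FACTOR)

# Without partitioning, the lowered TIR guards every tile with a bound
# check (an if-branch), which defeats pattern matching on the tile body.
print(tvm.lower(s, [A, B, C], simple_mode=True))

# Loop partitioning splits the iteration space into a full-tile region
# (branch-free, hence tensorizable) and a small remainder region.
with tvm.transform.PassContext(
        config={"tir.LoopPartition": {"partition_const_loop": True}}):
    print(tvm.lower(s, [A, B, C], simple_mode=True))


def vadd16_intrin():
    """Declare a 16-lane vector-add tensor intrinsic.

    The extern call "vadd16" is a placeholder for the accelerator's real
    instruction; offset_factor=1 relaxes alignment for this sketch.
    """
    a = te.placeholder((FACTOR,), name="a", dtype="float32")
    b = te.placeholder((FACTOR,), name="b", dtype="float32")
    c = te.compute((FACTOR,), lambda i: a[i] + b[i], name="c")

    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="Ab", offset_factor=1)
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="Bb", offset_factor=1)
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="Cb", offset_factor=1)

    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        # Replace the matched inner loop with one hardware instruction.
        ib.emit(tvm.tir.call_extern("int32", "vadd16",
                                    cc.access_ptr("w"),
                                    aa.access_ptr("r"),
                                    bb.access_ptr("r")))
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func,
                                 binds={a: Ab, b: Bb, c: Cb})


# Tensorize requires a branch-free full tile; a divisible extent is used
# here so that the inner loop matches the intrinsic's compute pattern.
N2 = 96
A2 = te.placeholder((N2,), name="A2", dtype="float32")
B2 = te.placeholder((N2,), name="B2", dtype="float32")
C2 = te.compute((N2,), lambda i: A2[i] + B2[i], name="C2")
s2 = te.create_schedule(C2.op)
yo, yi = s2[C2].split(C2.op.axis[0], factor=FACTOR)
s2[C2].tensorize(yi, vadd16_intrin())
print(tvm.lower(s2, [A2, B2, C2], simple_mode=True))
```

Stock TVM's tensorize rejects tiles guarded by boundary conditionals; the abstract's approach partitions the loop first so the full-tile region tensorizes cleanly, while the remainder is handled by zero padding, equivalent instruction substitution, or extra computation.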
Authors
LIANG Jiali (梁佳利), HUA Baojian (华保健), SU Shaobo (苏少博)
School of Software Engineering, University of Science and Technology of China, Hefei 230000, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2023, No. 2, pp. 374-383 (10 pages)
Funding
Graduate Education Innovation Program of the University of Science and Technology of China (2020YCJC41, 2021YCJC34).
Keywords
Deep learning
Tensor compiler
Domain-specific processor
Tensorization
Loop partition