Abstract
Tensor compilers compile an operator's tensor expression and schedule into code for target hardware. To accelerate tensor computation, domain-specific deep learning processors adopt proprietary architectures with special instructions, supporting multi-core parallelism, multi-level dedicated memory hierarchies, and tensor computation; on top of the hardware sits a tensor instruction set tightly coupled to the hardware's characteristics. On such complex architectures, the use of tensor instructions is subject to many constraints and limitations, which raises the following problems and challenges. First, the conditional branches introduced by loop segmentation, such as computation task division or data tiling, increase the difficulty of pattern matching. Second, tensor instructions carry hardware constraints such as alignment and data layout. To address these challenges, this paper proposes an optimized tensor instruction generation algorithm that incorporates loop partitioning. The algorithm partitions loop intervals to eliminate the conditional branches introduced by task division or data tiling; it resolves instruction and hardware constraints by zero padding, equivalent instruction substitution, and adding extra computation; and it generates tensor instructions by pattern matching. Building on a study and extension of the open-source deep learning compiler TVM, version 0.7, a prototype compiler supporting tensor instruction generation for a DianNao-architecture machine learning accelerator is implemented. To evaluate the effectiveness of the algorithm, the performance and development efficiency of three classes of operators, namely element-wise binary tensor operators, in-place unary tensor operators, and convolution operators, are tested on a DianNao-architecture machine learning accelerator platform. Experimental results show that the three operator classes achieve an average speedup of 125.00% and a maximum speedup of 194.00%, with development efficiency improved by up to 7 times.
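To make the abstract's two key ideas concrete, the sketch below illustrates them with the te/tir APIs of TVM 0.7, the compiler the paper extends. It is a minimal illustration under stated assumptions, not the paper's implementation: the extern call vadd16 is a hypothetical stand-in for a DianNao-style 16-lane vector-add instruction, and the tir.LoopPartition pass config is assumed to be available as in stock TVM 0.7. The first part shows how a non-divisible split introduces the tile-boundary conditional that blocks pattern matching, and how loop partitioning removes it; the second shows tensorize mapping a branch-free inner loop onto a declared tensor intrinsic.

```python
import tvm
from tvm import te

N = 100          # not a multiple of the tile factor, so tiling yields a tail
FACTOR = 16

A = te.placeholder((N,), name="A", dtype="float32")
B = te.placeholder((N,), name="B", dtype="float32")
C = te.compute((N,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=FACTOR)

# Without partitioning, the lowered TIR guards every tile with a bound
# check (an if-branch), which defeats pattern matching on the tile body.
print(tvm.lower(s, [A, B, C], simple_mode=True))

# Loop partitioning splits the iteration space into a full-tile region
# (branch-free, hence tensorizable) and a small remainder region.
with tvm.transform.PassContext(
        config={"tir.LoopPartition": {"partition_const_loop": True}}):
    print(tvm.lower(s, [A, B, C], simple_mode=True))


def vadd16_intrin():
    """Declare a 16-lane vector-add tensor intrinsic.

    The extern call "vadd16" is a placeholder for the accelerator's real
    instruction; offset_factor=1 relaxes alignment for this sketch.
    """
    a = te.placeholder((FACTOR,), name="a", dtype="float32")
    b = te.placeholder((FACTOR,), name="b", dtype="float32")
    c = te.compute((FACTOR,), lambda i: a[i] + b[i], name="c")

    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="Ab", offset_factor=1)
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="Bb", offset_factor=1)
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="Cb", offset_factor=1)

    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        # Replace the matched inner loop with one hardware instruction.
        ib.emit(tvm.tir.call_extern("int32", "vadd16",
                                    cc.access_ptr("w"),
                                    aa.access_ptr("r"),
                                    bb.access_ptr("r")))
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func,
                                 binds={a: Ab, b: Bb, c: Cb})


# Tensorize requires a branch-free full tile; a divisible extent is used
# here so that the inner loop matches the intrinsic's compute pattern.
N2 = 96
A2 = te.placeholder((N2,), name="A2", dtype="float32")
B2 = te.placeholder((N2,), name="B2", dtype="float32")
C2 = te.compute((N2,), lambda i: A2[i] + B2[i], name="C2")
s2 = te.create_schedule(C2.op)
yo, yi = s2[C2].split(C2.op.axis[0], factor=FACTOR)
s2[C2].tensorize(yi, vadd16_intrin())
print(tvm.lower(s2, [A2, B2, C2], simple_mode=True))
```

Stock TVM's tensorize rejects tiles guarded by boundary conditionals; the abstract's approach partitions the loop first so the full-tile region tensorizes cleanly, while the remainder is handled by zero padding, equivalent instruction substitution, or extra computation.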
Authors
LIANG Jiali (梁佳利), HUA Baojian (华保健), SU Shaobo (苏少博)
School of Software Engineering, University of Science and Technology of China, Hefei 230000, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2023, No. 2, pp. 374-383 (10 pages)
Funding
Graduate Education Innovation Program of the University of Science and Technology of China (2020YCJC41, 2021YCJC34).
Keywords
Deep learning
Tensor compiler
Domain-specific processor
Tensorization
Loop partition