Abstract
To improve the computational efficiency of two-dimensional matrix convolution in convolutional neural network models, a parallel implementation method was studied on the FT2000 multi-core vector processor. Kernel elements are broadcast to vector registers with a broadcast instruction, the row elements of the convolution matrix are loaded with vector LOAD instructions, and shuffle operations turn the matrix convolution, which is otherwise hard to parallelize, into vectorizable multiply-add operations. Execution efficiency is improved by reducing memory accesses and fully reusing data that has already been loaded. Two common matrix convolution settings were designed: varying the convolution matrix size with a fixed kernel size, and varying the kernel size with a fixed convolution matrix size. The influence of each setting on execution efficiency was compared and analyzed. Comparison experiments were conducted against a server-class multi-core CPU and the TI6678. The results show that FT2000 has a computational advantage over both, with a speedup of up to 11,974 times over the multi-core CPU and 21 times over the TI6678.
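The broadcast, row-load, shuffle, and multiply-add steps described in the abstract can be sketched in plain Python. This is an illustration only, not the paper's FT2000 code: vector registers are simulated with Python lists, the helper names (`broadcast`, `shuffle_left`, `multiply_add`, `conv2d_vectorized`) are hypothetical, and the sketch assumes stride 1, no padding, and CNN-style convolution without flipping the kernel.

```python
def broadcast(scalar, width):
    """Replicate one kernel element across all lanes of a simulated vector register."""
    return [scalar] * width


def shuffle_left(row_vec, shift, width):
    """Simulated shuffle: select `width` lanes starting at lane `shift`."""
    return row_vec[shift:shift + width]


def multiply_add(acc, a, b):
    """Lane-wise acc + a * b, standing in for a vector multiply-add."""
    return [c + x * y for c, x, y in zip(acc, a, b)]


def conv2d_vectorized(matrix, kernel):
    """Valid 2D convolution (no kernel flip) built from the simulated vector ops."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows, out_cols = rows - k_rows + 1, cols - k_cols + 1
    output = []
    for r in range(out_rows):
        acc = [0] * out_cols  # one accumulator vector per output row
        for i in range(k_rows):
            row_vec = matrix[r + i]  # one row load, reused for every kernel column
            for j in range(k_cols):
                k_vec = broadcast(kernel[i][j], out_cols)
                shifted = shuffle_left(row_vec, j, out_cols)
                acc = multiply_add(acc, k_vec, shifted)
        output.append(acc)
    return output


def conv2d_reference(matrix, kernel):
    """Scalar reference: direct sum over the kernel window for each output element."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(matrix) - k_rows + 1
    out_cols = len(matrix[0]) - k_cols + 1
    return [
        [
            sum(matrix[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]
```

In this sketch each input row is loaded once per output row and reused for every kernel column through the simulated shuffle, which mirrors the data-reuse idea in the abstract. A real vector processor would do the same with hardware registers and a fixed lane count.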
Source
Journal of Zhejiang University (Engineering Science)
EI
CAS
CSCD
PKU Core
2018, No. 3, pp. 515-523 (9 pages)
Funding
National Natural Science Foundation of China (60133007, 61572025)
National Key Research and Development Program of China (2016YFB0200401)
Keywords
matrix convolution
vector processor
parallel algorithm
performance optimization
convolutional neural network