
二维矩阵卷积的并行计算方法 (cited by: 8)

Parallel computing method for two-dimensional matrix convolution
Abstract: To improve the computational efficiency of two-dimensional matrix convolution in convolutional neural network models, a parallel implementation method was studied on the FT2000 multi-core vector processor. Each convolution kernel element is broadcast to a vector register with a broadcast instruction, the row elements of the convolution matrix are fetched with vector LOAD instructions, and a shuffle operation turns the matrix convolution, which is otherwise hard to parallelize, into multiply-add operations that can be vectorized. The algorithm thereby gains efficiency by reducing memory accesses and fully reusing data that has already been loaded. Two common matrix convolution scenarios were designed: varying the convolution matrix size with a fixed kernel size, and varying the kernel size with a fixed convolution matrix size. The influence of each scenario on execution efficiency was compared and analyzed. Comparison experiments were run against a server-class multi-core CPU and the TI6678. The results show that the FT2000 has a computational advantage over both, with a speedup of up to 11 974 times over the multi-core CPU and 21 times over the TI6678.
Source: 《浙江大学学报(工学版)》 (Journal of Zhejiang University: Engineering Science), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2018, No. 3, pp. 515-523 (9 pages)
Funding: National Natural Science Foundation of China (60133007, 61572025); National Key Research and Development Program of China (2016YFB0200401)
Keywords: matrix convolution; vector processor; parallel algorithm; performance optimization; convolutional neural network
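
The broadcast, row-load, shuffle, and multiply-add scheme summarized in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' FT2000 code: the function names `conv2d_rowwise` and `conv2d_naive` are hypothetical, Python lists stand in for vector registers, a list slice stands in for the shuffle, and the sketch assumes the CNN-style sliding window without kernel flipping and without padding, none of which the abstract specifies.

```python
# Illustrative sketch only; names and the no-flip, no-padding convention
# are assumptions, not details taken from the paper.

def conv2d_rowwise(matrix, kernel):
    """Sliding-window multiply-accumulate organized one output row at a time."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows, out_cols = rows - k_rows + 1, cols - k_cols + 1
    output = []
    for r in range(out_rows):
        acc = [0.0] * out_cols              # stands in for a vector accumulator
        for i in range(k_rows):
            row = matrix[r + i]             # one "vector load" of a matrix row
            for j in range(k_cols):
                k_bcast = kernel[i][j]      # "broadcast": one scalar for all lanes
                shifted = row[j:j + out_cols]  # "shuffle": realign the loaded row by j
                # lane-wise multiply-add; the loaded row is reused for every j
                acc = [a + k_bcast * x for a, x in zip(acc, shifted)]
        output.append(acc)
    return output


def conv2d_naive(matrix, kernel):
    """Reference version: one scalar sum per output element."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(matrix) - k_rows + 1
    out_cols = len(matrix[0]) - k_cols + 1
    return [
        [
            sum(kernel[i][j] * matrix[r + i][c + j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


if __name__ == "__main__":
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    k = [[1, 0], [0, 1]]
    print(conv2d_rowwise(m, k))
```

In the row-wise version each matrix row is read once per output row and then reused for every kernel column, which mirrors the abstract's point about reducing memory accesses and reusing loaded data. The naive version re-reads matrix elements for each output element.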
