
A matrix multiplication accelerator design supporting an optimized blocking strategy (cited by: 4)
Abstract: Large-scale floating-point matrix multiplication is one of the most time-consuming computational kernels in many applications. Emerging applications often involve large-scale matrices in which at least one dimension is small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, large-scale matrix multiplication must be partitioned into fine-grained sub-block computational tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear-array architecture suffer a large performance drop because they support only a fixed sub-block size. To solve this problem, we propose an efficient optimized blocking strategy and, based on it, implement a matrix multiplier supporting variable sub-block operations on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements (PEs), the multiplier achieves a measured 48 GFLOPS for non-uniform matrix multiplication in real applications at 150 MHz, while requiring only 4.8 GB/s of memory bandwidth. Experimental results show that the proposed blocking strategy improves performance by up to 12% over traditional blocking algorithms.
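The accelerator itself is hardware, but the blocking idea the abstract describes can be sketched in software. The following minimal illustration is not from the paper; the function name and its block-size parameters are hypothetical. The point it shows: with variable sub-block sizes, `block_m`/`block_n`/`block_k` can be chosen to match a non-uniform matrix's small dimension, whereas a fixed-size scheme would pad or under-utilize the array.

```python
def blocked_matmul(A, B, block_m, block_n, block_k):
    """Blocked (tiled) matrix multiply C = A @ B for row-major lists of lists.

    A is m x k, B is k x n. Each (i0, j0, k0) triple below corresponds to one
    fine-grained sub-block task of the kind that must fit in on-chip memory;
    the three block sizes are independently tunable, mimicking a variable
    sub-block accelerator rather than a fixed square tile.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block_m):          # sub-block rows of A / C
        for j0 in range(0, n, block_n):      # sub-block columns of B / C
            for k0 in range(0, k, block_k):  # sub-block reduction dimension
                # Multiply-accumulate one sub-block pair into C.
                for i in range(i0, min(i0 + block_m, m)):
                    for j in range(j0, min(j0 + block_n, n)):
                        s = 0.0
                        for kk in range(k0, min(k0 + block_k, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C
```

Any choice of block sizes yields the same product; what changes is the shape of each sub-block task, which is exactly the degree of freedom a fixed-blocking multiplier lacks when one matrix dimension is small.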
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University core journal, 2016, Issue 9, pp. 1748-1754 (7 pages).
Funding: National 863 Program (2012AA012706); National Natural Science Foundation of China (61272145).
Keywords: FPGA; non-uniform matrix; matrix multiplication; blocking strategy

References (9)

  • 1 Zhang Ting. Research on key technology of accelerating floating-point matrix multiplication based on FPGA in embedded environment[D]. Changsha: Hunan University, 2013: 361-367. (in Chinese)
  • 2 Jang J-W, Choi S, Prasanna V K. Area and time efficient implementation of matrix multiplication on FPGAs[C]//Proc of the International Conference on Field-Programmable Technology (FPT'02), 2002: 93-100.
  • 3 Zhuo L, Prasanna V K. Scalable and modular algorithms for floating point matrix multiplication on FPGAs[C]//Proc of the 18th International Parallel and Distributed Processing Symposium, 2004: 92. doi: 10.1109/IPDPS.2004.1303036.
  • 4 Jang J-W, Choi S, Prasanna V K. Energy- and time-efficient matrix multiplication on FPGAs[C]//Proc of the International Conference on VLSI Design (VLSI'2005), 2005: 1305-1319.
  • 5 Dou Y, Vassiliadis S, Kuzmanov G K. 64-bit floating-point FPGA matrix multiplication[C]//Proc of the International Symposium on Field-Programmable Gate Arrays (FPGA'05), 2005: 86-95.
  • 6 Zhuo L, Prasanna V K. Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems[J]. IEEE Transactions on Parallel and Distributed Systems, 2007, 18(4): 433-448.
  • 7 Kumar V, Joshi S, Patkar S, et al. FPGA based high performance double precision matrix multiplication[C]//Proc of the International Conference on VLSI Design (VLSI'2009): 341-346.
  • 8 Jovanovic Z, Milutinovic V. FPGA accelerator for floating-point matrix multiplication[J]. IET Computers & Digital Techniques, 2012, 6(4): 249-256.
  • 9 Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25(2): 1097-1105.
