Abstract
Large-scale floating-point matrix multiplication is one of the most time-consuming computational kernels in many application domains. In emerging applications, large matrices often have at least one small dimension; we call such matrices non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, a large-scale matrix multiplication usually has to be partitioned into fine-grained sub-block computation tasks. Most existing hardware matrix multipliers with a linear array architecture support only a fixed sub-block size and therefore suffer a severe performance loss when accelerating non-uniform matrix multiplication. To solve this problem, we propose an efficient optimized blocking strategy. Based on it, we implement a matrix multiplier that supports variable sub-block sizes on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements (PEs), the multiplier achieves a measured performance of 48 GFLOPS at 150 MHz for non-uniform matrix multiplication in real applications, while requiring only 4.8 GB/s of memory bandwidth. Experimental results show that the proposed blocking strategy improves performance by up to 12% compared with traditional blocking algorithms.
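To make the abstract's figures concrete, the following is a minimal Python sketch, not the paper's method. It rests on two assumptions that the abstract does not state: each PE completes one multiply and one add per cycle (2 FLOPs), and edge blocks of a fixed-size tiling are zero-padded to full size. The function names and the 4096 x 32 example shape are hypothetical and chosen only for illustration.

```python
import math


def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Theoretical peak in GFLOPS.

    ASSUMPTION: each PE performs one multiply and one add per cycle
    (2 FLOPs); the abstract does not state this.
    """
    return num_pes * clock_mhz * flops_per_pe_per_cycle / 1000.0


def useful_work_ratio(m, n, block_m, block_n):
    """Fraction of computed output elements that are useful when an
    m x n result is tiled with block_m x block_n blocks.

    ASSUMPTION: edge blocks are zero-padded to the full block size,
    a simplified model that is not taken from the paper.
    """
    tiles_m = math.ceil(m / block_m)
    tiles_n = math.ceil(n / block_n)
    computed = tiles_m * block_m * tiles_n * block_n
    return (m * n) / computed


# Figures from the abstract: 224 PEs, 150 MHz, 48 GFLOPS measured.
peak = peak_gflops(224, 150)
efficiency = 48.0 / peak

# Hypothetical non-uniform result matrix with one small dimension.
fixed_ratio = useful_work_ratio(4096, 32, 224, 224)     # fixed square block
variable_ratio = useful_work_ratio(4096, 32, 224, 32)   # block fitted to the small dimension

print(f"peak = {peak:.1f} GFLOPS, efficiency = {efficiency:.0%}")
print(f"useful work: fixed = {fixed_ratio:.2f}, variable = {variable_ratio:.2f}")
```

Under these assumptions, a fixed square block wastes most of the computation on padding when one dimension is small, whereas a block size fitted to the small dimension keeps nearly all of the work useful. This is the general motivation for variable sub-block support, shown here only in a simplified model.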
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
Peking University Core Journals (北大核心)
2016, No. 9, pp. 1748-1754 (7 pages)
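Funding
National High Technology Research and Development Program of China (863 Program) (2012AA012706)
National Natural Science Foundation of China (61272145)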
Keywords
FPGA
non-uniform matrix
matrix multiplication
blocking strategy