
Performance Optimization of Sparse Matrix-Vector Multiplication on Intel Xeon Phi

Efficient Sparse Matrix-vector Multiplication on Intel Xeon Phi
Abstract: Sparse matrix-vector multiplication (SpMV) is a key computational kernel in scientific computing, for example in linear solver systems. On the Intel Xeon Phi many-core architecture, traditional SpMV algorithms suffer from low SIMD utilization, high overhead from irregular memory accesses, and load imbalance, and therefore fail to exploit the processor's computational capability. Targeting the architectural characteristics of Intel Xeon Phi, this paper proposes a general parallel SpMV algorithm based on a blocked compressed storage representation: (1) on top of the ELLPACK storage format, the matrix is blocked by column and compressed, which increases the density of non-zeros and raises SIMD utilization; (2) careful data reorganization preserves the inherent locality of the matrix non-zeros, improving data reuse and reducing memory-access overhead; (3) after compression, the matrix is partitioned into blocks of approximately equal size that are statically and evenly assigned to different cores, balancing the load. Experimental results show that, compared with the CSR kernel of the Intel MKL math library on Xeon Phi, the proposed algorithm achieves a higher ratio of floating-point operations to memory accesses and is on average 2.05 times faster.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2016, No. 4, pp. 818-823 (6 pages)
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2012AA010901, 2012AA010902)
Keywords: sparse matrix-vector multiplication; data layout reorganization; Intel Xeon Phi; blocked compressed storage
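
A minimal sketch of the baseline this record describes may help make the three techniques concrete. The C code below implements a plain column-major ELLPACK SpMV with an OpenMP static partition of rows across cores; the struct and function names (EllMatrix, ell_spmv) are illustrative assumptions, not the paper's code, and the paper's additional column blocking and padding compression are only indicated in comments.

/*
 * SpMV sketch: column-major ELLPACK with static row partitioning.
 * NOT the paper's exact blocked-compressed format. The paper further
 * blocks the matrix by column and compresses away zero padding before
 * assigning near-equal-size blocks to cores; here every row is simply
 * padded to the width of the widest row (max_nnz).
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int     rows;     /* number of matrix rows                        */
    int     max_nnz;  /* widest row, i.e. slots stored per row        */
    int    *col_idx;  /* rows * max_nnz column indices (padded)       */
    double *values;   /* rows * max_nnz values (padding value = 0.0)  */
} EllMatrix;

/* y = A * x. Slot k of row i lives at k * rows + i, so consecutive
 * rows are contiguous in memory and the row loop maps well onto wide
 * SIMD lanes -- the utilization point the abstract makes for Xeon Phi. */
static void ell_spmv(const EllMatrix *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static) /* equal static share per core */
    for (int i = 0; i < A->rows; i++) {
        double sum = 0.0;
        for (int k = 0; k < A->max_nnz; k++) {
            int slot = k * A->rows + i;       /* column-major ELLPACK slot */
            sum += A->values[slot] * x[A->col_idx[slot]];
        }
        y[i] = sum;
    }
}

int main(void)
{
    /* Tiny 3x3 example: A = [1 0 2; 0 3 0; 4 5 6], widest row has 3 nnz. */
    int rows = 3, max_nnz = 3;
    EllMatrix A = { rows, max_nnz,
                    malloc(rows * max_nnz * sizeof(int)),
                    malloc(rows * max_nnz * sizeof(double)) };
    int    cols[3][3] = { {0, 2, 0}, {1, 0, 0}, {0, 1, 2} }; /* pad col = 0 */
    double vals[3][3] = { {1, 2, 0}, {3, 0, 0}, {4, 5, 6} }; /* pad val = 0 */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < max_nnz; k++) {
            A.col_idx[k * rows + i] = cols[i][k];
            A.values[k * rows + i]  = vals[i][k];
        }

    double x[3] = { 1.0, 1.0, 1.0 }, y[3];
    ell_spmv(&A, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]); /* expect [3, 3, 15] */
    free(A.col_idx);
    free(A.values);
    return 0;
}

The column-major layout trades zero padding (every row stored at max_nnz width) for contiguous, vectorizable access; that padding overhead on irregular matrices is precisely what the paper's per-column-block compression is meant to remove, while the schedule(static) split over near-equal-size blocks corresponds to its load-balancing step.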
