期刊文献+

对角线稀疏矩阵的SpMV自适应性能优化 被引量:4

Auto-Tuning of SpMV for Diagonal Sparse Matrices
下载PDF
导出
摘要 稀疏矩阵向量乘(SpMV)是科学计算中常用的内核之一,其运行速率跟非零元分布相关.针对对角线稀疏矩阵,提出了压缩行片段对角(compressed row segment diagonal,CRSD)存储格式.它利用"对角线格式"有效描述矩阵的对角线分布,区别于以往通用的计算方法,CRSD通过对给定应用的对角线稀疏矩阵采样再进行特定的优化.并且在软件安装阶段,通过自适应的方法选取适合具体运行平台的最优SpMV实现.在CPU端进行多线程并行化实现时,自适应调优过程中收集的信息还被用于线程间任务划分,以实现负载平衡.同时完成CRSD存储格式在GPU端的实现,并根据GPU端计算与访存的特点进行优化.实验结果表明:在Intel和AMD的多核平台使用相同线程数的情况下,与DIA相比,使用CRSD的加速比可以达到2.37X(平均1.7X);与CSR相比,可以达到4.6X(平均2.1X). SpMV (sparse matrix-vector multiplication) is an important computational kernel in scientific applications. Its performance is mainly affected by the nonzero distribution of sparse matrices. In this paper, we propose a new storage format for diagonal sparse matrices, called CRSD (compressed row segmented with diagonal-pattern). We design a diagonal pattern to represent the diagonal distribution. Unlike general purpose methods, our CRSD based SpMV implementation is application specific since it needs to sample the sparse matrix of one given application. The optimal implementation is selected automatically for a given hardware platform during the software installation phase. The information collected during auto-tuning process is also utilized for parallel implementation to maintain load-balance. The implementation is also tuned on GPU platform. Experimental results demonstrate that the speedup reaches up to 2.37X(1.7X on average) in comparison with DIA and 4. 6X(2.1X on average) in comparison with CSR under the same number of threads on Intel and AMD platforms.
出处 《计算机研究与发展》 EI CSCD 北大核心 2013年第3期648-656,共9页 Journal of Computer Research and Development
基金 国家"八六三"高技术研究发展计划基金项目(2009AA01A129 2009AA01A134) 国家"核高基"重大科技专项基金项目(2009ZX01036-001-002) 中国科学院知识创新工程重大项目课题(KGCX1-YW-13) 国家重大科研装备研制项目(ZDYZ2008-2) 国家自然科学基金项目(61100073 61133005 61100066) 中国科学院研究生科技创新与社会实践资助专项
关键词 CRSD 自适应性能优化SpMV 对角线格式 对角线稀疏矩阵 GPU 科学应用 compressed row segmented with diagonal-pattern(CRSD) auto-tuning SpMV diagonalpattern diagonal sparse matrix GPU scientific application
  • 相关文献

参考文献1

二级参考文献15

  • 1袁伟,张云泉,孙家昶,李玉成.国产万亿次机群系统NPB性能测试分析[J].计算机研究与发展,2005,42(6):1079-1084. 被引量:13
  • 2Whaley R C,Petitet A,Dongarra J.Automated empirical optimization of software and ATLAS project[J].Parallel Computing,2001,27(1/2):3-35.
  • 3JackDongarra[OL].[2008-03-08].http://netlib.org/utk/people/JackDongarra/PAPERS/gco_search.pdf.
  • 4Moore Cordon E.Cramming more components onto integrated circuits[J].Electronics,1965,38(8):114-117.
  • 5Bilmes Jeff,Asanovic Krste,Chin Chee-Whye,et al.Optimizing matrix multiply using PHiPAC:A portable,high-performance,ANSI C coding methodology[C]//Int Conf on Supercomputing.New York:ACM,1997:340-347.
  • 6Lawson C L,Hanson R J,Kincaid D R,et al.Algorithm 539:Basic linear algebra subprograms for FORTRAN usage[J].ACM Trans on Mathematical Software,1979,5(3):324-325.
  • 7Vuduc Richard W.Automatic performance tuning of sparse matrix kernels[D].Berkeley:University of California,Berkeley,2003.
  • 8Im Eun-Jin,Yelick Katherine A,Vuduc Richard.Sparsity:Framework for optimizing sparse matrix-vector multiply[J].Int Journal of High Performance Computing Applications,2004,18(1):135-158.
  • 9Frigo Metteo,Johnson Steven G.FFTW:An adaptive software architecture for the FFT[C]//Proc of IEEE Int Conf on Acoustics,Speech,and Signal Processing.1998:1381-1384.
  • 10Frigo Matteo,Johson Steven G.The design and implementation of FFTW3[J].Proc of the IEEE:Special Issue on Program Generation,Optimization,and Platform Adaptation,2005,93(2):216-231.

共引文献1

同被引文献15

引证文献4

二级引证文献2

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部