Cache performance optimization of irregular sparse matrix multiplication on modern multi-core CPU and GPU

Cache performance optimization of irregular sparse matrix multiplication on modern multi-core CPU and GPU

下载PDF

导出

摘要 This paper focuses on how to optimize the cache performance of sparse matrix-matrix multiplication(SpGEMM).It classifies the cache misses into two categories;one is caused by the irregular distribution pattern of the multiplier-matrix,and the other is caused by the multiplicand.For each of them,the paper puts forward an optimization method respectively.The first hash based method removes cache misses of the 1 st category effectively,and improves the performance by a factor of 6 on an Intel 8-core CPU for the best cases.For cache misses of the 2nd category,it proposes a new cache replacement algorithm,which achieves a cache hit rate much higher than other historical knowledge based algorithms,and the algorithm is applicable on CELL and GPU.To further verify the effectiveness of our methods,we implement our algorithm on GPU,and the performance perfectly scales with the size of on-chip storage. This paper focuses on how to optimize the cache performance of sparse matrix-matrix multiplication （SpGEMM）.It classifies the cache misses into two categories：one is caused by the irregular distribution pattern of the multiplier-matrix,and the other is caused by the multiplicand.For each of them,the paper puts forward an optimization method respectively.The first hash based method removes cache misses of the 1 st category effectively,and improves the performance by a factor of 6 on an Intel 8-core CPU for the best cases.For cache misses of the 2nd category,it proposes a new cache replacement algorithm,which achieves a cache hit rate much higher than other historical knowledge based algorithms,and the algorithm is applicable on CELL and GPU.To further verify the effectiveness of our methods,we implement our algorithm on GPU,and the performance perfectly scales with the size of on-chip storage.

作者刘力 LiuLi Yang Guang wen

机构地区 Department of Computer Science and Technology

出处《High Technology Letters》 EI CAS 2013年第4期339-345,共7页 高技术通讯（英文版）

基金 Supported by the National High Technology Research and Development Programme of China(No.2010AA012302,2009AA01 A134) Tsinghua National Laboratory for Information Science and Technology(TNList)Cross-discipline Foundation

关键词高速缓存性能优化矩阵乘法稀疏矩阵 GPU CPU 缓存替换算法多核心 sparse matrix multiplication cache miss scalability multi-core CPU GPU

分类号 TP301.6 [自动化与计算机技术—计算机系统结构]

引文网络
相关文献

参考文献12

1Cullum J K, Willoughby R A. Lanczos Algorithms forLarge Symmetric Eigenvalue Computations. Boston: Birkh auser, 1985. 252-268.
2Rabin M O, Vazirani V V. Maximum matchings in gener- al graphs through randomization. Journal of Algorithms, 1989, 10(4) :557-567.
3Yuster R, Zwick U. Detecting short directed cycles using rectangular matrix multiplication and dynamic program- ming. In: Proceedings of the 15th annual ACM-SIAM symposium on discrete algorithms, Philadelphia, USA, 2004. 254-260.
4Zwick U. All pairs shortest paths using bridging sets and rectangular matrix multiplication. Journal of the ACM, 2002, 49(3) :289-317.
5Gustavson F G. Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Transac- tions on Mathematical Software ( TOMS ), 1978, 4 ( 3 ) : 250-269.
6Yuster R, Zwick U. Fast sparse matrix multiplication. ACM Transactions on Algorithms ( TALG) , 2005, 1 ( 1 ) : 2-13.
7Gilbert J R, Moler C, Schreiber R. Sparse matrices in Matlab: design and implementation. SlAM Journal on Matrix Analysis and Applications, 1992, 13( 1 ) :333-356.
8Buluc A, Gilbert J R. On the Representation and multi- plication of hypersparse matrices. In: 2008 IEEE Inter-national Symposium on Parallel and Distributed Process- ing. Miami, USA, 2008. 1-11.
9Buluc A, Gilbert J R. Challenges and advances in paral- lel sparse matrix-matrix multiplication. In: Proceedings of the 37th International Conference on Parallel Process- ing, Washington DC, USA, 2008. 503-510.
10Ryoo S, Rodrigues C I, Baghsorkhi S S, et al. Optimiza- tion principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. New York, USA, 2008.73-82.

1Fu Xiong Wang Ruchuan.CTUNING:A REUSE DISTANCE BASED CACHE PERFORMANCE TUNING TOOL[J].Journal of Electronics(China),2009,26(4):517-524.
2争论：多核心显示解决方案，我们真的需要吗？[J].新电脑,2005(8):29-29.
3Strategic Porcine Multiplication Agreement in China[J].Feed & Livestock,2014,7(4):14-14.
4任敬儒,陆穗苓,解卫,何瑁,张乃健,王士智,白光治.大天顶角的多核心簇射[J].高能物理与核物理,1997,21(2):116-120.
5窦朝晖,胡庆丰,张秀山.主维对矩阵运算性能的影响[J].国防科技大学学报,1999,21(3):71-73.
6三头六臂的处理器[J].科技新时代,2005(7):30-30.
7李佳佳,张秀霞,谭光明,陈明宇.选择稀疏矩阵乘法最优存储格式的研究[J].计算机研究与发展,2014,51(4):882-894. 被引量：10
8周鑫,谭毅华.基于稀疏矩阵变换降维的高光谱图像目标探测算法[J].计算机与数字工程,2016,44(1):153-157.
9H. Ansari-Toroghy,F. Farshadifar.ON MULTIPLICATION AND COMULTIPLICATION MODULES[J].Acta Mathematica Scientia,2011,31(2):694-700. 被引量：2
10肖振久,胡驰,陈虹.改进的RSA算法在数字签名中的应用[J].计算机工程与应用,2014,50(17):106-109. 被引量：3

High Technology Letters

2013年第4期

浏览历史

内容加载中请稍等...

Cache performance optimization of irregular sparse matrix multiplication on modern multi-core CPU and GPU

参考文献12

相关作者

相关机构

相关主题

浏览历史