This paper focuses on how to optimize the cache performance of sparse matrix-matrix multiplication(SpGEMM).It classifies the cache misses into two categories;one is caused by the irregular distribution pattern of the ...This paper focuses on how to optimize the cache performance of sparse matrix-matrix multiplication(SpGEMM).It classifies the cache misses into two categories;one is caused by the irregular distribution pattern of the multiplier-matrix,and the other is caused by the multiplicand.For each of them,the paper puts forward an optimization method respectively.The first hash based method removes cache misses of the 1 st category effectively,and improves the performance by a factor of 6 on an Intel 8-core CPU for the best cases.For cache misses of the 2nd category,it proposes a new cache replacement algorithm,which achieves a cache hit rate much higher than other historical knowledge based algorithms,and the algorithm is applicable on CELL and GPU.To further verify the effectiveness of our methods,we implement our algorithm on GPU,and the performance perfectly scales with the size of on-chip storage.展开更多
Cache performance tuning tools are conducive to develop program with good locality and fully use cache to decrease the influence caused by speed gap between processor and memory. This paper introduces the design and i...Cache performance tuning tools are conducive to develop program with good locality and fully use cache to decrease the influence caused by speed gap between processor and memory. This paper introduces the design and implementation of a cache performance tuning tool named CTuning, which employs a source level instrumentation method to gather program data access information, and uses a limited reuse distance model to analyze cache behavior. Experiments on 183.equake improve average performance more than 6% and show that CTuning is proficient not only in locating cache performance bottlenecks to guide manual code transformation, but also in analyzing cache behavior relationship among variables, thus to direct manual data reorganization.展开更多
基金Supported by the National High Technology Research and Development Programme of China(No.2010AA012302,2009AA01 A134)Tsinghua National Laboratory for Information Science and Technology(TNList)Cross-discipline Foundation
文摘This paper focuses on how to optimize the cache performance of sparse matrix-matrix multiplication(SpGEMM).It classifies the cache misses into two categories;one is caused by the irregular distribution pattern of the multiplier-matrix,and the other is caused by the multiplicand.For each of them,the paper puts forward an optimization method respectively.The first hash based method removes cache misses of the 1 st category effectively,and improves the performance by a factor of 6 on an Intel 8-core CPU for the best cases.For cache misses of the 2nd category,it proposes a new cache replacement algorithm,which achieves a cache hit rate much higher than other historical knowledge based algorithms,and the algorithm is applicable on CELL and GPU.To further verify the effectiveness of our methods,we implement our algorithm on GPU,and the performance perfectly scales with the size of on-chip storage.
基金Sponsored by the National Natural Science Foundation of China (No.60573141, 60773041)National 863 High Tech- nology Research Program of China (No.2007AA01Z404, 2007AA01Z478)+2 种基金High Technology Research Programme of Jiangsu Province (No.BG2006001)Key Laboratory of Information Technology Processing of Jiangsu Province (kjs06006)Project of NJUPT (NY207135)
文摘Cache performance tuning tools are conducive to develop program with good locality and fully use cache to decrease the influence caused by speed gap between processor and memory. This paper introduces the design and implementation of a cache performance tuning tool named CTuning, which employs a source level instrumentation method to gather program data access information, and uses a limited reuse distance model to analyze cache behavior. Experiments on 183.equake improve average performance more than 6% and show that CTuning is proficient not only in locating cache performance bottlenecks to guide manual code transformation, but also in analyzing cache behavior relationship among variables, thus to direct manual data reorganization.