期刊文献+

一种六边形循环分块的Jacobi计算优化方法

Hexagonal Loop Tiling for Jacobi Computation Optimization Method
下载PDF
导出
摘要 Jacobi计算是一种模板计算,在科学计算领域具有广泛的应用.围绕Jacobi计算的性能优化是一个经典的课题,其中循环分块是一种较有效的优化方法.现有的循环分块主要关注分块对并行通信和程序局部性的影响,缺少对负载均衡和向量化等其他因素的考虑.面向多核计算架构,分析比较不同分块方法,并选择一种先进的六边形分块作为加速Jacobi计算的主要方法.在分块大小选择上,综合考虑分块对程序向量化效率、局部性和计算核负载均衡等多方面的影响,提出一种六边形分块大小选择算法Hexagon_TSS.实验结果表明所提算法相对于原始串行程序计算方法,最好情况可将L1数据缓存失效率降低至其5.46%,最大加速比可达24.48,并且具有良好的可扩展性. Jacobi computation is a kind of stencil computation,which has been widely applied in the field of scientific computing.The performance optimization of Jacobi computation is a classic topic,where loop tiling is an effective optimization method.The existing loop tiling methods mainly focus on the impact of tiling on parallel communication and program locality and fail to consider other factors such as load balancing and vectorization.This study analyzes and compares several tiling methods based on multi-core computing architecture and chooses an advanced hexagonal tiling as the main method to accelerate Jacobi computation.For tile size selection,this study proposes a hexagonal tile size selection algorithm called Hexagon_TSS by comprehensively considering the impact of tiling on load balancing,vectorization efficiency,and locality.The experimental results show that the L1 data cache miss rate can be reduced to 5.46%of original serial program computation in the best case by Hexagon_TSS,and the maximum speedup reaches 24.48.The proposed method also has excellent scalability.
作者 屈彬 刘松 张增源 马洁 伍卫国 QU Bin;LIU Song;ZHANG Zeng-Yuan;MA Jie;WU Wei-Guo(School of Computer Science and Technology,Xi’an Jiaotong University,Xi’an 710049,China)
出处 《软件学报》 EI CSCD 北大核心 2024年第8期3721-3738,共18页 Journal of Software
基金 国家自然科学基金(62002279) 陕西省自然科学基础研究计划一般项目(青年)(2020JQ-077)。
关键词 Jacobi计算 六边形分块方法 分块大小选择 性能优化 多核架构 Jacobi computation hexagonal tiling method tile size selection performance optimization multi-core architecture
  • 相关文献

参考文献6

二级参考文献100

  • 1关治 陈景良.数值计算方法[M].北京:清华大学出版社,2001..
  • 2NVIDIA. CUDA C programming guide 3.2[EB/OL]. http://developer. download. nvidia. com/compute/cuda/3_2/toolkit/docs/ CUDA_C_Programming Guide. pdf, March, 2011.
  • 3Xue J. Aggressive loop fusion for improving locality and parallelism [ J ]. Parallel and Distributed Processing and Applications, 2005,3758 : 224 -238.
  • 4Vasilache N, Bastoul C, Cohen A. Polyhedral code generation in the real world[ C]. Proceedings of CC'06, 2006 : 185-201.
  • 5Baskaran M M, Ramanujam J, Sadayappan P. Automatic C-to-CUDA code generation for affine [ C ]. Proceedings of CC' 10, 2010 : 185-201.
  • 6Bondhugula UKR. Bondhugula UKR effective automatic parallelization and locality optimization using the polyhedral model[ D]. Columbus: Ohio State University ,2010.
  • 7Huang Q, Xue J, Vera X. Code tiling for improving the cache performance of PDE solvers[ C]. Proceedings of ICPP'03,2003:615-625.
  • 8Axelsson O, Lindskog G. Constant wavefront iteration methods for 9 and 15 point difference matrices[J]. Computing, 1991,46 (3) : 233 -252.
  • 9Song Y, Li Z. New tiling techniques to improve cache temporal locality[ C]. Proceedings of PLDI'99, 1999:215-228.
  • 10Di P, Xue J. Model-driven tile size selection for DOACROSS loops on GPUs [ C ]. Proceedings of Euro-Par'11, 2011 : 1-12.

共引文献24

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部