
Research and Implementation of Effective Jacobi Iteration Algorithms on GPU (cited by: 2)
Abstract: The Jacobi iteration method is a commonly used loop computation for solving systems of partial differential equations. Data dependences between statements in the Jacobi loop nest hinder its efficient implementation on parallel platforms such as the Graphics Processing Unit (GPU). Using mathematical proof and experimental validation, this paper compares several loop optimization strategies that eliminate the inter-statement data dependences and improve data locality, making better use of low-latency shared memory and yielding higher execution performance. In addition, a tile size selection model partitions the computation data so that the GPU's computing resources are fully used, which improves performance further. Experimental results show that the Jacobi odd-even duplication algorithm runs more than four times faster on the GPU than the traditional parallel Jacobi algorithm.
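The abstract does not spell out the odd-even duplication scheme, so the following is only a minimal sketch under one common reading: keep two copies of the grid and alternate the read and write roles by the parity of the time step, which removes the separate copy-back statement and the dependence it creates. The function names, the 1D three-point stencil, and the fixed boundary values are illustrative assumptions, not the authors' code.

```python
def jacobi_copy(u, steps):
    """Traditional two-statement form: update into tmp, then copy back."""
    n = len(u)
    u = list(u)
    tmp = list(u)
    for _ in range(steps):
        for i in range(1, n - 1):      # S1: stencil update reads u, writes tmp
            tmp[i] = (u[i - 1] + u[i + 1]) / 2.0
        for i in range(1, n - 1):      # S2: copy back, depends on all of S1
            u[i] = tmp[i]
    return u


def jacobi_odd_even(u, steps):
    """Assumed odd-even duplication: two grid copies, roles swap each step."""
    n = len(u)
    buf = [list(u), list(u)]           # boundary values are present in both copies
    for t in range(steps):
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        for i in range(1, n - 1):      # single statement, no copy-back needed
            dst[i] = (src[i - 1] + src[i + 1]) / 2.0
    return buf[steps % 2]              # the copy written by the last step


# Hypothetical usage: 1D Laplace problem with boundary values 0 and 1.
grid = [0.0, 0.0, 0.0, 0.0, 1.0]
result = jacobi_odd_even(grid, 200)
```

In both versions every iteration of an inner loop is independent of the others, so on a GPU each index `i` could be assigned to its own thread. The second version needs only one kernel-like pass per time step instead of two.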
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, PKU Core Journal, 2012, No. 9, pp. 1962-1967 (6 pages)
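Funding: Supported by the Key Project of Science and Technology Research of the Ministry of Education (108008) and the National High Technology Research and Development Program of China ("863" Program) (2008AA01Z109)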
Keywords: Graphics Processing Unit (GPU); Compute Unified Device Architecture (CUDA); Jacobi iteration method; loop optimization


