
A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs (cited by: 1)

Abstract: In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods, which rely on circular queues implemented predominantly in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning multiple time steps in stencil computations, so that circular queues can be implemented effectively in both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93x over methods that use circular queues implemented with shared memory only.
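The contrast the abstract draws between indirectly addressable shared memory and registers can be illustrated with a small sketch. It is written in plain Python rather than GPU code, and it is not the paper's method: it does not model the reuse across time steps or the automatic placement framework. It only shows, for a hypothetical 1D 3-point stencil, how a circular queue can be kept either in an indexable buffer (the shared-memory style) or in a fixed set of scalars that are rotated by name (the register style, since registers cannot be indexed at run time). The function names and weights are assumptions made for illustration.

```python
def stencil_reference(u, w):
    """Direct 3-point stencil on interior points; boundary values are copied."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = w[0] * u[i - 1] + w[1] * u[i] + w[2] * u[i + 1]
    return out


def stencil_indexed_queue(u, w):
    """Circular queue in an indexable buffer (analogy: shared memory)."""
    out = list(u)
    if len(u) < 3:
        return out
    queue = [u[0], u[1], 0.0]  # slots for u[i-1], u[i], u[i+1]
    head = 0                   # slot that currently holds u[i-1]
    for i in range(1, len(u) - 1):
        queue[(head + 2) % 3] = u[i + 1]  # load only the one new value
        out[i] = (w[0] * queue[head]
                  + w[1] * queue[(head + 1) % 3]
                  + w[2] * queue[(head + 2) % 3])
        head = (head + 1) % 3  # advance; the oldest slot is reused next
    return out


def stencil_rotating_scalars(u, w):
    """Circular queue in named scalars (analogy: registers, no indexing)."""
    out = list(u)
    if len(u) < 3:
        return out
    lo, mid = u[0], u[1]
    for i in range(1, len(u) - 1):
        hi = u[i + 1]              # load only the one new value
        out[i] = w[0] * lo + w[1] * mid + w[2] * hi
        lo, mid = mid, hi          # rotate by renaming, not by index
    return out


if __name__ == "__main__":
    u = [1.0, 2.0, 4.0, 8.0, 16.0]
    w = (0.25, 0.5, 0.25)
    print(stencil_reference(u, w))
    print(stencil_indexed_queue(u, w))
    print(stencil_rotating_scalars(u, w))
```

All three functions compute the same values. The two queue variants differ only in where the three live values are held, which is the kind of placement choice the abstract says the framework makes automatically.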
Authors: Yang Yang, Hui-Min Cui, Xiao-Bing Feng, Jing-Ling Xue (杨杨, 崔慧敏, 冯晓兵, 薛京灵). Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100190, China; Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia.
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), 2012, Issue 1, pp. 57-74 (18 pages). Indexed in SCIE, EI, CSCD.
Funding: Supported in part by the National Basic Research 973 Program of China under Grant Nos. 2011CB302504 and 2011ZX01028-001-002, the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China (NSFC) under Grant No. 60970024, and the Innovation Research Group of NSFC under Grant No. 60921002.
Keywords: stencil computation, circular queue, GPU, occupancy, register


