A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs (cited by: 1)


Abstract: In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues implemented predominantly in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning multiple time steps in stencil computations, so that circular queues can be implemented effectively in both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93x over methods that use circular queues implemented with shared memory only.
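As a rough illustration of the two storage styles the abstract contrasts, the sketch below runs one 3-point averaging step over a 1D array on the CPU. It is not the paper's implementation: the stencil, the function names, and the queue depth of three are assumptions chosen for brevity. One version keeps the sliding window in an array addressed by a modulo index (the kind of indirect addressing shared memory allows), and the other keeps it in three named variables that are rotated by explicit moves (the only option when slots cannot be indexed at run time, as with registers).

```python
def stencil_reference(a):
    """One 3-point averaging step; the two boundary values are copied unchanged."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def sweep_indexed_queue(a):
    """Circular queue in an array, addressed by a modulo index (hypothetical helper)."""
    n = len(a)
    out = list(a)
    if n < 3:
        return out
    queue = [a[0], a[1], a[2]]  # the three most recently loaded inputs
    head = 0                    # logical position of the oldest element
    for i in range(1, n - 1):
        left = queue[head % 3]
        mid = queue[(head + 1) % 3]
        right = queue[(head + 2) % 3]
        out[i] = (left + mid + right) / 3.0
        if i + 2 < n:
            queue[head % 3] = a[i + 2]  # overwrite the oldest slot with the next input
            head += 1
    return out


def sweep_named_slots(a):
    """Same queue held in three named variables, rotated by explicit moves (hypothetical helper)."""
    n = len(a)
    out = list(a)
    if n < 3:
        return out
    r0, r1, r2 = a[0], a[1], a[2]
    for i in range(1, n - 1):
        out[i] = (r0 + r1 + r2) / 3.0
        if i + 2 < n:
            r0, r1, r2 = r1, r2, a[i + 2]  # shift the window by one element
    return out
```

Both sweeps load each input element once and produce the same result as the reference loop. The difference lies only in how the window is addressed, which is the trade-off a placement framework of the kind described in the abstract would have to weigh.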

Authors: Yang Yang (杨杨), Hui-Min Cui (崔慧敏), Xiao-Bing Feng (冯晓兵), Jing-Ling Xue (薛京灵). State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100190, China; Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia.

Affiliations: State Key Laboratory of Computer Architecture; Graduate University of Chinese Academy of Sciences; Programming Languages and Compilers Group

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, CSCD; 2012, Issue 1, pp. 57-74 (18 pages).

Funding: Supported in part by the National Basic Research 973 Program of China under Grant Nos. 2011CB302504 and 2011ZX01028-001-002, the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China (NSFC) under Grant No. 60970024, and the Innovation Research Group of NSFC under Grant No. 60921002.

Keywords: stencil computation, circular queue, GPU, occupancy, register

Classification code: TP391.41 [Automation and Computer Technology: Computer Application Technology]



