
Landing Stencil Code on Godson-T (cited by: 1)

Abstract: The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architecture features together with adequate compiler technology; combined, the two may have a profound impact. This paper presents a case study, using the 1-D Jacobi computation, of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Godson-T has several unique features that are the focus of this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
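As an illustration of the 1-D Jacobi computation named in the abstract, the following is a minimal Python sketch, not the paper's implementation. It assumes a three-point averaging stencil with fixed boundary values, which the abstract does not specify. The function names `jacobi_1d` and `jacobi_1d_partitioned` are hypothetical. The second function splits the array into one contiguous chunk per simulated core; the chunks loosely stand in for data held in per-core scratchpad memory, and the halo reads stand in for accesses to a neighbouring core's scratchpad. Time tiling and full-empty-bit synchronization are not modelled here; the two loop phases simply play the role of a barrier.

```python
# Hypothetical sketch: a 3-point 1-D Jacobi stencil and a chunked variant.
# The stencil form, boundary handling, and all names are assumptions.

def jacobi_1d(a, steps):
    """Reference sweep: each interior point becomes the mean of itself
    and its two neighbours; the two boundary points stay fixed."""
    cur = list(a)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_1d_partitioned(a, steps, cores):
    """Same computation with the array split into one chunk per
    simulated core. Each time step first reads one halo value from
    each neighbouring chunk, then updates every chunk locally."""
    n = len(a)
    if cores < 1 or cores > n:
        raise ValueError("need 1 <= cores <= len(a)")
    bounds = [(c * n // cores, (c + 1) * n // cores) for c in range(cores)]
    chunks = [list(a[lo:hi]) for lo, hi in bounds]
    for _ in range(steps):
        # Phase 1: collect halo values (None marks a global boundary).
        halos = []
        for c in range(cores):
            left = chunks[c - 1][-1] if c > 0 else None
            right = chunks[c + 1][0] if c < cores - 1 else None
            halos.append((left, right))
        # Phase 2: update each chunk using only its own data plus halos.
        new_chunks = []
        for c, chunk in enumerate(chunks):
            left, right = halos[c]
            padded = [left] + chunk + [right]
            new = chunk[:]
            for j in range(len(chunk)):
                l, m, r = padded[j], padded[j + 1], padded[j + 2]
                if l is None or r is None:
                    continue  # global boundary point stays fixed
                new[j] = (l + m + r) / 3.0
            new_chunks.append(new)
        chunks = new_chunks
    return [x for chunk in chunks for x in chunk]
```

In this sketch every core needs exactly one value from each neighbour per time step. This is the kind of neighbour dependence that fine-grain synchronization and tiling across time steps are meant to handle more efficiently than a global barrier.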
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, and CSCD; 2010, Issue 4, pp. 886-894 (9 pages)
Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321602, the National Natural Science Foundation of China under Grant No. 60736012, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z110 and 2009AA01Z103.
Keywords: many-core, stencil, Jacobi, compiler, SPM, fine-grain synchronization

