期刊文献+

面向局部性和并行优化的循环分块技术 被引量:10

Loop Tiling for Optimization of Locality and Parallelism
下载PDF
导出
摘要 循环分块是一种广泛用于改善数据局部性和开发并行性的程序变换优化技术.主要分为2类:固定分块技术和参数化分块技术,系统地总结了这2类技术,并分析了其优缺点.由于分块大小的选择会严重影响分块代码的性能,因此介绍分析了选择最优分块大小的各种方法.此外,总结了循环分块在多级分块、并行性开发和不完美嵌套循环等方面应用的各项技术.通过对循环分块技术当前研究现状的分析,得出如下结论:1)循环分块技术中的计算复杂度和生成代码效率问题还未得到完全解决,如何利用循环边界有效地约束迭代空间并提高数据局部性还需要更深入的研究;2)最优分块大小的选择依然是一个开放式难题,研究清楚分级存储架构中每级分块对性能的影响具有重要的意义;3)从循环分块的应用角度,如何有效地构建面向任意嵌套循环集的自动分块代码生成系统,同时充分利用深度共享存储资源和多核架构实现分块代码的高并行度,也是一个需要深入研究的问题. Loop tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality in modern computer architecture. It is mainly divided into two categories: fixed and parameterized. These two types of tiling technologies are systematically summarized and their advantages and disadvantages are analyzed comprehensively. Since the tile size would significantly affect the performance of the tiled code, various methods of optimal tile size selection are described. Besides, various kinds of technologies applied to multi-level tiling, parallelism exploration and imperfectly nested loops are surveyed in this paper. Based on the detailed analysis of the current researches on loop tiling technologies, several conclusions are drawn as follows: 1) How to balance the trade-off between computation complexity and generation efficiency of tiled code has not been completely solved, and how to use loop boundaries to efficiently bound the iteration spaces for data locality enhancement also needs further study. 2) Optimal tile size selection is still a difficult and open question, and it would be significant to understand the influence of different level tile size in hierarchical memory system on performance. 3) From the perspective of application, how to automatically generate effective tiled code for arbitrarily nested loops needs further research. On the other hand, how to take full advantage of shared hierarchical memory and multi-core architectures to achieve high degree of parallelism for tiled code is another interesting direction.
出处 《计算机研究与发展》 EI CSCD 北大核心 2015年第5期1160-1176,共17页 Journal of Computer Research and Development
基金 国家自然科学基金项目(91330117) 国家"八六三"高技术研究发展计划基金项目(2012AA01A306 2012AA010901)
关键词 循环分块 最优分块大小 程序变换 并行性 性能优化 loop tiling optimal tile size program transformations parallelism performance optimization
  • 相关文献

参考文献83

  • 1Owens J D, Luebke D, Govindaraju N, et al. A survey of general-purpose computation on graphics hardware [J]. Computer Graphics Forum, 2007, 26(1) : 80-113.
  • 2Grosser T, Cohen A, Kelly P, et al. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles [C]//Proc of the 6th Workshop on General Purpose Processor Using Graphics Processing Units. New York: ACM, 2013: 24-31.
  • 3Kaspersky K. Code Optimization: Effective Memory Usage [M]. New Delhi, India: BPB Publications, 2004.
  • 4Baghdadi R, Cohen A, Verdoolaege S, et al. Improved loop tiling based on the removal of spurious false dependences [J]. ACM Trans on Architecture and Code Optimization(TACO) Special Issue on High-Performance Embedded Architectures and Compilers, 2013, 9(4): 1-26.
  • 5Pouchet L N, Bondhugula U, Bastoul C, et al. Loop transformations: Convexity, pruning and optimization [C // Proc of the 38th ACM SIGPLAN-SIGACT Symp on Principles of Programming Languages (POPL'll). New York: ACM, 2011:549-562.
  • 6Lain M S, Wolf M E. A data locality optimizing algorithm [C] //Proc of the 12th ACM SIGPLAN Conf on Programming LangUage Design and Implementation (PLDI'91). NewYork: ACM, 1991:30-44.
  • 7Lain M D, Rothberg E, Wolf M E. The cache performance and optimizations of blocked algorithms [C] //Proc of the 4th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 1991: 63-74.
  • 8Irigoin F, Triolet R. Supernode partitioning [C] //Proc of the 15th ACM SIGPLAN-SIGACT Syrup on Principles of Programming Languages ( POPL'88 ). New York: ACM, 1988:319-328.
  • 9Ancourt C, Irigoin F. Scanning polyhedra with DO loops [C] //Proc of the 3rd ACM SIGPLAN Syrup on Principles and Practice of Parallel Programming. New York: ACM, 1991: 39-50.
  • 10Xue Jingling. Loop Tiling for Parallelism [M]. Amsterdam, Netherlands: Kluwer Academic Publishers, 2000.

二级参考文献18

  • 1Qureshi M, Patt Y. Utility-based cache partitioning: A low- overhead, high performance, runtime mechanism to partition shared caches [C]//Proc of the 39th Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2006: 423-432.
  • 2Suh E, Rudolph L, Devadas S. Dynamic partitioning of shared cache memory [J]. The Journal of Supercomputing, 2004, 28(1): 7-26.
  • 3Cho S, Jin L. Managing distributed, shared L2 caches through OS-level page allocation [C] //Proe of the Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2006: 455- 468.
  • 4Tam D, Azimi R, Soares L, et al. Managing shared L2 caches on multicore systems in software [OL]. [2007-12- 17]. http://www, ideal, ece. ufl. edu/workshops/wiosea07/ Paper4. pdf.
  • 5Lin J, Lu Q, Ding X, et al. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems [C] //Proc of Int Syrup on High Performance Computer Architecture. Piseataway, NJ: IEEE, 2008: 367- 378.
  • 6Tam D, Azimi R, Soares L, et al. RapidMRC Approximating 1.2 miss rate curves on commodity systems for online optimizations [C] //Proc of lnt Conf on Architectural Support for Programming Languages &. Operating Systems. New York: ACM, 2009.- 121-132.
  • 7Zhao Q, Rabbah R, Amarasinghe S, et al. Ubiquitous memory introspection [C]//Proc of the Int Symp on Code Generation and Optimization. Piscataway, NJ: IEEE, 2007: 299-311.
  • 8Kim Y, Hill M, Wood D. Implementing stack simulation for highly associative memories [C] //Proe of ACM SIGMETRICS conf on Measurement and modeling of computer systems. New York: ACM, 1991:212-213.
  • 9Berg E, Hagersten E. Fast data-locality profiling of native execution [C] //Proc of the 2005 ACM S1GMETRICS Int Conf on Measurement and Modeling of Computer Systems. New York: ACM, 2005:169-180.
  • 10Sherwood T, Perelman E, Hamerly G, et al, Automatically characterizing large scale program behavior [C] //Proc of the Int Conf on Architectural Support for Programming Languages and Operating Systems. New York : ACM, 2002: 45-57.

共引文献2

同被引文献23

引证文献10

二级引证文献16

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部