Loop Tiling for Optimization of Locality and Parallelism

Cited by: 10


Abstract: Loop tiling is a widely used loop transformation for improving data locality and exposing parallelism on modern computer architectures. Tiling techniques fall into two main categories: fixed tiling and parameterized tiling. This paper systematically summarizes both categories and analyzes their advantages and disadvantages. Because the choice of tile size strongly affects the performance of tiled code, the paper also describes and analyzes the various methods for selecting an optimal tile size. In addition, it surveys the techniques that apply loop tiling to multi-level tiling, parallelism exploitation, and imperfectly nested loops. From this analysis of the current state of research, the following conclusions are drawn: 1) The problems of computational complexity and generated-code efficiency in loop tiling have not been fully solved, and how to use loop bounds to constrain the iteration space effectively and improve data locality needs further study. 2) Optimal tile size selection remains a difficult, open problem, and understanding how the tile size at each level of a hierarchical memory system affects performance is important. 3) From the application perspective, how to build a system that automatically generates effective tiled code for arbitrarily nested loops, while taking full advantage of shared hierarchical memory and multi-core architectures to achieve a high degree of parallelism, also requires further research.
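The transformation the abstract describes can be shown with a minimal sketch. It is not taken from the paper: the function names `matmul_naive` and `matmul_tiled` and the parameter `tile` are hypothetical, and matrix multiplication is only an assumed example kernel. The tile size here is a runtime argument, which corresponds to the parameterized style; replacing `tile` with a hard-coded constant would correspond to fixed tiling.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for list-of-lists matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with all three loops tiled.

    `tile` is an illustrative runtime parameter (assumption, not from the
    paper). The outer loops walk over tile origins; the inner loops stay
    inside one tile, so each block of A, B and C is reused while it is
    still likely to be in cache.
    """
    if tile < 1:
        raise ValueError("tile must be a positive integer")
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):              # tile loops
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # min(...) clamps partial tiles at the matrix edges.
                for i in range(ii, min(ii + tile, n)):      # point loops
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b, tile=2))
```

The tiled version performs the same multiply-add operations as the untiled one, only in a different order, so for integer inputs the two results are identical for any positive tile size. In pure Python the cache benefit is not observable; the sketch only shows the loop structure that a compiler or a C implementation would use.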

Authors: Liu Song (刘松), Wu Weiguo (伍卫国), Zhao Bo (赵博), Jiang Qing (蒋庆)

Affiliation: School of Electronic and Information Engineering, Xi'an Jiaotong University

Source: Journal of Computer Research and Development (计算机研究与发展), indexed in EI, CSCD, and the Peking University Core Journals list, 2015, No. 5, pp. 1160-1176 (17 pages)

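Funding: National Natural Science Foundation of China (91330117); National High Technology Research and Development Program of China ("863" Program) (2012AA01A306, 2012AA010901)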

Keywords: loop tiling; optimal tile size; program transformations; parallelism; performance optimization

Classification: TP314 [Automation and Computer Technology: Computer Software and Theory]


