Loop-Nest Auto-Vectorization Based on SLP (面向SLP的多重循环向量化)

Cited by: 13


Abstract: More and more processors integrate SIMD (single instruction multiple data) extensions, and most existing compilers implement automatic vectorization. However, they generally vectorize only the innermost loop, and no general, easy-to-apply vectorization method exists for loop nests. This paper proposes an SLP (superword level parallelism) oriented vectorization method for loop nests that analyzes each loop level in turn, from outer to inner. Dependence analysis first determines whether direct unroll-and-jam can be applied to a loop. The method then collects, for each loop level, the attribute values that influence vectorization performance: whether direct unroll-and-jam can be applied to the loop, how many array references are contiguous with respect to the loop index, and the region the loop covers. Based on these values, an aggressive algorithm decides at which loop levels to apply direct unroll-and-jam, and the statements in the loop are finally vectorized with the SLP algorithm to generate SIMD code. Experimental results on an Intel platform, using numerical, video, and communication kernels, show that the approach improves the average speedup by factors of 2.13 and 1.41 over innermost-loop vectorization and simple outer-loop vectorization, respectively, and that speedups of up to 5.3 are reached on some common kernels.
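The effect of unroll-and-jam followed by SLP can be illustrated with a small loop nest. The C sketch below is not taken from the paper: the function names, the array sizes `N` and `M`, and the unroll factor of 4 (four `float` values in a 128-bit SIMD register) are assumptions chosen for illustration. The inner `j` loop carries a dependence through `a[j-1][i]`, so innermost-loop vectorization does not apply, while the references are contiguous with respect to the outer index `i`. Unrolling the outer loop and jamming the copies into the inner body yields four isomorphic statements on adjacent memory locations, which is the pattern an SLP pass looks for.

```c
#include <assert.h>
#include <string.h>

#define N 8   /* outer trip count (columns); hypothetical size */
#define M 5   /* inner trip count (rows); hypothetical size */

/* Original nest: iteration j reads a[j-1][i], written by iteration j-1,
   so the inner loop carries a dependence and cannot be vectorized alone. */
static void column_prefix_scalar(float a[M][N], float b[M][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 1; j < M; j++)
            a[j][i] = a[j - 1][i] + b[j][i];
}

/* Outer loop unrolled by 4 and jammed into the inner loop body.
   The transformation is legal here because the only dependence has
   distance (0, 1) in (i, j), so no dependence is reversed. */
static void column_prefix_unroll_jam(float a[M][N], float b[M][N]) {
    assert(N % 4 == 0);  /* this sketch has no remainder loop */
    for (int i = 0; i < N; i += 4)
        for (int j = 1; j < M; j++) {
            /* Four isomorphic statements on adjacent addresses: an SLP
               pass could pack them into one vector load/add/store. */
            a[j][i + 0] = a[j - 1][i + 0] + b[j][i + 0];
            a[j][i + 1] = a[j - 1][i + 1] + b[j][i + 1];
            a[j][i + 2] = a[j - 1][i + 2] + b[j][i + 2];
            a[j][i + 3] = a[j - 1][i + 3] + b[j][i + 3];
        }
}

/* Returns 1 when both versions produce bit-identical results. */
int unroll_jam_matches_scalar(void) {
    float a1[M][N], a2[M][N], b[M][N];
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++) {
            a1[j][i] = a2[j][i] = (float)(i + 1);
            b[j][i] = (float)(j * N + i);
        }
    column_prefix_scalar(a1, b);
    column_prefix_unroll_jam(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Each element is computed by the same additions in the same order in both versions, so the results agree exactly. The sketch stops at the scalar unrolled form; whether a given compiler's SLP pass actually packs the four statements depends on its alignment and aliasing analysis.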

Authors: Wei Shuai (魏帅), Zhao Rongcai (赵荣彩), Yao Yuan (姚远)

Affiliation: School of Information Engineering, PLA Information Engineering University

Source: Journal of Software (软件学报), 2012, No. 7, pp. 1717-1728 (12 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

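Funding: National High Technology Research and Development Program of China (863 Program) (2009AA012201); "Core Electronic Devices, High-End General Chips and Basic Software Products" (核高基) National Science and Technology Major Project (2009ZX01036)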

Keywords: SIMD (single instruction multiple data); vectorization; data dependence analysis; nested loop; SLP (superword level parallelism)

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


