
面向SLP的多重循环向量化 (Cited by: 13)

Loop-Nest Auto-Vectorization Based on SLP
Abstract: A growing number of processors integrate SIMD (single instruction multiple data) extensions, and most existing compilers implement automatic vectorization. However, they generally vectorize only the innermost loop, and there is no general, easy-to-apply vectorization method for loop nests. This paper proposes an SLP (superword level parallelism) oriented vectorization method for nested loops. It analyzes each loop level in turn, from the outermost to the innermost, and collects for each level the attribute values that influence vectorization performance: whether direct unroll-and-jam can be applied to the loop (determined through dependence analysis), how many array references are contiguous with respect to the loop index, and the region the loop contains. Based on these values, an aggressive algorithm decides at which loop levels to apply direct unroll-and-jam, and the statements in the loop are then vectorized with the SLP algorithm to generate SIMD code. Experimental results on an Intel platform, using numerical, video, and communication kernels, show that the approach improves the average speedup by 2.13 and 1.41 relative to innermost-loop vectorization and simple outer-loop vectorization, respectively, and that the speedup on some common kernels reaches 5.3.
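The flow described in the abstract (collect per-loop attributes, choose a loop level, apply unroll-and-jam, then hand isomorphic statements to SLP) can be illustrated with a small sketch. This is not the paper's implementation. The kernel, the unroll factor `UNROLL`, the reference table `REFS`, and the helper `contiguous_refs` are all hypothetical names chosen for illustration. The contiguity check assumes row-major arrays and unit-stride subscripts, and the dependence analysis that decides whether unroll-and-jam is legal is not modeled (the outer iterations of this kernel are independent by construction).

```python
UNROLL = 4  # assumed number of vector lanes (e.g. four 32-bit elements)

# Array references of the kernel y[i] += a[j][i] * x[j], written as
# (array name, subscript index variables). Hypothetical representation.
REFS = [("y", ("i",)), ("a", ("j", "i")), ("x", ("j",))]


def contiguous_refs(refs, index):
    """Count references whose last (fastest-varying) subscript is `index`
    and whose other subscripts do not use it (row-major layout assumed)."""
    return sum(1 for _, subs in refs
               if subs[-1] == index and index not in subs[:-1])


def kernel_reference(a, x, n, m):
    """Original loop nest: the inner j loop walks a[j][i] with a stride."""
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[j][i] * x[j]
    return y


def kernel_unroll_and_jam(a, x, n, m):
    """Outer loop i unrolled by UNROLL, with the copies jammed into one
    inner loop. The four statements are isomorphic and touch adjacent
    elements, which is the pattern an SLP pass can pack into SIMD ops."""
    y = [0] * n
    main = n - n % UNROLL
    for i in range(0, main, UNROLL):
        for j in range(m):
            y[i + 0] += a[j][i + 0] * x[j]
            y[i + 1] += a[j][i + 1] * x[j]
            y[i + 2] += a[j][i + 2] * x[j]
            y[i + 3] += a[j][i + 3] * x[j]
    # Remainder iterations that do not fill a whole group of UNROLL.
    for i in range(main, n):
        for j in range(m):
            y[i] += a[j][i] * x[j]
    return y
```

In this sketch, more references are contiguous with respect to the outer index `i` than the inner index `j`, which is why the outer loop is the one selected for unroll-and-jam. A real compiler pass would combine this count with the legality and loop-region attributes the abstract mentions.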
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2012, No. 7, pp. 1717-1728 (12 pages)
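Funding: National High Technology Research and Development Program of China (863 Program) (2009AA012201); National Science and Technology Major Project "Core Electronic Devices, High-End Generic Chips and Basic Software" (核高基) (2009ZX01036)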
Keywords: SIMD (single instruction multiple data); vectorization; data dependence analysis; nested loop; SLP (superword level parallelism)


