期刊文献+

MPFFT:An Auto-Tuning FFT Library for OpenCL GPUs 被引量:10

MPFFT:An Auto-Tuning FFT Library for OpenCL GPUs
原文传递
导出
摘要 Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the cIAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two size. For 3D non-power-of-two FFTs, our library delivers 1.5x to 28x faster than FFTYV with 4 threads and 20.01x average speedup over CUFFT 4.0 on Tesla C2050. Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the cIAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two size. For 3D non-power-of-two FFTs, our library delivers 1.5x to 28x faster than FFTYV with 4 threads and 20.01x average speedup over CUFFT 4.0 on Tesla C2050.
出处 《Journal of Computer Science & Technology》 SCIE EI CSCD 2013年第1期90-105,共16页 计算机科学技术学报(英文版)
基金 This work is supported in partial by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902, 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.
关键词 fast Fourier transform GPU OPENCL AUTO-TUNING fast Fourier transform, GPU, OpenCL, auto-tuning
  • 相关文献

参考文献28

  • 1Duhamel P, Vetterli M. Fast fourier transforms: A tutorial review and a state of the art. Signal Processing, 1990, 9(14): 259-299.
  • 2Govindaraju N K, Lloyd B, Dotsenko Y, Smith B, Manferdelli J. High performance discrete Fourier transforms on graphics processors. In Proc. SC, Nov. 2008, Article No.2.
  • 3Nukada A, Matsuoka S. Auto-tuning 3-D FFT library for CUDA GPUs. In Proc. SC, Nov. 2009, Article No.30. Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K. Auto-tuning of fast Fourier transform on graphics processors. In Proc PPoPP, Feb. 2011, pp.257-266.
  • 4Gu L, Li X M, Siegel J. An empirically tu:ed 2D and 3D FFT library on CUDA GPU. In Proc. the 2:th ICS, June 2010, pp.305:314.
  • 5Gaster B, Howes L, Kaeli D R, Mistry P, $chaa D. Heteroge- neous Computing with OpenCL. San Fransisco, USA: Morgan Kaufmann: 2011.
  • 6Munshi A, Gaster B, Mattson T G, Fung J, Ginsburg D. OpenCL Programming Guide. Boston, USA: Addison-Wesley Professional. 2011.
  • 7Zhang E Z, Jiang Y L, Guo GPU applications on the fly: Z Y, Shen X P. Streamlining Thread divergence elimination through runtime thread-data remapping. In Proc. the 2.:th ICS, June 2010: pp.115-126.
  • 8Zhang E Z, Jiang Y L, Guo Z Y, Shen X P. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proc. the 24th ICS, June 2010, pp.115-126.
  • 9Yang Y, Xiang P, Kong J F, Zhou H Y. A GPGPU com- piler for memory optimization and parallelism management. In Proc. PLDI, June 2010, pp.86-97.
  • 10Cooley J W, Tukey J W. An algorithm for the machine cal- culation of complex Fourier series. Mathematics of Compu- tation, 1965, 19: 297-301.

同被引文献69

  • 1方志红,张长耀,俞根苗.利用逆序循环实现FFT运算中倒序算法的优化[J].信号处理,2004,20(5):533-535. 被引量:7
  • 2迟利华,刘杰,胡庆丰.数值并行计算可扩展性评价与测试[J].计算机研究与发展,2005,42(6):1073-1078. 被引量:10
  • 3Pease M C. An adaptation of the fast Fourier transform for parallel processing[J]. Journal of the ACM, 1968, 15 (2) : 252 - 264.
  • 4Linzer E N, Feig E. Implementation of efficient FFT algorithms on fused multiply-add architectures[ J ]. IEEE Transactions on Signal Processing, 1993, 41 ( 1 ) : 93 - 107.
  • 5Goedeeker S. Fast radix 2, 3,4, and 5 kernels for fast Fourier transformations on computers with overlapping multiply-add instructions[J]. SIAM Journal on Scientific Computing, 1997, 18(6) : 1605 -1611.
  • 6Kamer H, Auer M, Ueberhuber C W. Multiply-add optimized FFT kernels[ J]. Mathematical Models and Methods in Applied Sciences, 2001, 11 ( 1 ) : 105 - 117.
  • 7Voronenko Y, Puschel M. Mechanical derivation of fused multiply-add algorithms for linear transforms [ J ]. IEEE Transactions on Signal Processing, 2007, 55 ( 9 ) : 4458 - 4473.
  • 8Frigo M, Johnson S G. BenchFFT[EB/OL]. [2014 -03 - 15 ]. http ://www. fftw. org/benchfft/.
  • 9Lobeiras J, Amor M, Doallo R. Influence of memory access patterns to small-scale FFT performance [ J ]. Journal of Supercomputing, 2013, 64( 1 ) :120 - 131.
  • 10Cooley J W, Turkey J W. An algorithm for the machine calculation of complex Fourier series [ J ]. Mathematics of Computation, 1965, 19 : 297 - 301.

引证文献10

二级引证文献39

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部