MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs (cited by: 10)

Abstract: Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves performance comparable to that of the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library is 1.5x to 28x faster than FFTW with 4 threads and achieves a 20.01x average speedup over CUFFT 4.0 on Tesla C2050.

Authors: Yan Li, Yun-Quan Zhang, Yi-Qun Liu, Guo-Ping Long, Hai-Peng Jia

Affiliations: Institute of Software; Graduate University of Chinese Academy of Sciences; School of Information Science and Engineering

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, CSCD; 2013, No. 1, pp. 90-105 (16 pages)

Funding: This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, and 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902 and 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.

Keywords: fast Fourier transform, GPU, OpenCL, auto-tuning

Classification codes: TP334.7 [Automation and Computer Technology: Computer System Architecture]; TN911.72
