
An Automatic Performance Tuning Framework for FFT on Heterogeneous Platforms (cited by: 9)
Abstract: The fast Fourier transform (FFT) is an important computational kernel in scientific and engineering computing, with broad applications in signal processing, image processing, and the solution of partial differential equations. This paper proposes MPFFT (massively parallel FFT), an automatic performance tuning framework for heterogeneous platforms based on GPUs (graphic processing units) and APUs (accelerated processing units). MPFFT adapts at two levels: installation time and runtime. At installation time, a code generator produces a library of FFT codelets of arbitrary size that are called by GPU kernels. At runtime, auto-tuning techniques guide the code generator to produce highly optimized GPU kernel code. Experimental results show that MPFFT substantially outperforms the clAmdFft library on both the AMD GPU and the APU. For 1D, 2D, and 3D FFTs, the average speedups of MPFFT over clAmdFft 1.6 are 3.45, 15.20, and 4.47 on the AMD APU A-360, and 1.75, 3.01, and 1.69 on the AMD HD7970 GPU. On the NVIDIA Tesla C2050 GPU, MPFFT performs comparably to the CUFFT library: its overall performance reaches 93% of CUFFT 4.1, and its maximum speedup is 1.28.
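Authors: Li Yan (李焱), Zhang Yunquan (张云泉)
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, No. 3, pp. 637-649 (13 pages)
Funding: National Natural Science Foundation of China (61221062); National High Technology Research and Development Program of China ("863" Program) (2012AA010902, 2012AA010903); Chinese Academy of Sciences special fund for graduate science and technology innovation and social practice (11000GBF01)
Keywords: fast Fourier transform (FFT); auto-tuning performance; accelerated processing unit (APU); graphic processing unit (GPU); heterogeneous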
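
The abstract describes runtime auto-tuning only in outline. As a rough illustration of the general idea, not of MPFFT itself, the sketch below times several variants of a CPU-side FFT and caches the fastest one per transform length. Everything in it is an assumption made for illustration: the function names (`dft_codelet`, `fft`, `tune`, `planned_fft`), the single tuning parameter `leaf` (the size at which recursion hands over to a direct DFT standing in for a generated codelet), and the candidate values. The actual framework generates and tunes GPU kernels, which this pure-Python sketch does not attempt.

```python
import cmath
import time


def dft_codelet(x):
    """Direct O(n^2) DFT; a stand-in for a generated small-size codelet."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft(x, leaf):
    """Radix-2 decimation-in-time FFT that calls the codelet once n <= leaf."""
    n = len(x)
    if n <= leaf or n % 2:
        return dft_codelet(x)
    even = fft(x[0::2], leaf)
    odd = fft(x[1::2], leaf)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def tune(n, candidates=(1, 2, 4, 8, 16), repeats=3):
    """Time each candidate leaf size on a length-n input; return the fastest."""
    x = [complex(i % 7, -(i % 3)) for i in range(n)]
    best_leaf, best_time = None, float("inf")
    for leaf in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fft(x, leaf)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_leaf, best_time = leaf, elapsed
    return best_leaf


plans = {}  # hypothetical plan cache: transform length -> tuned leaf size


def planned_fft(x):
    """Tune once per transform length, then reuse the cached choice."""
    n = len(x)
    if n not in plans:
        plans[n] = tune(n)
    return fft(x, plans[n])
```

Calling `planned_fft` on a length-16 input triggers one tuning pass and stores the winning leaf size in `plans[16]`; later calls of the same length skip the timing step. Which leaf size wins depends on the machine, which is the point of measuring rather than fixing the choice in advance.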

