An Automatic Performance Tuning Framework for FFT on Heterogenous Platforms


Abstract: The fast Fourier transform (FFT) is widely used in science and engineering, particularly in signal processing, image processing, and the solution of partial differential equations. This paper proposes MPFFT (massively parallel FFT), an automatic performance tuning framework for heterogeneous platforms built on graphics processing units (GPUs) and accelerated processing units (APUs). MPFFT adapts at two levels: installation time and runtime. At installation time, a code generator produces a library of FFT codelets of arbitrary size that GPU kernels call. At runtime, auto-tuning directs the code generator to produce highly optimized GPU kernel code. Experimental results show that MPFFT substantially outperforms the clAmdFft library on both AMD GPUs and APUs. For 1D, 2D, and 3D FFTs, the average speedups over clAmdFft 1.6 are 3.45, 15.20, and 4.47 on the AMD APU A-360, and 1.75, 3.01, and 1.69 on the AMD HD7970 GPU. On the NVIDIA Tesla C2050 GPU, MPFFT is comparable to the CUFFT library: its overall performance reaches 93% of CUFFT 4.1, with a maximum speedup of 1.28.
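The two-level adaptation described in the abstract can be illustrated with a minimal sketch. It is not the MPFFT implementation: it runs in pure Python on the CPU rather than generating GPU kernels, and the names `generate_codelet`, `fft`, and `tune`, as well as the choice of leaf size as the only tuning parameter, are assumptions made for illustration. The generator emits straight-line source for a DFT of a given size (the installation-time step), and the tuner times each candidate and keeps the fastest (the runtime step).

```python
import cmath
import time


def generate_codelet(n):
    """Installation-time step: emit and compile an unrolled size-n DFT."""
    lines = [f"def dft{n}(x):", "    return ["]
    for k in range(n):
        terms = []
        for j in range(n):
            # Twiddle factors are baked into the generated source as constants.
            w = cmath.exp(-2j * cmath.pi * j * k / n)
            terms.append(f"x[{j}] * complex({w.real!r}, {w.imag!r})")
        lines.append("        " + " + ".join(terms) + ",")
    lines.append("    ]")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace[f"dft{n}"]


def fft(x, leaf, codelets):
    """Radix-2 decimation-in-time FFT that bottoms out in a codelet.

    Assumes len(x) and leaf are powers of two with leaf <= len(x).
    """
    n = len(x)
    if n <= leaf:
        return codelets[n](x)
    even = fft(x[0::2], leaf, codelets)
    odd = fft(x[1::2], leaf, codelets)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def tune(n, candidates, codelets, repeats=5):
    """Runtime step: time each candidate leaf size and return the fastest."""
    x = [complex(i, -i) for i in range(n)]
    best_leaf, best_time = None, float("inf")
    for leaf in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            fft(x, leaf, codelets)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_leaf, best_time = leaf, elapsed
    return best_leaf


# Generate a small codelet library once, then tune for one problem size.
CODELETS = {n: generate_codelet(n) for n in (2, 4, 8)}
best = tune(64, (2, 4, 8), CODELETS)
print("fastest leaf size for n=64:", best)
```

The selected leaf size depends on the machine and on timing noise, so the printed value will vary between runs. A real tuner for GPUs would search a larger space than this sketch does.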

Authors: Li Yan (李焱), Zhang Yunquan (张云泉)

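Affiliations: Laboratory of Parallel Software and Computational Science (Institute of Software, Chinese Academy of Sciences); University of Chinese Academy of Sciences; State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences)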

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, Issue 3, pp. 637-649 (13 pages)

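Funding: National Natural Science Foundation of China (61221062); National High Technology Research and Development Program of China (863 Program) (2012AA010902, 2012AA010903); Chinese Academy of Sciences Special Fund for Graduate Science and Technology Innovation and Social Practice (11000GBF01)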

Keywords: fast Fourier transform (FFT); auto-tuning performance; accelerated processing unit (APU); graphics processing unit (GPU); heterogeneous

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


