
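General Implementation of 1-D FFT on the Sunway 26010 Processor
Original title: 申威26010众核处理器上一维FFT实现与优化 (Implementation and Optimization of 1-D FFT on the Sunway 26010 Many-Core Processor). Cited by: 2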
Abstract: Based on the characteristics of the Sunway 26010 many-core processor, this paper proposes a many-core parallel 1-D FFT algorithm built on a two-layer decomposition. The algorithm combines the iterative Stockham FFT framework with the Cooley-Tukey FFT algorithm to decompose a large-scale FFT into a series of small-scale FFTs. Performance is improved through platform-specific optimizations: a carefully designed task partitioning scheme, register communication, double buffering, and SIMD vectorization. In performance tests against the FFTW 3.3.4 library running on a single MPE (management processing element, the master core), the algorithm achieves an average speedup of 44.53x and a maximum speedup of 56.33x, with bandwidth utilization reaching up to 83.45%.
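The decomposition idea in the abstract can be illustrated with a minimal sketch of one Cooley-Tukey split, which rewrites a length n1*n2 transform as n2 small transforms of length n1, a twiddle-factor multiplication, and n1 small transforms of length n2. This is only an illustration under stated assumptions: the function names `dft` and `fft_split` are hypothetical, a naive DFT stands in for the small-FFT kernel, and nothing here reproduces the paper's two-layer scheme, Stockham data layout, register communication, double buffering, or SIMD vectorization.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; stands in for a small-FFT kernel (assumption)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_split(x, n1, n2):
    """One Cooley-Tukey split of a length n1*n2 DFT into small DFTs.

    Input index:  n = n2*j1 + r   (0 <= j1 < n1, 0 <= r < n2)
    Output index: k = k1 + n1*k2  (0 <= k1 < n1, 0 <= k2 < n2)
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 small DFTs of length n1 over the strided slices x[r::n2].
    inner = [dft(x[r::n2]) for r in range(n2)]  # inner[r][k1]
    # Step 2: multiply each entry by the twiddle factor W_n^(r*k1).
    for r in range(n2):
        for k1 in range(n1):
            inner[r][k1] *= cmath.exp(-2j * cmath.pi * r * k1 / n)
    # Step 3: n1 small DFTs of length n2 across r, one for each k1.
    out = [0j] * n
    for k1 in range(n1):
        col = dft([inner[r][k1] for r in range(n2)])  # col[k2]
        for k2 in range(n2):
            out[k1 + n1 * k2] = col[k2]
    return out
```

Calling `fft_split(x, 3, 4)` on a length-12 input should agree with `dft(x)` up to floating-point rounding. Applying the same split again to the small transforms would give a further level of decomposition; how the paper actually maps its two layers onto the processor is not stated in this record.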
Authors: ZHAO Yu-Wen (赵玉文); AO Yu-Long (敖玉龙); YANG Chao (杨超); LIU Fang-Fang (刘芳芳); YIN Wan-Wang (尹万旺); LIN Rong-Fen (林蓉芬)
Affiliations: Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
Source: Journal of Software (《软件学报》; indexed in EI, CSCD, and the Peking University Core Journals list), 2020, No. 10, pp. 3184-3196 (13 pages)
Funding: National Key Research and Development Program of China (2016YFB0200603); Beijing Natural Science Foundation (JQ18001).
Keywords: Sunway 26010 processor; 1-D FFT; two-layer decomposition; Cooley-Tukey; many-core parallel