Abstract
Based on the characteristics of the Sunway 26010 many-core processor, this paper proposes a many-core parallel algorithm for the 1-D FFT built on a two-layer decomposition. The algorithm is based on the iterative Stockham FFT framework and the Cooley-Tukey FFT algorithm, and decomposes a large-scale FFT into a series of small-scale FFTs. It improves FFT performance through platform-specific optimizations: a reasonable task partitioning scheme, register communication, double buffering, and SIMD vectorization. Finally, the performance of the proposed algorithm is tested. Compared with the FFTW 3.3.4 library running on a single MPE (main core), it achieves an average speedup of 44.53x and a maximum speedup of 56.33x, with a maximum bandwidth utilization of 83.45%.
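The idea of decomposing a large FFT into a series of small FFTs can be illustrated with a minimal Python sketch of a single Cooley-Tukey factorization N = n1 * n2. This is not the authors' implementation: the function names `naive_dft` and `decomposed_fft` are hypothetical, the small transforms use a plain O(n^2) DFT rather than a Stockham kernel, and none of the Sunway-specific optimizations (task partitioning, register communication, double buffering, SIMD) are modeled.

```python
import cmath


def naive_dft(x):
    """Reference O(n^2) DFT, used here as the 'small FFT' kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def decomposed_fft(x, n1, n2):
    """Length n1*n2 DFT via one Cooley-Tukey step.

    Input index  j = j1 + n1*j2, output index k = n2*k1 + k2.
    """
    n = n1 * n2
    assert len(x) == n

    # Step 1: n1 small DFTs of length n2 over the strided subsequences.
    inner = [naive_dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]

    # Step 2: multiply by the twiddle factors W_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            inner[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Step 3: n2 small DFTs of length n1 across the j1 index.
    out = [0j] * n
    for k2 in range(n2):
        col = naive_dft([inner[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = col[k1]
    return out
```

Applying the same factorization again inside each small transform would give a second layer of decomposition; how the paper maps the two layers onto the processor is not specified in the abstract.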
Authors
ZHAO Yu-Wen
AO Yu-Long
YANG Chao
LIU Fang-Fang
YIN Wan-Wang
LIN Rong-Fen
Affiliations: Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
Source
《软件学报》
EI
CSCD
Peking University Core Journals (北大核心)
2020, Issue 10, pp. 3184-3196 (13 pages)
Journal of Software
Funding
National Key Research and Development Program of China (2016YFB0200603)
Beijing Natural Science Foundation (JQ18001)