Implementation and Optimization of 1-D FFT on the Sunway 26010 Many-Core Processor (cited by: 2)

General Implementation of 1-D FFT on the Sunway 26010 Processor

Abstract: Based on the characteristics of the Sunway 26010 many-core processor, this paper proposes a many-core parallel 1-D FFT algorithm built on a two-layer decomposition. The algorithm combines the iterative Stockham FFT framework with the Cooley-Tukey FFT algorithm and decomposes a large-scale FFT into a series of small-scale FFTs. Performance is improved through platform-specific optimizations: a well-designed task partitioning scheme, register communication, double buffering, and SIMD vectorization. Finally, the performance of the proposed algorithm is tested. Compared with the FFTW 3.3.4 library running on a single management processing element (MPE), it achieves an average speedup of 44.53x and a maximum speedup of 56.33x, with bandwidth utilization reaching up to 83.45%.
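The two ingredients named in the abstract, an iterative Stockham kernel and a Cooley-Tukey split of a large FFT into small FFTs, can be illustrated with a short plain-Python sketch. This is not the authors' implementation: the function names `stockham_fft`, `two_factor_fft`, and `naive_dft` are hypothetical, a single-level split N = N1 × N2 with power-of-two factors is assumed, and none of the Sunway-specific optimizations (task partitioning, register communication, double buffering, SIMD) are modeled.

```python
import cmath


def stockham_fft(x):
    """Iterative radix-2 Stockham autosort FFT (length must be a power of two).

    Each pass reads from one buffer and writes to the other, so the output
    ends up in natural order without a separate bit-reversal step.
    """
    total = len(x)
    assert total > 0 and total & (total - 1) == 0, "length must be a power of two"
    src, dst = list(x), [0j] * total
    n, stride = total, 1
    while n > 1:
        half = n // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / n)  # twiddle factor for this butterfly group
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # ping-pong the two buffers
        n, stride = half, stride * 2
    return src


def two_factor_fft(x, n1, n2):
    """Cooley-Tukey split of an (n1*n2)-point FFT into small FFTs.

    Input index is n2*i + j, output index is k1 + n1*k2.
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 independent n1-point FFTs over the strided sub-sequences.
    cols = [stockham_fft(x[j::n2]) for j in range(n2)]
    # Step 2: multiply by the inter-stage twiddle factors W_N^(j*k1).
    for j in range(n2):
        for k1 in range(n1):
            cols[j][k1] *= cmath.exp(-2j * cmath.pi * j * k1 / n)
    # Step 3: n1 independent n2-point FFTs across the sub-sequences.
    out = [0j] * n
    for k1 in range(n1):
        row = stockham_fft([cols[j][k1] for j in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


def naive_dft(x):
    """O(N^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [complex(i % 7, -0.5 * (i % 5)) for i in range(64)]
    fast = two_factor_fft(data, 8, 8)
    ref = naive_dft(data)
    print(max(abs(a - b) for a, b in zip(fast, ref)))  # small rounding error
```

The small FFTs in steps 1 and 3 are independent of one another, which is the property that makes a decomposition of this kind a natural fit for distributing work across many cores.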

Authors: ZHAO Yu-Wen (赵玉文); AO Yu-Long (敖玉龙); YANG Chao (杨超); LIU Fang-Fang (刘芳芳); YIN Wan-Wang (尹万旺); LIN Rong-Fen (林蓉芬) (Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China)

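Institutions: Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences; School of Mathematical Sciences, Peking University; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences); University of Chinese Academy of Sciences; National Research Center of Parallel Computer Engineering and Technology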

Source: Journal of Software (《软件学报》; indexed in EI, CSCD, and the Peking University Core Journals list), 2020, No. 10, pp. 3184-3196 (13 pages)

Funding: National Key R&D Program of China (2016YFB0200603); Beijing Natural Science Foundation (JQ18001).

Keywords: Sunway 26010 processor; 1-D FFT; two-layer decomposition; Cooley-Tukey; many-core parallel

Classification number: TP301 [Automation and Computer Technology: Computer System Architecture]
