
Software/Hardware Co-Design for 1-D FFT Optimization on Many-Core Architecture (基于软硬件的协同支持在众核上对1-DFFT算法的优化研究)

Cited by: 9
Abstract As demand for high-performance computing grows, on-chip many-core processors are becoming the direction of future processor architecture. The fast Fourier transform (FFT), one of the most important high-performance computing applications, places heavy demands on both compute capability and communication bandwidth. Implementing an efficient, scalable FFT on a many-core processor is therefore a challenge shared by algorithm and architecture designers. This paper optimizes and evaluates a 1-D FFT algorithm on the Godson-T many-core processor. While saving almost one third of the L2 Cache storage overhead, optimizations such as hiding the matrix transpose and overlapping computation with communication give the optimized 1-D FFT a performance improvement of more than 3x. An experimental analysis of on-chip network congestion further shows that, for memory-bandwidth-bound applications such as FFT, increasing the access bandwidth of the L2 Cache can relieve the pressure that bursty reads and writes place on the on-chip network and the L2 Cache, which further improves program performance and scalability.
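The abstract does not spell out which FFT decomposition is used, so the following is only a generic illustration of where matrix transposes arise in a blocked 1-D FFT. It is a minimal sketch of the standard four-step formulation (commonly associated with Bailey's hierarchical-memory FFT, which appears in the reference list), not the authors' Godson-T implementation. The function names `dft` and `four_step_fft` and the parameters `rows` and `cols` are hypothetical, and a naive DFT stands in for the short sub-transforms to keep the example self-contained.

```python
import cmath

def dft(seq):
    """Naive O(n^2) DFT, used here only as the short sub-transform."""
    n = len(seq)
    return [sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, rows, cols):
    """Length rows*cols DFT of x via the four-step decomposition (illustrative)."""
    n = rows * cols
    if len(x) != n:
        raise ValueError("len(x) must equal rows * cols")
    # View x as a rows x cols matrix in row-major order: a[r][c] = x[r*cols + c].
    # Step 1: length-`rows` DFT down each column (a strided, transpose-like access).
    b = [[0j] * cols for _ in range(rows)]
    for c in range(cols):
        col = dft([x[r * cols + c] for r in range(rows)])
        for k1 in range(rows):
            b[k1][c] = col[k1]
    # Step 2: multiply element (k1, c) by the twiddle factor exp(-2*pi*i*k1*c/n).
    for k1 in range(rows):
        for c in range(cols):
            b[k1][c] *= cmath.exp(-2j * cmath.pi * k1 * c / n)
    # Step 3: length-`cols` DFT along each row (contiguous access).
    d = [dft(row) for row in b]
    # Step 4: transposed read-out, X[k1 + rows*k2] = d[k1][k2].
    return [d[k % rows][k // rows] for k in range(n)]

if __name__ == "__main__":
    signal = [complex(i % 5, -(i % 3)) for i in range(24)]
    result = four_step_fft(signal, 4, 6)
    error = max(abs(a - b) for a, b in zip(result, dft(signal)))
    print("matches direct DFT:", error < 1e-9)
```

The column gather in step 1 and the transposed read-out in step 4 are the kind of data movement that an optimized implementation would try to fold into other phases or overlap with computation.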
Source Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2008, Issue 11, pp. 2005-2014 (10 pages)
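Funding Supported by the National Basic Research Program of China ("973" Program) (2005CB321600) and the Key Program of the National Natural Science Foundation of China (60736012).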
Keywords many-core; Godson-T; fast Fourier transform; computation/communication overlapping

References (8)

  • 1. Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 1965, 19(90): 297-301
  • 2. Frigo M, Johnson S G. The design and implementation of FFTW3. Proceedings of the IEEE, 2005, 93(2): 216-231
  • 3. Williams Samuel, Shalf John, Oliker Leonid, Kamil Shoaib, Husbands Parry, Yelick Katherine. Scientific computing kernels on the Cell processor. International Journal of Parallel Programming, 2007, 35(3): 263-298
  • 4. Govindaraju Naga K, Larsen Scott, Gray Jim, Manocha Dinesh. A memory model for scientific algorithms on graphics processors // Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. Tampa, Florida, 2006
  • 5. Chen Long, Hu Ziang, Lin Jun-Min, Gao Guang R. Optimizing fast Fourier transform on a multi-core architecture // Proceedings of the IEEE International Parallel and Distributed Processing Symposium. California, USA, 2007: 499
  • 6. Bailey D H. FFTs in external or hierarchical memory. Journal of Supercomputing, 1990, 4(1): 23-35
  • 7. Woo Steven Cameron, Ohara Moriyoshi, Torrie Evan, Singh Jaswinder Pal, Gupta Anoop. The SPLASH-2 programs: Characterization and methodological considerations // Proceedings of the 22nd International Symposium on Computer Architecture. Santa Margherita Ligure, Italy, 1995: 24-36
  • 8. Iftode Liviu, Singh Jaswinder Pal, Li Kai. Scope consistency: A bridge between release consistency and entry consistency // Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures. Padua, Italy, 1996: 277-287

Co-cited documents (81)

  • 1. LI Hao, XIE Lunguo. Research on last-level cache optimization techniques for chip multiprocessors [J]. Journal of Computer Research and Development, 2012, 49(S1): 172-179. Cited by: 6
  • 2. WU Enhua. Techniques, current status, and challenges of general-purpose computing on graphics processors [J]. Journal of Software, 2004, 15(10): 1493-1504. Cited by: 141
  • 3. SU Tao, ZHUANG Dejing. Improvement and implementation of a large-point FFT algorithm [J]. Modern Radar, 2005, 27(7): 23-26. Cited by: 8
  • 4. WANG Zhigang, LI Xi, ZHOU Xuehai, YU Jie. Research on retargetable simulation techniques for application-specific instruction set processors (ASIP) [J]. Journal of System Simulation, 2007, 19(6): 1249-1255. Cited by: 1
  • 5. NVIDIA Corporation. NVIDIA CUDA programming guide [EB/OL]. [2010-07-15]. http://www.nvidia.com/object/cuda_homenew.html.
  • 6. YANG Yang, RAADT K V, CASANOVA H. Multiround algorithms for scheduling divisible loads [J]. IEEE Transactions on Parallel and Distributed Systems, 2005, 16(11): 1092-1102.
  • 7. TAO Yongcai, JIN Hai, WU Song, et al. Adaptive multi-round scheduling strategy for divisible workloads in grid environments [C] // Proceedings of the 23rd International Conference on Information Networking. New York, USA: ACM, 2009: 260-264.
  • 8. SHET G A, SADAYAPPAN P, BERNHOLDT E D, et al. A framework for characterizing overlap of communication and computation in parallel applications [J]. Cluster Computing, 2008, 11(1): 75-90.
  • 9. ANTHONY D, LORI P, MARTIN S. MPI-aware compiler optimizations for improving communication-computation overlap [C] // Proceedings of the 23rd International Conference on Supercomputing. New York, USA: ACM, 2009: 316-325.
  • 10. Asanovic K, Bodik R, Catanzaro B C. The landscape of parallel computing research: a view from Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html, 2006.

Citing documents: 9

Second-level citing documents: 26
