期刊文献+

An FFT Performance Model for Optimizing General-Purpose Processor Architecture

An FFT Performance Model for Optimizing General-Purpose Processor Architecture
原文传递
导出
摘要 General-purpose processor (GPP) is an important platform for fast Fourier transform (FFT),due to its flexibility,reliability and practicality.FFT is a representative application intensive in both computation and memory access,optimizing the FFT performance of a GPP also benefits the performances of many other applications.To facilitate the analysis of FFT,this paper proposes a theoretical model of the FFT processing.The model gives out a tight lower bound of the runtime of FFT on a GPP,and guides the architecture optimization for GPP as well.Based on the model,two theorems on optimization of architecture parameters are deduced,which refer to the lower bounds of register number and memory bandwidth.Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model.The above investigations were adopted in the development of Godson-3B,which is an industrial GPP.The optimization techniques deduced from our performance model improve the FFT performance by about 40%,while incurring only 0.8% additional area cost.Consequently,Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 Watt power consumption,and has the highest performance-per-watt in complex FFT among processors as far as we know.This work could benefit optimization of other GPPs as well. General-purpose processor (GPP) is an important platform for fast Fourier transform (FFT),due to its flexibility,reliability and practicality.FFT is a representative application intensive in both computation and memory access,optimizing the FFT performance of a GPP also benefits the performances of many other applications.To facilitate the analysis of FFT,this paper proposes a theoretical model of the FFT processing.The model gives out a tight lower bound of the runtime of FFT on a GPP,and guides the architecture optimization for GPP as well.Based on the model,two theorems on optimization of architecture parameters are deduced,which refer to the lower bounds of register number and memory bandwidth.Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model.The above investigations were adopted in the development of Godson-3B,which is an industrial GPP.The optimization techniques deduced from our performance model improve the FFT performance by about 40%,while incurring only 0.8% additional area cost.Consequently,Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 Watt power consumption,and has the highest performance-per-watt in complex FFT among processors as far as we know.This work could benefit optimization of other GPPs as well.
出处 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第5期875-889,共15页 计算机科学技术学报(英文版)
基金 supported by the National Science and Technology Major Project under Grant Nos.2009ZX01028-002-003,2009ZX01029-001-003,2010ZX01036-001-002 the National Natural Science Foundation of China under Grant Nos.61050002,61003064,60921002
关键词 fast Fourier transform (FFT) general-purpose processor (GPP) performance prediction model vector unit DMA fast Fourier transform (FFT) general-purpose processor (GPP) performance prediction model vector unit DMA
  • 相关文献

参考文献27

  • 1Frigo M, Johnson S. The design and implementation of FFTW3. Proceedings of the IEEE, Feb. 2005, 93(2): 216- 231.
  • 2Franchetti F, Piischel M, Voronenko Y, Chellappa S, Moura J M F. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, 2009, 26(6): 90-102.
  • 3Li Y, Zhao L, Lin H, Chow A C, Diamond J R. A performance model for fast Fourier transform. In Proc. the 23rd International Symposium on Parallel and Distributed Processing, Rome, Italy, May 23-29, 2009, pp.1-11.
  • 4Fraguela B B, Voronenko Y, Puschel M. Automatic tuning of discrete Fourier transforms driven by analytical modeling. In Proc. the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT2009). Raleigh, USA, Sept. 12-16, 2009, pp.271-280.
  • 5Norton A, Silberger A J. Parallelization and performance analysis of the Cooley-Tukey FFT algorithm for shared-memory architectures. IEEE Transactions on Computers, 1987, C-36(5): 581-591.
  • 6Cvetanovic Z. Performance analysis of the FFT algorithm on a shaved-memory parallel architecture. IBM Journal of Research and Development, 1987, 31(4): 435-451.
  • 7Gu L, Li X. DFT performance prediction in FFTW. In Proc. the 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC), Newark, USA, Oct. 8-10, 2009.
  • 8Pagiamtzis K, Kulak P G. Empirical performance prediction for IFFT/FFT cores for OFDM systems-on-a-chip. In Proc. the 45th MWSCAS, Tulsa, USA, Aug. 4-7, 2002.
  • 9Singer B, Veloso M. Learning to construct fast signal processing implementations. Journal of Machine Learning Research, 2003, 3: 887-919.
  • 10Sepiashvili D. Performance models and search methods for optimal FFT implementations. [Master's Thesis]. Carnegie Mellon University, 2000.

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部