Abstract
Motivated by the high computing performance and advanced parallel architecture of GPUs, this paper studies a parallel design method that maps a parallel FFT algorithm onto the CUDA model. The design follows the main optimization guidelines for GPU platforms: reducing global memory accesses inside kernel functions, coalescing global memory accesses, using shared memory efficiently, and keeping computational intensity high. A large-point one-dimensional FFT is implemented on an NVIDIA Tesla C2075 GPU, which is based on the NVIDIA Fermi architecture. Experimental results show that the method is feasible and efficient: for transform sizes up to 256K points it outperforms the CUFFT library, with a speedup of up to 2.1 times over CUFFT 4.0.
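The abstract does not spell out how a large transform is divided into pieces that fit in shared memory. One common way to do this, shown here only as an illustrative assumption and not as the authors' actual design, is the Cooley-Tukey factorization N = n1 * n2: a batch of small transforms, a twiddle-factor multiplication, and a second batch of small transforms. On a GPU, each small transform could be handled by one thread block working in shared memory, so global memory is read and written only between stages. The names `dft` and `fft_split` are hypothetical, and plain Python stands in for CUDA kernels so that the sketch can be checked against a direct DFT.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT; stands in for a small transform done in shared memory."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_split(x, n1, n2):
    """DFT of length n1 * n2 built from small DFTs (hypothetical helper).

    Input index:  j = j1 * n2 + j2
    Output index: k = k1 + n1 * k2
    """
    n = n1 * n2
    assert len(x) == n

    # Stage 1: n2 independent DFTs of length n1 over the strided
    # sub-sequences x[j2], x[j2 + n2], x[j2 + 2 * n2], ...
    cols = [dft(x[j2::n2]) for j2 in range(n2)]

    # Stage 2: multiply by the twiddle factors exp(-2*pi*i * j2 * k1 / n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Stage 3: n1 independent DFTs of length n2, scattered to k1 + n1 * k2.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


if __name__ == "__main__":
    signal = [complex((3 * i) % 7, i % 4) for i in range(32)]
    reference = dft(signal)
    result = fft_split(signal, 4, 8)
    print(max(abs(a - b) for a, b in zip(reference, result)))
```

In this sketch the sub-transforms within each stage are independent of one another, which is the property that makes this kind of factorization a natural fit for block-level parallelism. The stage-1 reads are strided, so a real kernel would have to arrange its loads so that neighbouring threads touch neighbouring addresses in order to keep global memory accesses coalesced.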
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
Peking University Core Journals (北大核心)
2013, No. 11, pp. 34-41 (8 pages)