Abstract
Standard parallel algorithms are difficult to run efficiently on graphics processing units (GPUs). Taking the reduction (cumulative sum) algorithm as an example, this paper introduces four optimization methods for NVIDIA GPUs based on the compute unified device architecture (CUDA): instruction optimization, shared memory bank conflict avoidance, loop unrolling, and thread workload optimization. Experimental results show that parallel optimization effectively improves the algorithm's execution efficiency on the GPU: the optimized reduction algorithm runs about 34 times faster than the standard parallel algorithm and about 70 times faster than the serial CPU implementation.
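The abstract does not reproduce the paper's kernels. The following minimal CUDA sketch only illustrates the four named techniques under common assumptions: a power-of-two block size of at least 64, "thread workload" interpreted as each thread accumulating several input elements before the shared-memory phase (standard CUDA reduction practice), and an illustrative kernel name reduceSum that is not taken from the paper.

#include <cuda_runtime.h>

// Hypothetical reduction kernel sketch (not the paper's code).
// Techniques shown: each thread accumulates several elements (per-thread
// workload), tree reduction in shared memory with sequential addressing
// (no bank conflicts), and an unrolled final warp (loop unrolling /
// fewer instructions). Assumes blockDim.x is a power of two and >= 64;
// the warp-synchronous final step matches GPUs of the paper's era and
// would need __syncwarp() on current hardware.
__global__ void reduceSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    // Thread workload optimization: sum many elements per thread first.
    float sum = 0.0f;
    while (i < n) {
        sum += in[i];
        if (i + blockDim.x < n) sum += in[i + blockDim.x];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Sequential addressing avoids shared-memory bank conflicts.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Loop unrolling: the last warp is fully unrolled, with no further
    // __syncthreads() needed inside a single warp on this era's hardware.
    if (tid < 32) {
        volatile float *v = sdata;
        v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];
        v[tid] += v[tid + 1];
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Example launch (one partial sum per block; the partial sums are then
// reduced by a second kernel call or on the CPU):
//   reduceSum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);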
Source
《计算机应用研究》
CSCD
Peking University Core Journal (北大核心)
2009, No. 11, pp. 4115-4118 (4 pages)
Application Research of Computers
Funding
Supported by the National "863" High-Tech Program of China (classified project)
Keywords
graphics processing unit (GPU)
parallel optimization
reduction
compute unified device architecture (CUDA)