An enhanced GPU reduction at the warp-level

Abstract In recent years, graphical processing unit (GPU)-accelerated intelligent algorithms have been widely used to solve combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the most suitable candidate solution in the neighborhood is selected. Because reduction is one of the main procedures, it needs to be optimized on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of an implementation at the warp level, which better matches the characteristics of current GPU architectures. Firstly, vectorized access is used to improve global memory access performance. Secondly, at the level of thread block reduction, an enhanced warp-based reduction on shared memory is presented to form partial results. Thirdly, the number of thread blocks is derived from the maximum thread block size and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs, where it performs better than previous methods.
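The reduction described in the abstract can be illustrated with a minimal CPU-side sketch in Python. It is not the authors' implementation and is not GPU code: it only models the data flow of a warp-level tree reduction that selects the lowest-cost candidate. The names `warp_reduce_min` and `block_reduce_min`, the `(cost, index)` pair representation, and the default block size of 128 are assumptions made for illustration; the halving-offset exchange is modeled on the shuffle-down pattern commonly used for warp reductions in CUDA.

```python
import math

WARP_SIZE = 32  # number of lanes in a warp on NVIDIA GPUs


def warp_reduce_min(lanes):
    """Tree-reduce 32 (cost, index) pairs; lane 0 ends up with the minimum."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # All lanes read their partner's value from before this step,
        # as a shuffle-down style exchange would.
        prev = list(vals)
        for lane in range(WARP_SIZE):
            partner = lane + offset
            if partner < WARP_SIZE:
                vals[lane] = min(prev[lane], prev[partner])
        offset //= 2
    return vals[0]


def block_reduce_min(costs, block_size=128):
    """Return (best_cost, best_index) for a list of candidate costs.

    Hypothetical model of one thread block: per-thread partial results,
    then one reduction per warp, then a final reduction over warp results.
    """
    assert block_size % WARP_SIZE == 0 and block_size <= WARP_SIZE * WARP_SIZE
    identity = (math.inf, -1)  # neutral element for the min reduction

    # Stage 1: each simulated thread scans a strided slice of the input
    # and keeps its local best candidate.
    partial = [identity] * block_size
    for tid in range(block_size):
        for i in range(tid, len(costs), block_size):
            partial[tid] = min(partial[tid], (costs[i], i))

    # Stage 2: every warp reduces its own 32 lanes independently.
    warp_best = [
        warp_reduce_min(partial[w:w + WARP_SIZE])
        for w in range(0, block_size, WARP_SIZE)
    ]

    # Stage 3: the per-warp results are padded to one warp and reduced again.
    warp_best += [identity] * (WARP_SIZE - len(warp_best))
    return warp_reduce_min(warp_best)
```

For example, `block_reduce_min([7.0, 3.5, 9.0, 1.25, 8.0])` returns `(1.25, 3)`, the lowest cost and its position. Because the pairs are compared as tuples, ties on cost resolve to the smallest index; a real kernel would need to choose its own tie-breaking rule.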
Source Computer Aided Drafting, Design and Manufacturing, 2016, No. 2, pp. 43-52 (10 pages)
Funding Supported by the National Natural Science Foundation of China (61472289) and the Natural Science Foundation of Hubei Province (2015CFB254)
Keywords reduction; graphical processing unit; compute unified device architecture; warp-level reduction