Parallel Computation Performance Analysis Model Based on GPU (基于GPU的并行计算性能分析模型)

Cited by: 3

Abstract: GPU-based parallel computing lacks both accurate performance analysis models and targeted optimization methods. To address this, we propose a quantitative performance analysis model for GPU-based parallel computation. The model covers three major components of GPU execution: the instruction pipeline, shared memory access, and global memory access. By modeling each of them, it lets programmers analyze a parallel program quantitatively, locate its performance bottleneck, and optimize it effectively. In the experiments, we analyze three representative real-world applications: dense matrix multiplication, a tridiagonal linear system solver, and sparse matrix-vector multiplication. The results demonstrate the practicality of the model and lead to effective optimization of the algorithms.

Authors: Wang Zhuowei (王卓薇), Cheng Lianglun (程良伦), Zhao Wuqing (赵武清)

Affiliation: School of Computers, Guangdong University of Technology (广东工业大学计算机学院)

Source: Computer Science (《计算机科学》), 2014, No. 1, pp. 31-38 (8 pages). Indexed in CSCD and the Peking University Core Journals list (北大核心).

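Funding: Supported by the Guangzhou Science and Technology Project (2012Y2-0031), the Postdoctoral Fund (2013M531825), and the National Natural Science Foundation of China (U1201251).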

Keywords: GPU, quantitative performance model, instruction pipeline, shared memory access time, global memory access time

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
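
The abstract describes locating a bottleneck by modeling three components separately. The following is a minimal sketch of that general idea, not the authors' actual model: it assumes each component's time is simply work divided by a sustained throughput, and that the three components overlap perfectly so the slowest one sets the kernel time. The class names, field names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    """Hypothetical per-kernel work counts (not taken from the paper)."""
    instructions: float         # instructions issued by the kernel
    shared_transactions: float  # shared memory transactions
    global_transactions: float  # global memory transactions


@dataclass
class DeviceRates:
    """Hypothetical sustained throughputs, in units of work per second."""
    instruction_rate: float
    shared_rate: float
    global_rate: float


def component_times(profile: KernelProfile, rates: DeviceRates) -> dict:
    """Estimate the time each component would need on its own (seconds)."""
    return {
        "instruction pipeline": profile.instructions / rates.instruction_rate,
        "shared memory": profile.shared_transactions / rates.shared_rate,
        "global memory": profile.global_transactions / rates.global_rate,
    }


def find_bottleneck(profile: KernelProfile, rates: DeviceRates) -> tuple:
    """Return (component name, time) for the slowest component.

    Assumption: the components overlap perfectly, so the largest
    per-component time approximates the total kernel time.
    """
    times = component_times(profile, rates)
    name = max(times, key=times.get)
    return name, times[name]


if __name__ == "__main__":
    # Made-up counts and rates, used only to exercise the functions.
    profile = KernelProfile(instructions=2e9,
                            shared_transactions=5e8,
                            global_transactions=1e8)
    rates = DeviceRates(instruction_rate=1e12,
                        shared_rate=5e11,
                        global_rate=2e10)
    for component, seconds in component_times(profile, rates).items():
        print(f"{component}: {seconds * 1e3:.3f} ms")
    print("bottleneck:", find_bottleneck(profile, rates)[0])
```

With these made-up inputs the global memory term is the largest, so an optimization effort would start there, for example by improving access coalescing or moving reused data into shared memory. A real model would need measured instruction and transaction counts plus device-specific latencies, which this sketch does not attempt to capture.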

