Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique

Abstract: Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like the Internet of Things (IoT), autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge that often relies on trial-and-error optimization. This paper introduces an optimization technique for CUDA programs based on a novel Data Layout strategy, which restructures the arrangement of data in memory to significantly enhance data access locality. The technique is applied to the dynamic programming algorithm for chained matrix multiplication, a critical operation in domains including artificial intelligence (AI), high-performance computing (HPC), and IoT, where it makes memory accesses more localized. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique's broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
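
The abstract does not spell out the layout itself, so the following is only a minimal host-side C++ sketch of one layout that fits the description, not the authors' implementation. It assumes a packed, diagonal-major table for the matrix-chain dynamic program: all cells on the same diagonal are stored contiguously. The names `diag_index` and `matrix_chain_cost` are hypothetical and introduced here for illustration. In a CUDA version of this sketch, each iteration of the loop over `i` would map to one thread, so neighbouring threads would read and write neighbouring addresses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: offset of cell (i, i + d) in a packed, diagonal-major
// table for a chain of n matrices. Diagonal d holds n - d cells, so the
// diagonals 0..d-1 before it occupy d * (2n - d + 1) / 2 slots.
inline std::size_t diag_index(std::size_t n, std::size_t d, std::size_t i) {
    assert(d < n && i + d < n);
    return d * (2 * n - d + 1) / 2 + i;
}

// Minimum number of scalar multiplications needed to multiply a chain of
// matrices, where matrix t has dimensions p[t] x p[t + 1].
std::uint64_t matrix_chain_cost(const std::vector<std::uint64_t>& p) {
    if (p.size() < 2) return 0;
    const std::size_t n = p.size() - 1;  // number of matrices in the chain

    // Packed table: n * (n + 1) / 2 cells instead of a full n * n array.
    std::vector<std::uint64_t> m(n * (n + 1) / 2, 0);

    for (std::size_t d = 1; d < n; ++d) {          // diagonals, in order
        for (std::size_t i = 0; i + d < n; ++i) {  // assumed: one GPU thread per i
            const std::size_t j = i + d;
            std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
            for (std::size_t k = i; k < j; ++k) {
                // Left sub-chain (i..k) lies on diagonal k - i at position i;
                // right sub-chain (k+1..j) lies on diagonal j - k - 1 at k + 1.
                const std::uint64_t cost =
                    m[diag_index(n, k - i, i)] +
                    m[diag_index(n, j - k - 1, k + 1)] +
                    p[i] * p[k + 1] * p[j + 1];
                best = std::min(best, cost);
            }
            m[diag_index(n, d, i)] = best;
        }
    }
    return m[diag_index(n, n - 1, 0)];  // cost for the whole chain
}
```

For example, calling `matrix_chain_cost({10, 30, 5, 60})` evaluates the chain of a 10x30, a 30x5, and a 5x60 matrix. In this sketch the packed table needs n(n+1)/2 cells rather than n², and for a fixed split offset both sub-chain reads of adjacent `i` values land on adjacent addresses, which is the kind of locality the abstract refers to.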

Authors: Neda Seifi; Abdullah Al-Mamun (Department of Computer & Cyber Sciences, SCCS, Augusta University, Augusta, Georgia, USA)

Affiliation: Department of Computer & Cyber Sciences

Source: Journal of Computer and Communications, 2024, Issue 5, pp. 124-139 (16 pages)

Keywords: Data Layout Optimization; CUDA Performance Optimization; GPU Memory Optimization; Dynamic Programming; Matrix Multiplication; Memory Access Pattern Optimization in CUDA

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]

