
Vectorization Optimization of LDS Memory Access for DCU

Cited by: 1
Abstract In the Deep Computing Unit (DCU), a domestic general-purpose accelerator, the Local Data Shared (LDS) is a key storage component with lower latency and higher bandwidth than global memory. As heterogeneous programs use the LDS more and more frequently, inefficient LDS memory access has become an important factor limiting their performance. In addition, LDS accesses are subject to bank conflicts, so they must follow certain principles to use the LDS efficiently; when data accesses across threads overlap, vectorized access instructions incur additional delay. To address this problem, an LDS memory access vectorization optimization method for the DCU is proposed. By vectorizing contiguous data accesses, the method reduces the number of LDS accesses and the time spent on memory access, thereby improving the memory access efficiency of the program. On this basis, a method for determining memory access characteristics is designed and an LDS access vectorization method that effectively handles overlapping data is proposed, yielding an efficient LDS memory access technique for domestic general-purpose accelerators and ensuring that vectorization actually improves memory access efficiency. Experimental results show that, for heterogeneous programs that use the LDS, LDS access vectorization improves program performance by 22.6% on average, which verifies the effectiveness of the proposed method. The vectorization method also optimizes the overlap of memory access data between LDS threads, improving the performance of heterogeneous programs by 30% on average.
Authors YANG Sichi; ZHAO Rongcai; HAN Lin; WANG Hongsheng (School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, Henan, China; National Supercomputing Center in Zhengzhou, Zhengzhou 450000, Henan, China)
Source Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2024, No. 2, pp. 206-213 (8 pages)
Funding Major Science and Technology Project of Henan Province (221100210600).
Keywords Deep Computing Unit (DCU); Local Data Shared (LDS); memory access vectorization; memory access characteristic; bank conflict
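
The abstract's two ideas, reducing the number of LDS accesses by vectorizing contiguous reads and checking whether the reads of neighbouring threads overlap, can be illustrated with a small host-side model. This is only a sketch under stated assumptions and not the authors' method: the function names, the per-thread `stride` description of an access pattern, and the 64-thread group used in the note below are all hypothetical, and no device code or DCU-specific API is involved.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model (not from the paper): thread t reads `n` consecutive
// LDS elements starting at element index t * stride.
enum class AccessPattern { Disjoint, Overlapping };

// Neighbouring threads touch common elements exactly when the distance
// between their start indices is smaller than the run each thread reads.
AccessPattern classify(std::size_t stride, std::size_t n) {
    return stride < n ? AccessPattern::Overlapping : AccessPattern::Disjoint;
}

// Load instructions one thread issues to read `n` consecutive elements when
// a single instruction moves `width` elements (width 1 models scalar access).
std::size_t loads_per_thread(std::size_t n, std::size_t width) {
    return (n + width - 1) / width;  // ceiling division
}

// Distinct LDS elements touched by `threads` threads (threads >= 1).
// With overlap, each extra thread contributes only `stride` new elements.
std::size_t distinct_elements(std::size_t threads, std::size_t stride,
                              std::size_t n) {
    if (classify(stride, n) == AccessPattern::Disjoint) {
        return threads * n;
    }
    return (threads - 1) * stride + n;
}
```

In this model, reading four consecutive elements takes four scalar loads per thread but a single four-element load. For an assumed group of 64 threads, a stride of 4 touches 256 distinct elements, while a stride of 1 requests the same 256 reads over only 67 distinct elements. That redundancy is the kind of overlapping access characteristic a vectorization scheme would need to detect before choosing how to load.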
