A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture
Abstract Emerging deep learning recommendation models (DLRMs) are widely used in modern recommendation systems. Their distinctive embedding layer, commonly with tens of trillions of parameters, induces massive irregular, sparse accesses to storage resources, which have become the performance bottleneck of model inference. Existing inference systems rely on the CPU to access embedding parameters held in DRAM and on SSDs. This architecture suffers from two problems, excessive CPU-GPU communication overhead and redundant memory copies, which increase embedding-layer access latency and limit inference performance. This paper proposes GDRec, a recommendation model inference system built on a GPU direct storage access architecture. The core idea of GDRec is to remove the CPU from the access path of embedding parameters and let the GPU access DRAM and SSD directly in a zero-copy manner. For direct access to DRAM, GDRec uses the unified virtual addressing feature of CUDA (Compute Unified Device Architecture) so that GPU kernels can issue fine-grained accesses to host DRAM, and it introduces two optimizations, access coalescing and access alignment, to make full use of the available memory access performance. For direct access to SSD, GDRec implements a lightweight NVMe driver on the GPU, which lets the GPU submit I/O commands that read data from the SSD straight into GPU memory without extra copies in DRAM. GDRec also exploits the massive parallelism of the GPU to shorten the submission time of I/O commands. Experiments on three public click-through-rate prediction datasets show that GDRec improves inference throughput by up to 1.9× over NVIDIA HugeCTR, a highly optimized inference system based on the CPU-mediated access architecture.
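
The abstract names access coalescing and access alignment without giving details, so the following is only a toy model of why the two ideas matter in general, not GDRec's implementation. It assumes that a warp's reads are served in aligned 128-byte segments (an illustrative figure; real granularities depend on the hardware and interconnect), and every function name here is hypothetical.

```python
# Toy model: count how many aligned segments one warp's reads touch.
# SEGMENT_BYTES is an assumed transaction granularity, not a measured value.
SEGMENT_BYTES = 128
WARP_SIZE = 32
FLOAT_BYTES = 4


def segments_touched(byte_addresses, segment_bytes=SEGMENT_BYTES):
    """Return the number of distinct aligned segments covered by the addresses."""
    return len({addr // segment_bytes for addr in byte_addresses})


def row_per_thread(row_ids, row_bytes, base=0):
    """Uncoalesced pattern: each lane reads the first float of its own row."""
    return [base + r * row_bytes for r in row_ids]


def row_per_warp(row_id, row_bytes, base=0):
    """Coalesced pattern: the 32 lanes read 32 consecutive floats of one row."""
    start = base + row_id * row_bytes
    return [start + lane * FLOAT_BYTES for lane in range(WARP_SIZE)]


# A 32-dimensional float32 embedding row occupies exactly one 128-byte segment.
row_bytes = WARP_SIZE * FLOAT_BYTES

# 32 lanes, 32 different rows: every lane lands in a different segment.
scattered = segments_touched(row_per_thread(range(0, 320, 10), row_bytes))
# One warp cooperates on one row whose start is segment-aligned.
coalesced = segments_touched(row_per_warp(7, row_bytes))
# Same cooperative read, but the table starts 64 bytes off a segment boundary.
misaligned = segments_touched(row_per_warp(7, row_bytes, base=64))

print(scattered, coalesced, misaligned)  # prints: 32 1 2
```

In this model, letting a warp cooperate on one row cuts the request count from 32 to 1, and a table that starts off a segment boundary doubles the cost of every row read. That is the general motivation for pairing the two optimizations.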
Authors Xie Minhui (谢旻晖); Lu Youyou (陆游游); Feng Yangyang (冯杨洋); Shu Jiwu (舒继武) (Department of Computer Science and Technology, Tsinghua University, Beijing 100084)
Source Journal of Computer Research and Development (《计算机研究与发展》), 2024, No. 3, pp. 589-599 (11 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding National Natural Science Foundation of China, Excellent Young Scientists Fund (62022051).
Keywords GPU direct storage access; parameter store; recommendation system; inference system; storage system