A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture

Abstract: Emerging deep learning recommendation models (DLRM) are widely used in modern recommendation systems. Their distinctive embedding layer, commonly holding tens of trillions of parameters, induces massive irregular, sparse accesses to storage resources, which have become the performance bottleneck of model inference. Existing inference systems rely on the CPU to access embedding parameters in DRAM and on SSD. This architecture suffers from two problems, high CPU-GPU communication overhead and redundant memory copies, which increase embedding-layer access latency and limit inference performance. This paper proposes GDRec, a recommendation model inference system based on a GPU direct storage access architecture. The core idea of GDRec is to remove the CPU from the access path of embedding parameters and let the GPU access DRAM and SSD directly in a zero-copy manner. For direct access to DRAM, GDRec uses the unified virtual addressing feature of CUDA (Compute Unified Device Architecture) to let GPU kernels issue fine-grained accesses to host DRAM, and introduces two optimizations, access coalescing and access aligning, to improve DRAM access performance. For direct access to SSD, GDRec implements a lightweight NVMe driver on the GPU, so that the GPU submits I/O commands that read data from SSD (solid state disk) directly into GPU memory without extra copies in DRAM. GDRec also exploits the GPU's massive parallelism to shorten the submission time of I/O commands. Experiments on three public click-through-rate prediction datasets show that GDRec outperforms NVIDIA HugeCTR, a highly optimized inference system built on the CPU-mediated access architecture, improving inference throughput by up to 1.9 times.
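The abstract names access coalescing and access aligning without giving details, so the following is only a toy cost model of why such optimizations could matter when a GPU reads embedding rows from host memory. It is not GDRec's implementation. The 128-byte transaction size, the flat table layout, and every function name here are assumptions made for illustration.

```python
# Toy cost model (NOT GDRec's code): count fixed-size memory transactions
# needed to read embedding rows from a flat table in host memory.

SEGMENT_BYTES = 128  # assumed transaction granularity (hypothetical value)


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def segments_touched(start: int, length: int, segment: int = SEGMENT_BYTES) -> int:
    """Number of segment-sized blocks overlapped by bytes [start, start + length)."""
    if length <= 0:
        return 0
    first = start // segment
    last = (start + length - 1) // segment
    return last - first + 1


def row_offset(row_id: int, row_bytes: int, base: int = 0,
               aligned: bool = False, segment: int = SEGMENT_BYTES) -> int:
    """Byte offset of a row in the table.

    With aligned=True, the table base and the row stride are both padded
    up to a multiple of the segment size.
    """
    stride = row_bytes
    if aligned:
        base = round_up(base, segment)
        stride = round_up(row_bytes, segment)
    return base + row_id * stride


def uncoalesced_transactions(row_ids, row_bytes: int, elem_bytes: int = 4) -> int:
    """Assumed worst case: each element of each row costs its own transaction
    (for example, one thread walking a whole row by itself)."""
    return len(row_ids) * (row_bytes // elem_bytes)


def coalesced_transactions(row_ids, row_bytes: int, **layout) -> int:
    """Adjacent threads read adjacent elements of one row, so every segment
    that the row overlaps is fetched exactly once."""
    return sum(
        segments_touched(row_offset(r, row_bytes, **layout), row_bytes)
        for r in row_ids
    )


if __name__ == "__main__":
    ids = [3, 17, 42, 99]   # sparse, irregular row indices
    row_bytes = 32 * 4      # 32 float32 values per embedding row (assumed)
    print(uncoalesced_transactions(ids, row_bytes))
    print(coalesced_transactions(ids, row_bytes, base=64))
    print(coalesced_transactions(ids, row_bytes, base=64, aligned=True))
```

Under these assumptions, the four lookups cost 128 transactions when uncoalesced, 8 when coalesced but misaligned by 64 bytes, and 4 when coalesced and aligned. Real transaction sizes and access behaviour depend on the GPU and interconnect, so the numbers only illustrate the direction of the effect.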

Authors: Xie Minhui; Lu Youyou; Feng Yangyang; Shu Jiwu (Department of Computer Science and Technology, Tsinghua University, Beijing 100084)

Affiliation: Department of Computer Science and Technology, Tsinghua University

Source: Journal of Computer Research and Development (计算机研究与发展), 2024, Issue 3, pp. 589-599 (11 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China, Excellent Young Scientists Fund (62022051).

Keywords: GPU direct storage access; parameter store; recommendation system; inference system; storage system

Classification: TP302.1 [Automation and Computer Technology: Computer System Architecture]

