Abstract
The explosive growth of data has raised the demands on data storage and processing. GPU databases, an important branch of new-hardware databases, have distinctive advantages in high-capacity and high-performance processing. As representatives of high-performance databases, they have attracted attention from both academia and industry in recent years, and a number of representative research results and landmark products have emerged. The technical development of GPU databases follows two routes: GPU-accelerated and GPU-memory-based. Both routes have corresponding prototype systems or products. Although the two routes differ in implementation, their basic functions and core technologies are similar: query compilation, query optimization, query execution, and storage management. The rapid development of new hardware offers more possibilities for data processing, storage, and transmission. Beyond PCIe, current data transmission solutions such as NVLink, RDMA, and CXL provide more options for transferring data between different processors. Most GPU databases store data in a columnar model, while a few (such as PG-Strom) support both storage models. On columnar storage, compression can reduce both storage space and transmission latency. GPU databases generally adopt lightweight compression schemes, so that compression and decompression account for only a small share of the overall data processing time and do not significantly increase the system's time overhead. Building and maintaining indexes on GPU databases should likewise be lightweight and should not incur significant system overhead. Compilation time directly affects query performance; JIT compilation, with its short compilation time and high efficiency, is the mainstream approach in GPU databases. Operators have a significant impact on query performance, and join, group aggregation, and OLAP operators are the three most studied types. In most current studies, join and group aggregation operators are investigated together, and the execution of join operators is considered together with the join order of tables. OLAP operators are another extensively studied class, and the advantages of GPU databases for analytical workloads continue to draw researchers' attention. GPU databases have three query processing models: row processing, column processing, and vectorized processing. Vectorized processing and column processing are more commonly used in practical systems. In addition, with the development of GPU-accelerated database technology, a number of research results on query schemes and query engines for the CPU-GPU collaborative processing model have emerged. Query optimization research for GPU databases mainly covers three aspects: multi-table join order, query rewriting, and cost models. However, there is not yet a good solution for the cost evaluation model of GPU databases, so query optimization still has considerable room for future research. Transactions, a major feature and an important function of database systems, are mature and comprehensive in disk-based databases, yet they have not been well studied in GPU databases. Although individual prototype systems exist, current research has not made significant progress. This paper summarizes existing research results on the key technologies of GPU databases, points out the current problems and challenges they face, describes their overall development trends and evolution, identifies the most promising research topics at present, and provides an outlook on future research directions.
Authors
刘鹏
陈红
张延松
李翠平
LIU Peng; CHEN Hong; ZHANG Yan-Song; LI Cui-Ping (Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China, Beijing 100872; School of Information, Renmin University of China, Beijing 100872; Engineering Research Center for Database and Business Intelligence (MOE), Renmin University of China, Beijing 100872)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, No. 11, pp. 2691-2724 (34 pages)
Chinese Journal of Computers
Funding
Supported by the National Natural Science Foundation of China (62072460, 62076245, 62172424, 62276270) and the Beijing Natural Science Foundation (4212022).