Abstract
With the support of new hardware such as multicore processors, large memory, and non-volatile memory, heterogeneous storage and computing platforms have become the mainstream high-performance computing platforms. Traditional database engines use an integrated design, whereas emerging databases adopt separation of storage and compute together with operator pushdown to better fit new distributed storage architectures. This paper proposes a novel in-memory database implementation technique based on the separation of management, compute, and storage. Building on the separation of storage and compute, it further divides the dataset into a metadata set and a value set according to the database schema, data distribution, and workload characteristics, and decomposes the unified query engine into a metadata management engine, a compute engine, and a storage engine. Metadata management, which carries semantic information, is abstracted as an independent management layer; storage and computation over semantics-free values are abstracted as the compute and storage layers, with compute-intensive workloads assigned to the compute layer and data-intensive workloads assigned to the storage layer. The compute and storage layers can be separated or merged depending on the hardware platform. The in-memory database implementation is organized into four levels: 1) schema optimization, which separates the values from the metadata in database storage and chooses storage and compute strategies according to the intrinsic properties of the data; 2) model optimization, which adopts the Fusion OLAP model to achieve high-performance multidimensional computing on the relational storage model; 3) algorithm optimization, which uses surrogate key indexes and vector indexes to support optimized vector join and vector aggregation algorithms and improve OLAP performance; 4) system design optimization, which uses a layered database engine to separate management from compute and storage from compute, and to push multidimensional compute operators down to the storage layer. Experimental results show that the management-compute-storage separation model flexibly supports CPU-GPU heterogeneous computing platforms, DRAM-PM (Persistent Memory) heterogeneous storage platforms, and external storage platforms. With the open-source in-memory column store Arrow as the storage engine for the database's values, and with multidimensional compute operators pushed down to the Arrow storage engine, the OLAP implementation performs on par with an in-memory OLAP implementation that combines storage and compute on the Star Schema Benchmark (SSB), and its query performance exceeds that of the mainstream in-memory databases Hyper and OmniSciDB and of the Arrow-based GPU database PG-Strom.
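The abstract names surrogate key indexes, vector indexes, vector join, and vector aggregation without showing how they fit together. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes each dimension row's position serves as its dense surrogate key, that a dimension "vector index" maps a surrogate key to a group id (or -1 when the row is filtered out), and that aggregation accumulates into a flat array addressed by the combined group ids. All function names, column names, and sample data are hypothetical.

```python
def build_dim_vector(dim_rows, predicate, group_key):
    """Build a vector index for one dimension (hypothetical helper).

    vector[sk] is the group id of the dimension row whose surrogate key
    is sk, or -1 if the row does not satisfy the predicate.
    """
    groups = {}
    vector = []
    for row in dim_rows:  # row position doubles as the surrogate key
        if predicate(row):
            gid = groups.setdefault(group_key(row), len(groups))
            vector.append(gid)
        else:
            vector.append(-1)
    return vector, list(groups)  # group labels in group-id order


def vector_join_aggregate(fact_fks, measures, dim_vectors, dim_sizes):
    """Join fact rows to dimension vectors by position and sum a measure.

    fact_fks[d][i] is the surrogate key of dimension d for fact row i.
    The result is a flat aggregate vector addressed in row-major order
    by the per-dimension group ids.
    """
    total = 1
    for size in dim_sizes:
        total *= size
    agg = [0] * total
    for i, measure in enumerate(measures):
        addr = 0
        for fk_col, vec, size in zip(fact_fks, dim_vectors, dim_sizes):
            gid = vec[fk_col[i]]  # positional lookup, no hash probe
            if gid < 0:
                break  # fact row filtered out by this dimension
            addr = addr * size + gid
        else:
            agg[addr] += measure
    return agg


# Hypothetical sample data: two dimensions and a four-row fact table.
date_dim = [{"year": 1997}, {"year": 1998}, {"year": 1997}]
cust_dim = [{"region": "ASIA"}, {"region": "EUROPE"}]

date_vec, date_groups = build_dim_vector(
    date_dim, lambda r: True, lambda r: r["year"])
cust_vec, cust_groups = build_dim_vector(
    cust_dim, lambda r: r["region"] == "ASIA", lambda r: r["region"])

date_fk = [0, 1, 2, 1]
cust_fk = [0, 0, 1, 0]
revenue = [10, 20, 30, 40]

agg = vector_join_aggregate(
    [date_fk, cust_fk], revenue,
    [date_vec, cust_vec], [len(date_groups), len(cust_groups)])
```

In this sketch the metadata (group labels such as years and regions) stays in `date_groups` and `cust_groups`, while the join and aggregation loop touches only integer vectors, which loosely mirrors the value/metadata split the abstract describes.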
Authors
张延松
韩瑞琛
刘专
张宇
ZHANG Yan-Song; HAN Rui-Chen; LIU Zhuan; ZHANG Yu (Key Laboratory of Data Engineering and Knowledge Engineering Renmin University, Ministry of Education, Beijing 100872; School of Information, Renmin University of China, Beijing 100872; National Survey Research Center at Renmin University of China, Beijing 100872; Intel China Research Center Ltd, Beijing 100190; National Satellite Meteorological Centre, Beijing 100081, China)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2023, Issue 4, pp. 761-779 (19 pages)
Chinese Journal of Computers
Funding
National Natural Science Foundation of China (61732014, 61772533)
Beijing Natural Science Foundation (4192066)
Keywords
in-memory database
separation of data and metadata
separation of storage and compute
separation of management, compute, and storage
vector index