Abstract
In big data cluster environments, random access is inefficient, so approximate query processing methods based on row-level sampling are slow to construct samples. This paper exploits the block-based storage of data in cluster environments and samples at the block level instead. Experiments on benchmark and real-world datasets show that the method reduces the fraction of data read and improves query response time while maintaining high query accuracy. In the experiments, reading less than 20% of the data is enough to obtain a query error below 5%, and the feature data precomputed for each block of a dataset requires less than 0.04% of the storage space occupied by the dataset.
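The idea of sampling whole blocks rather than individual rows can be illustrated with a minimal sketch. This is not the paper's method, which (judging by the keywords) also involves clustering and precomputed per-block features; it shows only plain uniform block sampling for a SUM query. The function name `estimate_sum_by_block_sampling`, its parameters, and the in-memory list-of-lists representation of blocks are assumptions made for illustration.

```python
import random


def estimate_sum_by_block_sampling(blocks, sample_fraction, seed=0):
    """Estimate the total of all values while reading only a subset of blocks.

    `blocks` is a list of lists of numbers, standing in for storage blocks.
    Returns (estimated_sum, fraction_of_blocks_read).
    """
    rng = random.Random(seed)
    n_blocks = len(blocks)
    # Read at least one block, otherwise nothing can be estimated.
    n_sampled = max(1, round(n_blocks * sample_fraction))
    sampled_ids = rng.sample(range(n_blocks), n_sampled)

    # Only the sampled blocks are read, each one sequentially in full.
    sampled_total = sum(sum(blocks[i]) for i in sampled_ids)

    # Each block is selected with probability n_sampled / n_blocks,
    # so the sampled total is scaled up by the inverse of that probability.
    estimate = sampled_total * n_blocks / n_sampled
    return estimate, n_sampled / n_blocks


if __name__ == "__main__":
    data_rng = random.Random(42)
    blocks = [[data_rng.randint(0, 100) for _ in range(1000)] for _ in range(50)]
    exact = sum(sum(b) for b in blocks)
    estimate, read_fraction = estimate_sum_by_block_sampling(blocks, 0.2)
    print(f"read {read_fraction:.0%} of blocks, "
          f"relative error {abs(estimate - exact) / exact:.2%}")
```

With uniform sampling the error depends on how much block totals differ from one another, which is presumably why a practical method would use per-block summary features to choose blocks more carefully.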
Source
《科技创新与应用》
2024, No. 24, pp. 19-22, 26 (5 pages)
Technology Innovation and Application
Funding
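National Natural Science Foundation of China, International (Regional) Cooperation and Exchange Program (62061136006)
National Natural Science Foundation of China, Key Program (61832004)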
Keywords
approximate query processing
clustering
block sampling
data skipping
feature computation