Abstract
Timeseries data is widely used in energy, manufacturing, finance, climate, and many other fields. Aggregate queries are common in timeseries analysis, and obtaining summaries of massive data quickly is important for efficient data analysis. Storing metadata is an effective way to accelerate aggregate queries. However, existing timeseries databases generally slice data with time windows, which requires sorting and partitioning data in real time and is hard to reconcile with the highly concurrent, high-throughput writes typical of IoT scenarios. This study therefore proposes a physical metadata management solution in Apache IoTDB for accelerating aggregate queries. The solution slices data according to the physical storage characteristics of data files, and combines synchronous and asynchronous computation strategies so that write performance is given priority. To handle the out-of-order data that is common in timeseries, a set of files with overlapping time ranges is abstracted into an out-of-order file group, and metadata is provided for each group. An aggregate query is then rewritten into three sub-queries that are executed efficiently over physical metadata and raw data. Experiments on multiple datasets verify the improvement in aggregate query efficiency achieved by the proposed solution, as well as the effect of different computation strategies on performance.
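The rewriting idea can be illustrated with a small Python sketch. The abstract does not say what the three sub-queries are, so the split used here is an assumption for illustration only: (1) single files fully covered by the query range are answered from file-level metadata, (2) out-of-order file groups fully covered by the range are answered from group-level metadata, and (3) files or groups only partially covered are answered by scanning raw data. All names (`DataFile`, `Stat`, `build_groups`, `aggregate`) are hypothetical and are not part of Apache IoTDB.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Toy data file: `points` maps timestamp -> value (hypothetical layout)."""
    points: dict
    version: int  # larger version = written later, overrides older values

    @property
    def start(self):
        return min(self.points)

    @property
    def end(self):
        return max(self.points)


@dataclass
class Stat:
    """Precomputed metadata that is sufficient for COUNT and SUM."""
    count: int = 0
    total: float = 0.0

    def merge(self, other):
        return Stat(self.count + other.count, self.total + other.total)


def merged_points(files):
    """Merge files in version order so newer writes overwrite older ones."""
    merged = {}
    for f in sorted(files, key=lambda f: f.version):
        merged.update(f.points)
    return merged


def stat_of(points, lo=None, hi=None):
    """Compute a Stat from raw points, optionally restricted to [lo, hi]."""
    vals = [v for t, v in points.items()
            if (lo is None or t >= lo) and (hi is None or t <= hi)]
    return Stat(len(vals), float(sum(vals)))


def build_groups(files):
    """Group files with overlapping time ranges and precompute metadata."""
    groups = []
    for f in sorted(files, key=lambda f: f.start):
        if groups and f.start <= groups[-1]["end"]:
            groups[-1]["files"].append(f)
            groups[-1]["end"] = max(groups[-1]["end"], f.end)
        else:
            groups.append({"files": [f], "start": f.start, "end": f.end})
    for g in groups:
        g["stat"] = stat_of(merged_points(g["files"]))
    return groups


def aggregate(groups, lo, hi):
    """Answer COUNT/SUM over [lo, hi] as three sub-queries (assumed split)."""
    file_meta, group_meta, raw = Stat(), Stat(), Stat()
    for g in groups:
        if g["end"] < lo or g["start"] > hi:
            continue  # no overlap with the query range
        covered = lo <= g["start"] and g["end"] <= hi
        if covered and len(g["files"]) == 1:
            file_meta = file_meta.merge(g["stat"])    # sub-query 1: file metadata
        elif covered:
            group_meta = group_meta.merge(g["stat"])  # sub-query 2: group metadata
        else:
            raw = raw.merge(stat_of(merged_points(g["files"]), lo, hi))  # sub-query 3: raw scan
    return file_meta.merge(group_meta).merge(raw)


files = [
    DataFile({1: 1.0, 2: 2.0, 3: 3.0}, version=1),
    DataFile({10: 1.0, 12: 2.0}, version=2),
    DataFile({11: 5.0, 12: 7.0}, version=3),  # overlaps the previous file
    DataFile({20: 4.0, 25: 6.0}, version=4),
]
result = aggregate(build_groups(files), lo=1, hi=22)
print(result)  # Stat(count=7, total=23.0)
```

In this toy run the first file is answered from its own metadata, the two overlapping files are answered from the metadata of their out-of-order group (timestamp 12 is counted once, with the newer value), and only the last file, which the query range cuts through, is scanned.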
Authors
赵东明
邱圆辉
康瑞
宋韶旭
黄向东
王建民
ZHAO Dong-Ming; QIU Yuan-Hui; KANG Rui; SONG Shao-Xu; HUANG Xiang-Dong; WANG Jian-Min (School of Software, Tsinghua University, Beijing 100084, China; National Engineering Research Center for Big Data Software (Tsinghua University), Beijing 100084, China; Beijing National Research Center for Information Science and Technology (Tsinghua University), Beijing 100084, China)
Source
《软件学报》
EI
CSCD
Peking University Core Journals
2023, Issue 3, pp. 1027-1048 (22 pages)
Journal of Software
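Funding
National Natural Science Foundation of China (62072265, 62021002)
National Key Research and Development Program of China (2021YFB3300500, 2019YFB1705301, 2019YFB1707001)
Youth Innovation Fund of the Beijing National Research Center for Information Science and Technology (BNR2022RC01011)
Ministry of Industry and Information Technology 2020 Emerging Platform Software Project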
Keywords
pre-aggregation
aggregate query
query rewriting
physical metadata management
timeseries database