摘要
针对数据流并行聚集问题,提出了一种不同于关系数据和时间序列数据处理的并行聚集方法。为解决已经划分出的数据流元组无法再现的特点,提出能够感知数据流变化的采样算法对数据流采样。利用近似等深柱状图技术描述采样数据的分布特征,平均分配数据流量。使用时间聚集森林结构计算时间窗聚集。通过验证采样个数对并行聚集的影响,数据分布对近似划分向量算法性能的影响,测试数据流量与并行聚集加速比的关系,证明本算法能够高效地计算数据流聚集查询。
In light of parallel aggregating algorithm for data stream, a new parallel aggregating strategy with relational data and temporal sequence data was put forward.To tackle the problem that partitioned tuples were unable to recur over data streams,a change-aware sampling algorithm was used to sample data streams first,and an approximate equal depth histogram algorithm was used to describe data distribution in query window to partition data streams averagely,and then a novel m-order Temporal Aggregate forest data structure was applied to computing aggregate in time window.It verified the effect of numbers of sampling on parallel aggregation,the effect of data distribution on the performance of approximate partition vector algorithm,and the relation between quantity of data streams and parallel aggregation speed ratio.Experiments prove that the proposed algorithm can provide accurate answer to parallel aggregation over data streams efficiently.
出处
《解放军理工大学学报(自然科学版)》
EI
2008年第1期29-33,共5页
Journal of PLA University of Science and Technology(Natural Science Edition)
关键词
数据流
并行处理
近似技术
柱状图
聚集
data streams
parallel processing
approximate technique
histogram
aggregation