摘要
目前较高效的时空热点查询算法是基于Spark分布式计算框架和Getis-Ord统计量的两阶段map-reduce算法。为了解决其在第一阶段map-reduce,遍历全部轨迹数据导致耗时严重及数据分布不均匀导致资源空闲的问题,本文提出一种对轨迹数据采样的方法S-RSampling(stratified-random sampling),通过分析轨迹数据的分布规律,确定采样规模,从而减轻数据分布不均的影响,大幅降低查询时间;为了解决在第二阶段map-reduce,计算大量无用立方单元格导致计算浪费的问题,本文提出一种阈值过滤方法TFiltering(threshold filtering),根据单元格属性值的分布,动态确定阈值T,将属性值top-T的立方单元格作为热点候选集,从而减少计算浪费。实验表明,本文所提出的优化方法在查询结果准确率不降低的情况下能大幅降低查询响应时间。
The current efficient algorithm of querying spatio-temporal hot spot is the two-stage map-reduce algorithm based on the Spark distributed computing framework and Getis-Ord statistics.A method S-RSampling(Stratified-Random Sampling)of sampling trajectory data was proposed to solve the problem that traversing all the trajectory data causes serious time consuming and the uneven data distribution causes the resources to be idle in the first stage map-reduce.The method analyzes the distribution of trajectory data to determine the sampling scale,thereby reducing the impact of uneven data distribution and query time greatly.A threshold filtering method TFiltering(Threshold Filtering)was proposed to solve the problem that processing a large number of useless cubes leads to calculation waste in the second stage map-reduce.The method dynamically determines the threshold according to the distribution of cubes’value and takes the cubes of value top-T as candidate sets of hot spots,which reduces the calculation waste.Experiments show that the optimization methods proposed in this paper can reduce the query response time greatly without reducing the accuracy of query results.
作者
李艳云
牛保宁
康家兴
LI Yanyun;NIU Baoning;KANG Jiaxing(College of Information and Computer,Taiyuan University of Technology,Taiyuan 030024,China)
出处
《太原理工大学学报》
CAS
北大核心
2020年第4期522-529,共8页
Journal of Taiyuan University of Technology
基金
国家重点研发计划基金资助项目(2017YFB1401001-01)
山西省重点研发计划基金资助项目(国际科技合作)(201903D421007)。
关键词
时空热点
分布式计算
规律采样
阈值过滤
热点候选集
spatio-temporal hot spot
distributed computing
regular sampling
threshold filtering
hot spot candidate set