Abstract
Multi-table join is one of the most important query types in big data analysis, but joins are expensive to execute, which delays the delivery of analysis results. Online aggregation addresses this by returning statistically meaningful estimates before the query completes. Existing online aggregation algorithms for multi-table joins sample uniformly at random from every table, which yields few join results per sample and low estimation accuracy for group-by join queries. To solve this problem, this paper proposes a multi-table join online aggregation technique based on Markov chains. It models the multi-table join as a random walk on a Markov chain, builds stratified samples at the starting layer of the walk once the join order has been determined, and designs the corresponding sampling strategy and estimation method. The technique is implemented on an online Hadoop platform, and experimental results show that its response time is better than that of existing algorithms and that it scales well.
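The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of estimating a grouped SUM over a chain join by random walks that start from a stratified first table. It is not the paper's algorithm: the inverse-probability weighting, the hash indexes, and every name (`build_index`, `walk_estimate_sum`, `stratified_group_sums`) are assumptions made for illustration, and the tables are plain in-memory lists of dicts rather than data on Hadoop.

```python
import random
from collections import defaultdict


def build_index(rows, key):
    """Hash index (hypothetical helper): join-key value -> list of rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def walk_estimate_sum(start_rows, steps, value_of, n_walks, rng):
    """Estimate SUM(value_of(path)) over a chain join with random walks.

    steps is a list of (index, key_of) pairs: key_of extracts the lookup key
    from the current row, and index maps that key to the next table's rows.
    Each walk is weighted by the inverse of its selection probability,
    and a walk that reaches a row with no join partner contributes zero.
    """
    if not start_rows:
        return 0.0
    total = 0.0
    for _ in range(n_walks):
        row = rng.choice(start_rows)
        weight = float(len(start_rows))  # 1 / P(pick this start row)
        path = [row]
        for index, key_of in steps:
            matches = index.get(key_of(row), [])
            if not matches:
                weight = 0.0  # failed walk: no join result on this path
                break
            row = rng.choice(matches)
            weight *= len(matches)  # 1 / P(pick this match)
            path.append(row)
        if weight:
            total += weight * value_of(path)
    return total / n_walks


def stratified_group_sums(start_rows, group_of, steps, value_of,
                          walks_per_group, seed=0):
    """Stratify the start table by group and run walks inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in start_rows:
        strata[group_of(row)].append(row)
    return {
        group: walk_estimate_sum(rows, steps, value_of, walks_per_group, rng)
        for group, rows in strata.items()
    }
```

Because every stratum receives its own walks, a small group is not starved of samples the way it can be under uniform sampling of the whole table, which is the motivation the abstract states for group-by join queries.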
Authors
史英杰
杜方
Shi Yingjie; Du Fang (School of Information Engineering, Beijing Institute of Fashion Technology, Beijing 100029, China; School of Information Engineering, Ningxia University, Yinchuan 750021, China)
Source
《计算机应用研究》
CSCD
Peking University Core Journal (北大核心)
2019, No. 12, pp. 3801-3805, 3810 (6 pages)
Application Research of Computers
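Funding
National Natural Science Foundation of China (61502279)
Science and Technology Program of the Beijing Municipal Education Commission (KM201710012008)
Special Fund for High-Level Faculty Development, Beijing Institute of Fashion Technology (BIFTQG201803)
Open Project of the Beijing Engineering Research Center for Digitalization of the Clothing Industry (KJCX1902-30299/009)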
Keywords
online aggregation
Markov chain
stratified sampling
multi-table join