Approximate Join Aggregate on Massive Data

Original title: 海量数据上的近似连接聚集操作
Cited by: 3

Abstract: Join aggregate is a commonly used but time-consuming database operation. Compared with exact queries, approximate results that satisfy a user-specified confidence interval are more attractive because they are returned much faster. No previous work can process approximate join aggregates both efficiently and for an arbitrarily specified confidence interval. This paper proposes a novel algorithm, (p, ε)-Approximate Join Aggregate (pε-AJA), which efficiently returns approximate join aggregate results that satisfy an arbitrary confidence interval. Two data structures are proposed and precomputed: the join random sample (JRS) and the join positional index pair table (JPIPT). pε-AJA first uses JRS to return approximate results quickly. If the results obtained from JRS do not satisfy the given confidence interval, pε-AJA uses JPIPT to obtain more random join tuples. A sampling algorithm draws a specified number of JPIPT tuples, and the join tuples they identify are then retrieved in a single sequential pass over the joined tables. Algorithms for efficiently constructing and maintaining JPIPT and JRS are also provided. Experimental results show that pε-AJA achieves a speedup of 1 to 5 orders of magnitude over exact queries, and that JPIPT and JRS can be constructed and maintained efficiently.
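The two-stage flow described in the abstract (answer from a precomputed sample first, then draw more join tuples only if the confidence interval is too wide) can be sketched roughly as follows. This is an illustrative sketch and not the paper's algorithm: the function names, the `(key, value)` row layout, the AVG aggregate over `r.val * s.val`, the normal-approximation interval, and sampling with replacement are all assumptions made here for the example. The list of position pairs only stands in for the idea of JPIPT.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def build_position_pairs(r, s):
    """Hypothetical JPIPT stand-in: (i, j) positions of all joining row pairs."""
    by_key = {}
    for j, (key, _) in enumerate(s):
        by_key.setdefault(key, []).append(j)
    return [(i, j) for i, (key, _) in enumerate(r) for j in by_key.get(key, [])]


def fetch_values(r, s, pairs):
    """Visit sampled pairs in position order, mimicking one sequential pass."""
    return [r[i][1] * s[j][1] for i, j in sorted(pairs)]


def approx_join_avg(r, s, pairs, jrs_values, p, eps,
                    batch=500, max_samples=100_000, seed=0):
    """Estimate AVG(r.val * s.val) over the join to relative error eps at
    confidence p, using a normal-approximation interval (an assumption)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + p) / 2)    # two-sided critical value
    sample = list(jrs_values)                # start from the precomputed sample
    while True:
        est = mean(sample)
        half = z * stdev(sample) / math.sqrt(len(sample))
        if half <= eps * abs(est) or len(sample) >= max_samples:
            return est, half, len(sample)
        # Interval too wide: draw more random join tuples via the pair table.
        extra = rng.choices(pairs, k=batch)  # uniform, with replacement
        sample.extend(fetch_values(r, s, extra))


# Toy tables of (join_key, value) rows; every row of s joins exactly one row of r.
r = [(k, 1.0 + k % 7) for k in range(500)]
s = [(k % 500, 2.0 + k % 5) for k in range(5000)]
pairs = build_position_pairs(r, s)
jrs = fetch_values(r, s, random.Random(1).sample(pairs, 200))  # stand-in JRS
est, half, n = approx_join_avg(r, s, pairs, jrs, p=0.95, eps=0.05)
print(f"estimate={est:.2f} +/- {half:.2f} from {n} of {len(pairs)} join tuples")
```

Sorting the sampled position pairs before fetching is what lets the extra tuples be read in one forward pass instead of by random access, which is the property the abstract attributes to retrieval through JPIPT.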

Authors: Han Xixian (韩希先), Yang Donghua (杨东华), Li Jianzhong (李建中)

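Affiliations: School of Computer Science and Technology, Harbin Institute of Technology; High Performance Computing Center, Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology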

Source: Chinese Journal of Computers (《计算机学报》; indexed in EI, CSCD, Peking University Core), 2010, No. 10, pp. 1919-1933 (15 pages)

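Funding: National Basic Research Program of China (973 Program) (2006CB303005); National Natural Science Foundation of China (60903016, 60533110, 60773063); Program for New Century Excellent Talents in University (NCET-05-0333); Science and Technology Research Project of the Heilongjiang Provincial Department of Education (11531276); NSFC-RGC of China (60831160525)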

Keywords: pε-approximate join aggregate (pε-AJA); join positional index pair table; join random sample; massive data

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
