PipelineJoin: A New MapReduce-Based Multi-Table Join Algorithm (cited by: 3)

Abstract: MapReduce, a parallel and distributed computing model, has been widely used to process join operations over two or more large tables. Existing MapReduce-based multi-table join algorithms have limitations when handling chain joins: some cannot join multiple large tables, and others must run many MapReduce jobs sequentially, which makes them inefficient. This paper proposes PipelineJoin, a MapReduce-based multi-table join algorithm that efficiently performs chain joins over an arbitrary number of large tables. PipelineJoin uses a pipeline model and a scheduler so that the Map tasks and Reduce tasks of the whole join execute in an overlapping, pipelined fashion, which improves the efficiency of multi-table joins and overcomes the shortcomings of existing chain join algorithms. Extensive experiments on synthetic datasets of various sizes show that PipelineJoin greatly reduces join time compared with existing chain join algorithms.
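To make the idea of a pipelined chain join concrete, the following is a minimal single-process sketch, not the PipelineJoin algorithm itself. It assumes each table is a list of tuples and that adjacent tables join on the last column of the left table and the first column of the right table. The names `join_stage` and `chain_join` are hypothetical, and Python generators merely stand in for the overlapping execution of stages; the paper's scheduler and actual MapReduce jobs are not modeled.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple

Row = Tuple  # a row is a plain tuple of column values


def join_stage(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Row]:
    """One join stage (hypothetical helper): match the last column of each
    left row against the first column of each right row."""
    # Rough analogue of map + shuffle: group the right table by join key.
    buckets = defaultdict(list)
    for row in right:
        buckets[row[0]].append(row)
    # Rough analogue of reduce: emit joined rows lazily, so the next stage
    # can consume them before this stage has processed all of its input.
    for row in left:
        for match in buckets.get(row[-1], []):
            yield row + match[1:]


def chain_join(tables: List[List[Row]]) -> Iterator[Row]:
    """Chain join T1 ⋈ T2 ⋈ ... ⋈ Tn as a pipeline of generator stages."""
    result: Iterator[Row] = iter(tables[0])
    for table in tables[1:]:
        result = join_stage(result, table)
    return result


# Illustrative data (invented for this sketch): R1(id, b), R2(b, c), R3(c, d).
r1 = [(1, "a"), (2, "b")]
r2 = [("a", 10), ("b", 20), ("a", 30)]
r3 = [(10, "x"), (30, "y")]

joined = list(chain_join([r1, r2, r3]))
print(joined)  # [(1, 'a', 10, 'x'), (1, 'a', 30, 'y')]
```

Because each stage is a generator, a row produced by the first join can flow into the second join immediately, which loosely mirrors the overlap between successive jobs that the abstract describes. A sequential chain join would instead materialize each intermediate result in full before starting the next job.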

Authors: Lin Ziyu (林子雨), Li Yuqian (李雨倩), Li Can (李粲), Lai Yongxuan (赖永炫)

Affiliations: School of Information Science and Technology, Xiamen University; Software School, Xiamen University

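Source: Journal of University of Science and Technology of China (JUSTC), indexed in CAS, CSCD, and the Peking University Core Journals list; 2015, No. 10, pp. 836-845 (10 pages)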

Funding: Supported by the National Natural Science Foundation of China (61303004 1202012) and the National Key Technology R&D Program (863) (2015BAH16F00/F01/F02)

Keywords: join; multi-table; MapReduce; PipelineJoin

Classification: TP338.8 [Automation and Computer Technology: Computer System Architecture]

