Abstract
MapReduce, a parallel and distributed computing model, has been widely used to process join operations over two or more large tables. Existing MapReduce-based multi-table join algorithms have limitations when handling chain joins: some cannot join multiple large tables, while others must run many MapReduce tasks sequentially, which makes them inefficient. This paper proposes PipelineJoin, a MapReduce-based multi-table join algorithm that efficiently performs chain joins over an arbitrary number of large tables. PipelineJoin uses a pipeline model and a scheduler to overlap the execution of the Map tasks and Reduce tasks across the whole join process, which improves the efficiency of multi-table joins and overcomes the shortcomings of existing chain join algorithms. Extensive experiments on synthetic datasets of different scales show that PipelineJoin can effectively reduce join time compared with existing chain join algorithms.
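For readers unfamiliar with chain joins, the following is a minimal single-process sketch of the sequential baseline that the abstract describes as inefficient: each adjacent pair of tables is joined by one simulated MapReduce round, and every round must finish before the next begins. It is not the PipelineJoin algorithm. The function names `map_reduce_join` and `chain_join`, the sample tables, and the column names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_reduce_join(left, right, key):
    """Simulate one MapReduce job: a reduce-side equi-join of two tables on `key`."""
    # Map phase (simulated): group rows from both tables by their join key,
    # keeping rows from the left and right table in separate lists.
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)

    # Reduce phase (simulated): for each key, pair every left row with every right row.
    joined = []
    for k in sorted(groups):
        left_rows, right_rows = groups[k]
        for l_row in left_rows:
            for r_row in right_rows:
                joined.append({**l_row, **r_row})
    return joined


def chain_join(tables, keys):
    """Join tables[0], tables[1], ... in order, one sequential job per adjacent pair."""
    result = tables[0]
    jobs = 0
    for table, key in zip(tables[1:], keys):
        # Each job consumes the complete output of the previous one,
        # so the jobs cannot overlap in this baseline.
        result = map_reduce_join(result, table, key)
        jobs += 1
    return result, jobs


# Hypothetical chain R1(a, b), R2(b, c), R3(c, d), joined on b and then on c.
r1 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
r2 = [{"b": 10, "c": 100}, {"b": 20, "c": 200}, {"b": 30, "c": 300}]
r3 = [{"c": 100, "d": "x"}, {"c": 200, "d": "y"}]

rows, jobs = chain_join([r1, r2, r3], ["b", "c"])
print(jobs, rows)
```

In this sketch a chain of n tables needs n - 1 rounds that run strictly one after another. The abstract states that PipelineJoin instead overlaps the Map and Reduce tasks of such a series through a pipeline model and a scheduler.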
Funding
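Supported by the National Natural Science Foundation of China (61303004, 1202012) and the National Key Technology R&D Program of China (863) (2015BAH16F00/F01/F02).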