Abstract
With the arrival of the big data era, distributed data processing has been widely adopted and developed. In cloud-based processing of massive data, complex processing requirements are increasingly common, and data analysis often spans multiple data sets, so an efficient multi-table join mechanism is urgently needed. Existing MapReduce-based multi-table join mechanisms mostly connect the data sets through a serial cascade of joins, which is flexible but inefficient. Building on an analysis of existing parallel join models, this paper proposes TD-HMJ, a hierarchical multi-table join model based on a two-dimensional node matrix. TD-HMJ processes all join attributes in a single Map pass. In the Reduce phase it builds a two-dimensional node matrix to join several groups of three (or two) tables in parallel, and it then joins the groups with one another through multi-level Reduce passes. Theoretical analysis and experiments show that TD-HMJ reduces the volume of data transferred, shortens multi-table join time, and improves join efficiency.
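The idea of joining three tables in one pass over a two-dimensional grid of Reduce nodes can be illustrated with a small in-memory sketch. This is not the TD-HMJ implementation from the paper; it is a generic grid-partitioned three-way join written under assumptions: the tables are R(a, b), S(b, c) and T(c, d), the grid size is chosen by hand, and every function name (`cell`, `map_phase`, `reduce_cell`, `three_way_join`) is hypothetical. The multi-level Reduce step that joins groups with one another is not modeled.

```python
from collections import defaultdict
from itertools import product


def cell(value, size):
    """Map a join-attribute value to a row or column index of the grid."""
    return hash(value) % size


def map_phase(R, S, T, rows, cols):
    """Single map pass (assumed scheme): route each tuple to its reducer cell(s)."""
    grid = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, b in R:
        # R carries only b, so it is replicated along the whole row h(b).
        for j in range(cols):
            grid[(cell(b, rows), j)]["R"].append((a, b))
    for b, c in S:
        # S carries both b and c, so it goes to exactly one cell.
        grid[(cell(b, rows), cell(c, cols))]["S"].append((b, c))
    for c, d in T:
        # T carries only c, so it is replicated along the whole column h(c).
        for i in range(rows):
            grid[(i, cell(c, cols))]["T"].append((c, d))
    return grid


def reduce_cell(bucket):
    """Join the three inputs of one grid cell locally."""
    r_by_b = defaultdict(list)
    for a, b in bucket["R"]:
        r_by_b[b].append(a)
    t_by_c = defaultdict(list)
    for c, d in bucket["T"]:
        t_by_c[c].append(d)
    out = []
    for b, c in bucket["S"]:
        for a, d in product(r_by_b.get(b, []), t_by_c.get(c, [])):
            out.append((a, b, c, d))
    return out


def three_way_join(R, S, T, rows=2, cols=2):
    """Run the map pass, then reduce every cell and collect the results."""
    grid = map_phase(R, S, T, rows, cols)
    result = []
    for bucket in grid.values():
        result.extend(reduce_cell(bucket))
    return sorted(result)


if __name__ == "__main__":
    R = [(1, 10), (2, 20)]
    S = [(10, 100), (20, 200), (30, 300)]
    T = [(100, "x"), (200, "y"), (200, "z")]
    for row in three_way_join(R, S, T):
        print(row)
```

Because each S tuple lands in exactly one cell, every joined row is produced once, and no intermediate two-table result has to be written out and reshuffled as it would be in a serial cascade. The cost is that R and T tuples are replicated across a row or column of the grid, so the grid shape trades replication against parallelism.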
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal
2014, Issue 5, pp. 945-950 (6 pages)
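Funding
Supported by the Natural Science Foundation of the Education Department of Henan Province (2011B520035)
Supported by the Key Science and Technology Research Project of the Education Department of Henan Province (13A520651)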