Abstract
Equi-joins between tables are among the most common operations in statistical analysis queries, but they are expensive, and in big data environments equi-joins between large tables are even less efficient. To address this problem, this paper proposes a method for optimizing the two-table equi-join process on Spark. First, it builds a Bloom filter based on the value-density characteristics of the data and uses it to filter the tables. Second, combining the advantages of semi-join and partition join, it splits the filtered table on one side using a greedy algorithm. Finally, it joins the resulting subsets, so that the join of two large tables becomes a sequence of staged joins between two small tables. Cost analysis and experimental results show that, compared with existing Spark-based join operations, the proposed algorithm not only improves performance but is also less affected by data skew.
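The three steps named in the abstract (Bloom filter pruning, greedy splitting of one side, staged joins of the subsets) can be illustrated with a minimal single-machine sketch. This is not the paper's Spark implementation: the abstract does not give the splitting criterion, the filter parameters, or which side is split, so the names `BloomFilter`, `greedy_split`, and `staged_join`, the filter sizes, and the "heaviest key to lightest part" rule are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter, defaultdict


class BloomFilter:
    """Tiny Bloom filter over join keys (sizes are illustrative assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # May return false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(key))


def greedy_split(rows, num_parts):
    """Assumed greedy rule: heaviest key first, into the lightest part so far."""
    counts = Counter(key for key, _ in rows)
    loads = [0] * num_parts
    owner = {}
    for key, n in counts.most_common():
        target = loads.index(min(loads))
        owner[key] = target
        loads[target] += n
    parts = [[] for _ in range(num_parts)]
    for key, value in rows:
        parts[owner[key]].append((key, value))
    return parts


def staged_join(left, right, num_parts=2):
    """Equi-join two lists of (key, value) rows via filter, split, and join."""
    # Step 1: a Bloom filter built on each side prunes the other side.
    left_bf, right_bf = BloomFilter(), BloomFilter()
    for key, _ in left:
        left_bf.add(key)
    for key, _ in right:
        right_bf.add(key)
    left_f = [row for row in left if row[0] in right_bf]
    right_f = [row for row in right if row[0] in left_bf]

    result = []
    # Step 2: split only the (filtered) left side into balanced subsets.
    for part in greedy_split(left_f, num_parts):
        # Step 3: one small hash join per subset; exact key matching here
        # discards any Bloom filter false positives.
        index = defaultdict(list)
        for key, value in part:
            index[key].append(value)
        for key, rvalue in right_f:
            for lvalue in index.get(key, ()):
                result.append((key, lvalue, rvalue))
    return result
```

Because all rows sharing a key land in the same subset, each stage can be joined independently, and spreading heavy keys across subsets is one plausible way a greedy split could soften data skew.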
Authors
张子栋
郑延斌
Zhang Zidong; Zheng Yanbin (College of Computer Engineering, Jimei University, Xiamen, Fujian 361021, China; College of Computer & Information Technology, Henan Normal University, Xinxiang, Henan 453007, China)
Source
《计算机应用研究》
CSCD
PKU Core Journal
2019, Issue 2, pp. 486-489 (4 pages)
Application Research of Computers
Funding
Henan Province Science and Technology Research Project (132102210537, 132102210538)
Henan Province Soft Science Project (142400411001)