
Optimization of the Two-Table Equi-Join Process Based on Spark (cited by: 1)
Abstract: Equi-joins between tables are among the most common operations in statistical analysis queries, but they are expensive, and in big data environments equi-joins between large tables are even less efficient. To address this problem, this paper proposes a method for optimizing the two-table equi-join process on Spark. First, a Bloom filter is built according to the value-density characteristics of the data and used to filter the tables. Second, combining the advantages of semi-join and partition join, the filtered table on one side is split using a greedy algorithm. Finally, the resulting subsets are joined, so that the join of two large tables becomes a staged sequence of joins between two small tables. Cost analysis and experimental results show that, compared with existing Spark-based join operations, the algorithm not only improves performance but is also less affected in efficiency when data skew occurs.
Authors: Zhang Zidong; Zheng Yanbin (College of Computer Engineering, Jimei University, Xiamen, Fujian 361021, China; College of Computer & Information Technology, Henan Normal University, Xinxiang, Henan 453007, China)
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2019, No. 2, pp. 486-489 (4 pages)
Funding: Henan Province Science and Technology Research Project (132102210537, 132102210538); Henan Province Soft Science Project (142400411001)
Keywords: Spark; equi-join; big data; optimization; splitting
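
The three stages named in the abstract (Bloom-filter filtering, greedy splitting of one side, and subset-by-subset joining) can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation and does not use Spark. All names (`BloomFilter`, `bloom_filter_rows`, `greedy_split`, `staged_equi_join`) are hypothetical, rows are assumed to be `(key, value)` pairs, the filter is applied to both tables, and the greedy rule shown (assign the heaviest key to the currently lightest subset) is an assumption, since the abstract does not state the splitting criterion.

```python
import hashlib
from collections import Counter, defaultdict


class BloomFilter:
    """Tiny Bloom filter over string-able keys (illustrative, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def bloom_filter_rows(rows, other_rows):
    """Keep rows whose join key might occur in other_rows (semi-join style)."""
    bf = BloomFilter()
    for key, _ in other_rows:
        bf.add(key)
    return [row for row in rows if bf.might_contain(row[0])]


def greedy_split(rows, num_parts):
    """Assumed rule: give each key, heaviest first, to the lightest subset."""
    freq = Counter(key for key, _ in rows)
    loads = [0] * num_parts
    owner = {}
    for key, count in freq.most_common():
        target = loads.index(min(loads))
        owner[key] = target
        loads[target] += count
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[owner[row[0]]].append(row)
    return parts


def staged_equi_join(left, right, num_parts=2):
    """Filter both sides, split the left side, then join subset by subset."""
    left_f = bloom_filter_rows(left, right)
    right_f = bloom_filter_rows(right, left)
    right_index = defaultdict(list)
    for key, value in right_f:
        right_index[key].append(value)
    result = []
    for part in greedy_split(left_f, num_parts):
        for key, lval in part:
            # Exact key lookup discards any Bloom-filter false positives.
            for rval in right_index.get(key, []):
                result.append((key, lval, rval))
    return result
```

Because the final lookup matches keys exactly, Bloom-filter false positives only cost extra work and never change the join result. Splitting by key frequency keeps a heavily repeated key from overloading one subset, which is one plausible reading of why the abstract reports reduced sensitivity to data skew.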
