Optimization of the Two-Table Equi-Join Process Based on Spark (cited by: 1)

Abstract: In statistical analysis queries, the equi-join between tables is one of the most commonly used operations, but it is expensive. In big data environments, equi-joins between large tables are even less efficient. To address this problem, this paper proposes a Spark-based method for optimizing the two-table equi-join process. First, a Bloom filter is built according to the value-density characteristics of the data and used to filter the tables. Second, combining the advantages of semi-join and partition join, the filtered table on one side is split with a greedy algorithm. Finally, the resulting subsets are joined, so that the join of two large tables is converted into a staged sequence of joins between two small tables. Cost analysis and experimental results show that, compared with existing Spark-based join operations, the proposed algorithm improves performance and that its efficiency is less affected when data skew occurs.
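The three steps in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation: the abstract gives no algorithmic details, so the Bloom filter parameters, the "largest key group to the lightest part" greedy rule, and the names `BloomFilter`, `greedy_split`, and `staged_equi_join` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Toy Bloom filter; sizes are illustrative assumptions, not from the paper."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def greedy_split(rows, num_parts):
    """Assumed greedy rule: place each key group, largest first, in the lightest part."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    parts = [[] for _ in range(num_parts)]
    for key in sorted(groups, key=lambda k: len(groups[k]), reverse=True):
        min(parts, key=len).extend(groups[key])
    return parts


def staged_equi_join(left, right, num_parts=2):
    """Join (key, value) rows of two tables in stages, one small sub-join per part."""
    # Step 1: build a Bloom filter on the left keys and drop right rows that cannot match.
    bloom = BloomFilter()
    for key, _ in left:
        bloom.add(key)
    right_small = [row for row in right if row[0] in bloom]

    joined = []
    # Step 2: split the filtered side; Step 3: hash-join each subset separately.
    for part in greedy_split(right_small, num_parts):
        part_keys = {key for key, _ in part}
        left_index = defaultdict(list)
        for key, value in left:
            if key in part_keys:
                left_index[key].append(value)
        for key, rvalue in part:
            for lvalue in left_index.get(key, ()):
                joined.append((key, lvalue, rvalue))
    return joined
```

Because a Bloom filter can return false positives but never false negatives, the filter only removes rows that are certain not to match; any false positive that slips through is still discarded by the exact key comparison in the per-part hash join. Keeping all rows of one key in the same part means each sub-join only needs the left rows for that part's keys.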

Authors: Zhang Zidong; Zheng Yanbin (College of Computer Engineering, Jimei University, Xiamen, Fujian 361021, China; College of Computer & Information Technology, Henan Normal University, Xinxiang, Henan 453007, China)

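Affiliations: College of Computer Engineering, Jimei University; College of Computer and Information Engineering, Henan Normal University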

Source: Application Research of Computers (计算机应用研究), indexed in CSCD and the Peking University Core Journals list, 2019, No. 2, pp. 486-489 (4 pages)

Funding: Henan Province Science and Technology Research Project (132102210537, 132102210538); Henan Province Soft Science Project (142400411001)

Keywords: Spark; equi-join; big data; optimization; splitting

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

