Spark环境下基于多维布隆过滤器的星型连接算法被引量：1

Star join algorithm based on multi-dimensional Bloom filter in Spark

下载PDF

导出

摘要为了适应联机分析处理(OLAP)系统中实时数据高性能分析需求不断提高的需求,提出一种能够适合Spark环境并结合多维Bloom Filter(MDBF)的星型连接算法SMDBFSJ。首先,根据多个维表构建MDBF,利用其占用空间小的特点,广播到所有节点;然后,在本地节点完成事实表过滤操作,事实表不需要在节点间移动数据;最后,过滤后的事实表与维表采用重划分方式进行连接,进而得到最终结果。SMDBFSJ算法避免了事实表数据移动,通过MDBF减小了需要广播的数据量,充分结合了广播连接和重划分连接的优势。实验结果表明了该算法的有效性,在单机和集群环境下,该算法相比重划分连接均获得了3倍左右的性能提升。 To meet the high performance analysis requirements for real-time data in On-Line Analytical Processing（ OLAP） system, a new star join algorithm which is suitable for Spark platform was proposed based on Multi-Dimensional Bloom Filter（ MDBF）, namely SMDBFSJ（ Spark Multi-Dimensional Bloom Filter Star Join）. First of all, MDBF was built according to the dimension tables and broadcasted to all the nodes based on the feature of small size. Then the fact table was filtered completely on the local node, and there was no data movement between nodes. Finally, the filtered fact table and dimension tables were joined using repartition join model to get the final result. SMDBFSJ algorithm avoides the data moving of fact table, reduces the size of broadcasting data using MDBF, as well as fully combines the advantages of broadcast join and repartition join. The experimental results prove the validity of SMDBFSJ algorithm, in stand-alone and cluster environments.SMDBFSJ algorithm can obtain about three times of performance improvement compared with repartition join in Spark.

作者周国亮萨初日拉朱永利

机构地区国网冀北电力有限公司技能培训中心华北电力大学控制与计算机工程学院

出处《计算机应用》 CSCD 北大核心 2016年第2期353-357,共5页 journal of Computer Applications

基金中央高校基本科研业务费专项资金资助项目(13MS103) 河北省自然科学基金资助项目(F2014502069)~~

关键词布隆过滤器星型连接联机分析处理 SPARK Bloom filter star join On-Line Analytical Processing（OLAP） Spark

分类号 TP393.027 [自动化与计算机技术—计算机应用技术]

引文网络
相关文献

参考文献14

1LI F, OZSU M T, CHEN G, et al. R-store: a scalable distributed system for supporting real-time analytics[C]//ICDE 2014: Proceedings of the IEEE 30th International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 2014: 40-51.
2ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//NSDI 2012: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation. New York: ACM, 2012: 15-28.
3GU R, WANG S, WANG F, et al. Cichlid: efficient large scale RDFS/OWL reasoning with Spark[C]//IPDPS 2015: Proceedings of the IEEE 29th International Parallel & Distributed Processing Symposium. Piscataway: IEEE, 2015: 700-709.
4QIU H, GU R, YUAN C, et al. YAFIM: a parallel frequent itemset mining algorithm with Spark[C]//IPDPSW 2014: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. Piscataway: IEEE, 2014: 1664-1671.
5张磊,方祝和,周敏奇,黄岚.面向内存计算的连接算法[J].华东师范大学学报（自然科学版）,2014(5):180-191. 被引量：6
6HACIGüMü? H, IYER B, MEHROTRA S. Providing database as a service[C]//ICDE 2002: Proceedings of the 2002 IEEE International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 2002: 29-38.
7卞昊穹,陈跃国,杜小勇,高彦杰.Spark上的等值连接优化[J].华东师范大学学报（自然科学版）,2014(5):263-270. 被引量：12
8HAN H, JUNG H, EOM H, et al. Scatter-Gather-Merge: an efficient star-join query processing algorithm for data-parallel frameworks[J]. Cluster Computing, 2011, 14(2): 183-197.
9ZHOU G, ZHU Y, WANG G. Cache conscious star-join in MapReduce environments[C]//Cloud-I '13: Proceedings of the 2nd International Workshop on Cloud Intelligence. New York: ACM, 2013: Article No. 1.
10LIN Y, AGRAWAL D, CHEN C, et al. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework[C]//SIGMOD '11: Proceedings of the 2011 ACM Conference on Management of Data. New York: ACM, 2011: 961-972.

二级参考文献44

1LIN TAN, SHERWOOD T. A high throughput string matching architecture for intrusion detection and prevention [ C]// ISCA'05: 32nd International Symposium on Computer Architecture. Washington, DC: IEEE, 2005: 112-122.
2DHARPMAPURIKAR S, LOCKWOOD J W. Fast and scalable pattern matching for network intrusion detection systems[J]. IEEE Journal on Selected Areas in Communications, 2006, 24(10) : 1781 - 1792.
3DHARPMAPURIKAR S, KRISHNAMURTHY P, SPROULI. T, et al. Deep packet inspection using parallel bloom filters [ J]. IEEE Micro, 2004, 24(1): 52-61.
4GUO D, WU JIECHEN, HONGHUI, et al. Theory and network applications of dynamic bloom filters [ C]// INFOCOM 2006: 25th IEEE International Conference on Computer Communications. Washington, DC: IEEE, 2006: 1- 12.
5普拉特纳H,蔡尔A.内存数据管理[M].SAP译.1版.北京:清华大学出版社,2013.
6ZHOU J,ROSS K A.Implementing database operations using SIMD instructions[C]//Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data.ACM,2002:145-156.
7BALKESEN C,ALONSO G,OZSU M.Multi-core,main-memory joins:Sort vs.hash revisited[J].Proceedings of the VLDB Endowment,2013,7(1):85-96.
8NYBERG C,BARCLAY T,CVETANOVIC Z,et al.AlphaSort:A RISC machine sort[C]//ACM SIGMOD Record.ACM,1994,23(2):233-242.
9BALKESEN C,TEUBNER J,ALONSO G,et al.Main-memory hash joins on multi-core CPUs:Tuning to the underlying hardware[C]//2013 IEEE 29th International Conference on data Engineering (ICDE) IEEE,2013:362-373.
10ALBUTIU M C,KEMPER A,NEUMANN T.Massively parallel sort-merge joins in main memory multi-core database systems[J].Proceedings of the VLDB Endowment,2012,5(10):1064-1075.

共引文献22

1王伟晨.基于布隆过滤器算法的数据检索误判率研究[J].计算机产品与流通,2020,0(3):260-261. 被引量：2
2冯锋,吴杰.基于Bloom filter的RFID中间件数据过滤算法研究[J].计算机应用研究,2015,32(5):1441-1444. 被引量：7
3卓可秋,童国平,虞为.一种基于Spark的论文相似性快速检测方法[J].图书情报工作,2015,59(11):134-142. 被引量：2
4王诏远,王宏杰,邢焕来,李天瑞.基于Spark的蚁群优化算法[J].计算机应用,2015,35(10):2777-2780. 被引量：23
5王桂兰,周国亮,萨初日拉,朱永利.Spark环境下的并行模糊C均值聚类算法[J].计算机应用,2016,36(2):342-347. 被引量：11
6胡忠奎,屈波,黄斌,黎文阳.一种基于虚拟处理区间划分的负载均衡等值连接算法[J].现代计算机,2016,0(2):3-7.
7王天立,黄海.一种基于最优子树的数据库多连接查询优化混合算法[J].莆田学院学报,2016,23(2):46-49.
8刘年国,王芬,吴家奇,李雪,陶涛.基于Counting Bloom Filter的海量网页快速去重研究[J].安徽电气工程职业技术学院学报,2016,21(3):92-97.
9朱坤,张娜娜,朱丹丹.基于Hadoop平台的布隆过滤分布式并行算法设计与实现[J].互联网天地,2016(10):69-73.
10林晓佳.海量网页模式下的特定数据快速挖掘模型的改进[J].宁德师范学院学报（自然科学版）,2016,28(4):384-389.

同被引文献4

1王晓军,孙惠.基于MapReduce的多路连接优化方法研究[J].计算机技术与发展,2013,23(6):59-62. 被引量：5
2孙莉,李静,刘国华.列存储数据查询中的连接策略优化方法[J].计算机研究与发展,2013,50(8):1647-1656. 被引量：2
3汪璟玢,彭志星.多路空间连接优化算法研究[J].小型微型计算机系统,2013,34(11):2431-2436. 被引量：2
4卞昊穹,陈跃国,杜小勇,高彦杰.Spark上的等值连接优化[J].华东师范大学学报（自然科学版）,2014(5):263-270. 被引量：12

引证文献1

1乔百友,朱俊海,郑宇杰,申木川,王国仁.一种基于Spark的多路空间连接查询处理算法[J].计算机研究与发展,2017,54(7):1592-1602.

1吴小蓉.ARP攻击的快速抵御[J].网管员世界,2008(24):93-94.
2梁义海,乔卫民,王彦瑜,敬岚.串口通信星型连接的CPLD实现[J].单片机与嵌入式系统应用,2003,3(11):24-26. 被引量：2
3刘威,郭渊博,黄鹏.基于多维布隆过滤器的模式匹配引擎[J].计算机应用,2011,31(1):107-109. 被引量：8
4潘昊,鄂海红,宋美娜.布隆过滤器在网页消重中的应用[J].软件,2015,36(12):166-170. 被引量：2
5谷岩.基于三层结构的联机分析处理(OLAP)系统的研究与设计[J].计算机应用与软件,2005,22(3):80-82. 被引量：4
6焦敏,张延松,王珊,陈红.内存OLAP多核并行查询优化技术研究[J].计算机学报,2014,37(9):1895-1910. 被引量：3
7朱巧明,李培峰,吕强,杨季文.汉语属性库OLAP系统的设计[J].中文信息学报,1998,12(3):44-49. 被引量：3
8谷岩.基于数据仓库技术的医院信息系统(HIS)的实现方案研究[J].计算机系统应用,2005,14(7):5-8. 被引量：3
9李爱凤,陈启买.基于数据仓库技术的教务管理系统(EAS)[J].现代计算机,2007,13(2):67-69. 被引量：1
10解晨光,刘明刚.并行框架下基于位图索引的多表星型连接算法[J].计算机工程与设计,2014,35(9):3107-3112.

计算机应用

2016年第2期

浏览历史

内容加载中请稍等...

Spark环境下基于多维布隆过滤器的星型连接算法被引量：1

参考文献14

二级参考文献44

共引文献22

同被引文献4

引证文献1

相关作者

相关机构

相关主题

浏览历史

Spark环境下基于多维布隆过滤器的星型连接算法 被引量：1

参考文献14

二级参考文献44

共引文献22

同被引文献4

引证文献1

相关作者

相关机构

相关主题

浏览历史

Spark环境下基于多维布隆过滤器的星型连接算法被引量：1