期刊文献+

Spark环境下基于多维布隆过滤器的星型连接算法 被引量:1

Star join algorithm based on multi-dimensional Bloom filter in Spark
下载PDF
导出
摘要 为了适应联机分析处理(OLAP)系统中实时数据高性能分析需求不断提高的需求,提出一种能够适合Spark环境并结合多维Bloom Filter(MDBF)的星型连接算法SMDBFSJ。首先,根据多个维表构建MDBF,利用其占用空间小的特点,广播到所有节点;然后,在本地节点完成事实表过滤操作,事实表不需要在节点间移动数据;最后,过滤后的事实表与维表采用重划分方式进行连接,进而得到最终结果。SMDBFSJ算法避免了事实表数据移动,通过MDBF减小了需要广播的数据量,充分结合了广播连接和重划分连接的优势。实验结果表明了该算法的有效性,在单机和集群环境下,该算法相比重划分连接均获得了3倍左右的性能提升。 To meet the high performance analysis requirements for real-time data in On-Line Analytical Processing( OLAP) system, a new star join algorithm which is suitable for Spark platform was proposed based on Multi-Dimensional Bloom Filter( MDBF), namely SMDBFSJ( Spark Multi-Dimensional Bloom Filter Star Join). First of all, MDBF was built according to the dimension tables and broadcasted to all the nodes based on the feature of small size. Then the fact table was filtered completely on the local node, and there was no data movement between nodes. Finally, the filtered fact table and dimension tables were joined using repartition join model to get the final result. SMDBFSJ algorithm avoides the data moving of fact table, reduces the size of broadcasting data using MDBF, as well as fully combines the advantages of broadcast join and repartition join. The experimental results prove the validity of SMDBFSJ algorithm, in stand-alone and cluster environments.SMDBFSJ algorithm can obtain about three times of performance improvement compared with repartition join in Spark.
出处 《计算机应用》 CSCD 北大核心 2016年第2期353-357,共5页 journal of Computer Applications
基金 中央高校基本科研业务费专项资金资助项目(13MS103) 河北省自然科学基金资助项目(F2014502069)~~
关键词 布隆过滤器 星型连接 联机分析处理 SPARK Bloom filter star join On-Line Analytical Processing(OLAP) Spark
  • 相关文献

参考文献14

  • 1LI F, OZSU M T, CHEN G, et al. R-store: a scalable distributed system for supporting real-time analytics[C]//ICDE 2014: Proceedings of the IEEE 30th International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 2014: 40-51.
  • 2ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//NSDI 2012: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation. New York: ACM, 2012: 15-28.
  • 3GU R, WANG S, WANG F, et al. Cichlid: efficient large scale RDFS/OWL reasoning with Spark[C]//IPDPS 2015: Proceedings of the IEEE 29th International Parallel & Distributed Processing Symposium. Piscataway: IEEE, 2015: 700-709.
  • 4QIU H, GU R, YUAN C, et al. YAFIM: a parallel frequent itemset mining algorithm with Spark[C]//IPDPSW 2014: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. Piscataway: IEEE, 2014: 1664-1671.
  • 5张磊,方祝和,周敏奇,黄岚.面向内存计算的连接算法[J].华东师范大学学报(自然科学版),2014(5):180-191. 被引量:6
  • 6HACIGüMü? H, IYER B, MEHROTRA S. Providing database as a service[C]//ICDE 2002: Proceedings of the 2002 IEEE International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 2002: 29-38.
  • 7卞昊穹,陈跃国,杜小勇,高彦杰.Spark上的等值连接优化[J].华东师范大学学报(自然科学版),2014(5):263-270. 被引量:12
  • 8HAN H, JUNG H, EOM H, et al. Scatter-Gather-Merge: an efficient star-join query processing algorithm for data-parallel frameworks[J]. Cluster Computing, 2011, 14(2): 183-197.
  • 9ZHOU G, ZHU Y, WANG G. Cache conscious star-join in MapReduce environments[C]//Cloud-I '13: Proceedings of the 2nd International Workshop on Cloud Intelligence. New York: ACM, 2013: Article No. 1.
  • 10LIN Y, AGRAWAL D, CHEN C, et al. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework[C]//SIGMOD '11: Proceedings of the 2011 ACM Conference on Management of Data. New York: ACM, 2011: 961-972.

二级参考文献44

  • 1LIN TAN, SHERWOOD T. A high throughput string matching architecture for intrusion detection and prevention [ C]// ISCA'05: 32nd International Symposium on Computer Architecture. Washington, DC: IEEE, 2005: 112-122.
  • 2DHARPMAPURIKAR S, LOCKWOOD J W. Fast and scalable pattern matching for network intrusion detection systems[J]. IEEE Journal on Selected Areas in Communications, 2006, 24(10) : 1781 - 1792.
  • 3DHARPMAPURIKAR S, KRISHNAMURTHY P, SPROULI. T, et al. Deep packet inspection using parallel bloom filters [ J]. IEEE Micro, 2004, 24(1): 52-61.
  • 4GUO D, WU JIECHEN, HONGHUI, et al. Theory and network applications of dynamic bloom filters [ C]// INFOCOM 2006: 25th IEEE International Conference on Computer Communications. Washington, DC: IEEE, 2006: 1- 12.
  • 5普拉特纳H,蔡尔A.内存数据管理[M].SAP译.1版.北京:清华大学出版社,2013.
  • 6ZHOU J,ROSS K A.Implementing database operations using SIMD instructions[C]//Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data.ACM,2002:145-156.
  • 7BALKESEN C,ALONSO G,OZSU M.Multi-core,main-memory joins:Sort vs.hash revisited[J].Proceedings of the VLDB Endowment,2013,7(1):85-96.
  • 8NYBERG C,BARCLAY T,CVETANOVIC Z,et al.AlphaSort:A RISC machine sort[C]//ACM SIGMOD Record.ACM,1994,23(2):233-242.
  • 9BALKESEN C,TEUBNER J,ALONSO G,et al.Main-memory hash joins on multi-core CPUs:Tuning to the underlying hardware[C]//2013 IEEE 29th International Conference on data Engineering (ICDE) IEEE,2013:362-373.
  • 10ALBUTIU M C,KEMPER A,NEUMANN T.Massively parallel sort-merge joins in main memory multi-core database systems[J].Proceedings of the VLDB Endowment,2012,5(10):1064-1075.

共引文献22

同被引文献4

引证文献1

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部