
Optimization on Distributed Stream Data Loading and Querying

Cited by: 7
Abstract: Distributed stream querying is a real-time query computation method based on data streams that has attracted wide attention and developed rapidly in recent years. This paper surveys research results on distributed stream processing frameworks for real-time relational queries, and analyzes and compares products for distributed data loading, distributed stream computing, and distributed stream querying. It proposes a distributed stream query model built on Spark Streaming and Apache Kafka, in which multiple file sources are loaded concurrently and an in-memory file system is used for fast data loading, achieving more than double the loading speed of the Apache Flume-based approach. On top of Spark Streaming, a distributed stream query interface based on Spark SQL is implemented, and a method that parses SQL statements with hand-written code is proposed to perform distributed queries. Test results show that for complex query statements, the hand-coded SQL parsing approach has a clear advantage in query efficiency.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2017, No. 5, pp. 172-177 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (61271275, 61202067)
Keywords: Big data; Stream processing system; Distributed stream query; Query optimization; Kafka fast loading
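
The abstract describes loading multiple file sources concurrently into a message queue before stream querying. The following is only a minimal sketch of that general idea, not the authors' implementation: a standard-library `queue.Queue` stands in for a Kafka topic, a temporary directory stands in for the in-memory file system, and the names `load_file`, `load_concurrently`, `drain`, and `demo` are hypothetical.

```python
import queue
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_file(path, topic):
    """Read one file source and publish each non-empty line to the shared topic."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = line.rstrip("\n")
            if record:
                # Tag each record with its source file name.
                topic.put((Path(path).name, record))
                count += 1
    return count


def load_concurrently(paths, topic, workers=4):
    """Load all file sources in parallel; return the total records published."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: load_file(p, topic), paths))


def drain(topic):
    """Collect everything currently in the topic, like one micro-batch."""
    batch = []
    while True:
        try:
            batch.append(topic.get_nowait())
        except queue.Empty:
            return batch


def demo():
    """Write three small sources to a temp directory and load them concurrently."""
    topic = queue.Queue()  # assumption: stand-in for a Kafka topic
    with tempfile.TemporaryDirectory() as tmp:  # assumption: stand-in for a memory-backed directory
        paths = []
        for i in range(3):
            p = Path(tmp) / f"source_{i}.log"
            p.write_text("".join(f"{i},{j}\n" for j in range(5)), encoding="utf-8")
            paths.append(p)
        published = load_concurrently(paths, topic)
    return published, drain(topic)
```

Calling `demo()` returns the number of published records and the drained batch. In a real deployment the queue would presumably be replaced by a Kafka producer and the batch consumed by a streaming job, but those details are not given in this record.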
