Performance Study of Hadoop-Based SQL Query Engines (cited by: 8)

Research on SQL-on-Hadoop systems
Abstract: Apache Hadoop performs very well on very large data sets and has several advantages over traditional data warehouses and relational databases for storing and processing large volumes of data. To let existing business logic take advantage of Hadoop, SQL-on-Hadoop systems are attracting growing attention from both industry and academia. Many such systems exist, each with its own strengths, and their execution engines fall into three main categories: (1) traditional engines based on Map/Reduce; (2) newer engines based on Spark; and (3) MPP engines based on a shared-nothing architecture. This paper selects one representative SQL query engine from each category (Hive, Spark SQL, and Impala, respectively) and uses a TPC-H-like benchmark to test and evaluate their decision-support capability, measuring both efficiency and resource usage. The experimental results show that both Impala and Spark SQL are considerably faster than Hive. On some queries Impala is more than 10 times faster than Hive, and it also uses the least cluster resources (CPU and RAM) of the three systems to complete the queries. However, when stability, usability, compatibility, and performance are all taken into account, no single engine is best in every respect. When building a Hadoop-based data warehouse, a hybrid architecture of Hive+Impala or Hive+Spark SQL is therefore recommended.
Source: Journal of Central China Normal University (Natural Sciences), indexed in CAS and the Peking University Core Journals list, 2016, No. 2, pp. 174-182 (9 pages)
Funding: National Natural Science Foundation of China (61272112, 61472287); Key Program of the Natural Science Foundation of Hubei Province (2015CFA068)
Keywords: big data; SQL-on-Hadoop; data warehouse; Spark SQL; Impala; Hive