Abstract
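Hive, a data warehouse built on Hadoop, has become the first choice of many enterprises for processing big data. However, a large number of legacy applications in traditional enterprises depend on traditional relational databases (RDBMS), and migrating them requires translating a large number of query statements. This paper proposes an automatic, query-tree-based method for translating SQL into HiveQL. The method uses a SQL parser to parse an SQL statement into a query tree, provides eight different rewriting strategies to restructure the query tree, and then converts the tree into a correct HiveQL statement. The method is implemented in a translation tool, DFMapper. Query experiments on the TPC-DS benchmark show that DFMapper can correctly translate the vast majority of query statements and is highly extensible.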
As the volume of data to be stored and processed grows, the traditional RDBMS encounters performance bottlenecks. Hive, a data warehouse built on Hadoop that provides data analysis and summarization, has become the first choice of many enterprises for handling big data because of its massive scale-out and fault-tolerance capabilities. In traditional enterprises in particular, a wide variety of legacy applications depend on the traditional RDBMS. When these applications are migrated to Hive, a large number of queries must be translated, and doing so manually costs a great deal of labor and time. This paper proposes a query-tree-based approach for automatically translating RDBMS SQL into correct HiveQL. A SQL parser parses each SQL statement into a query tree, which is supplemented during preprocessing with the correspondence between tables and columns. To handle set operations, correlated subqueries, and other structures that HiveQL supports only weakly, the paper proposes eight rewriting strategies that restructure the query tree, which is then transformed into a HiveQL statement. The approach is implemented in a translation tool called DFMapper, which provides a strategy loader that dynamically adjusts the strategies in use according to actual requirements (e.g., the Hive version or the SQL dialect) by modifying an externalized configuration. In addition, a validator verifies the accuracy of translation by comparing the result sets of queries executed in the RDBMS and in Hive. Experiments on the TPC-DS benchmark, which comprises 99 queries covering a variety of ANSI SQL syntax, show that DFMapper correctly translates the vast majority of general queries and is highly extensible.
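The pipeline described in the abstract (query tree, configurable rewriting strategies, HiveQL output) can be illustrated with a minimal Python sketch. It is not DFMapper's implementation: the node classes, the `intersect_to_semi_join` strategy, the `STRATEGIES` registry, and the `translate` function are all hypothetical names chosen for illustration, the tree is built by hand instead of by a SQL parser, and the single strategy shown is only one plausible way to rewrite a set operation for a Hive version that lacks `INTERSECT`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Select:
    """Leaf query: SELECT columns FROM table."""
    columns: List[str]
    table: str


@dataclass
class SetOp:
    """Set operation between two sub-queries, e.g. INTERSECT."""
    op: str
    left: "Node"
    right: "Node"


@dataclass
class SemiJoin:
    """Left sub-query filtered by a LEFT SEMI JOIN against the right one."""
    left: "Node"
    right: "Node"


Node = Union[Select, SetOp, SemiJoin]


def output_columns(node: Node) -> List[str]:
    """Column names produced by a node (the table/column correspondence)."""
    return node.columns if isinstance(node, Select) else output_columns(node.left)


def intersect_to_semi_join(node: Node) -> Node:
    """Hypothetical strategy: replace every INTERSECT with a semi join.

    Caveat: join equality does not match NULLs, whereas INTERSECT treats
    NULLs as equal; a production rewrite would have to handle that.
    """
    if isinstance(node, Select):
        return node
    left = intersect_to_semi_join(node.left)
    right = intersect_to_semi_join(node.right)
    if isinstance(node, SetOp) and node.op != "INTERSECT":
        return SetOp(node.op, left, right)
    return SemiJoin(left, right)


def render(node: Node) -> str:
    """Serialize a query tree to a query string."""
    if isinstance(node, Select):
        return f"SELECT {', '.join(node.columns)} FROM {node.table}"
    if isinstance(node, SetOp):
        return f"{render(node.left)} {node.op} {render(node.right)}"
    # INTERSECT matches columns by position, so pair them positionally.
    pairs = zip(output_columns(node.left), output_columns(node.right))
    on = " AND ".join(f"l.{a} = r.{b}" for a, b in pairs)
    cols = ", ".join(f"l.{c}" for c in output_columns(node.left))
    return (f"SELECT DISTINCT {cols} FROM ({render(node.left)}) l "
            f"LEFT SEMI JOIN ({render(node.right)}) r ON {on}")


# Registry standing in for a strategy loader driven by external configuration.
STRATEGIES: Dict[str, Callable[[Node], Node]] = {
    "intersect_to_semi_join": intersect_to_semi_join,
}


def translate(tree: Node, enabled: List[str]) -> str:
    """Apply the enabled strategies in order, then render the tree."""
    for name in enabled:
        tree = STRATEGIES[name](tree)
    return render(tree)


if __name__ == "__main__":
    query = SetOp("INTERSECT", Select(["id"], "orders"), Select(["id"], "returns"))
    print(translate(query, ["intersect_to_semi_join"]))
```

Keeping each strategy as a pure tree-to-tree function makes it straightforward to enable or disable strategies per target Hive version from a configuration list, which is the role the abstract assigns to the strategy loader.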
Authors
张成博
虞慧群
郭健美
杨定裕
范贵生
ZHANG Chengbo; YU Huiqun; GUO Jianmei; YANG Dingyu; FAN Guisheng (Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China; Department of System Software, Alibaba Group, Hangzhou 311121, China; Shanghai Dianji University, Shanghai 203106, China)
Source
Journal of East China University of Science and Technology (Natural Science Edition)
CAS
CSCD
PKU Core Journals
2019, Issue 1, pp. 148-155 (8 pages)
Journal of East China University of Science and Technology
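Funding
National Natural Science Foundation of China (61702334, 61772200)
Shanghai Special Fund for Informatization Development (201602008)
Shanghai Pujiang Talent Program (17PJ1401900)
Natural Science Foundation of Shanghai (17ZR1406900, 17ZR1429700)
Education Research Fund of East China University of Science and Technology (ZH1726108)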