Abstract
Large-scale data analysis commonly executes queries on the MapReduce framework. Because MapReduce itself introduces redundancy and queries often overlap, reusing the results of previously executed queries can substantially improve query execution efficiency. Reuse, however, requires storing those results and matching new queries against them, which incurs considerable system overhead and offsets part of the benefit. Taking ReStore, a state-of-the-art query result reuse system, as the target, this paper addresses its inefficiency in managing and matching query results: a forest-structured Job storage management technique and a corresponding matching algorithm are proposed to improve matching efficiency and reduce system overhead. So that the system can fully reuse the results of executed queries, a preprocessing scheme for multiple queries is also proposed. It changes the order in which queries enter the Pig compiler, and hence the order in which Jobs execute, so that Jobs loading the same datasets are executed together and fewer matches against the repository are needed. Experiments show that, in building the storage structure and matching existing results, the proposed method reduces time cost by 16.3% compared with ReStore and also scales better.
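The preprocessing idea in the abstract, reordering queries so that those operating on the same datasets are submitted next to each other, can be illustrated with a minimal sketch. The abstract does not specify how proximity between queries is measured, so this sketch assumes the simplest possible stand-in: two queries are grouped only when they load exactly the same set of datasets. The function name `order_queries_by_dataset` and the `(query_id, datasets)` input shape are hypothetical and are not taken from the paper or from Pig.

```python
def order_queries_by_dataset(queries):
    """Reorder queries so that those loading the same datasets are adjacent.

    queries: list of (query_id, iterable of dataset names).
    Assumption: grouping is by exact equality of the dataset set, which is
    only a stand-in for the paper's (unspecified) proximity measure.
    """
    groups = {}  # dataset-set key -> query ids, in arrival order
    for query_id, datasets in queries:
        key = tuple(sorted(set(datasets)))
        groups.setdefault(key, []).append(query_id)

    # dicts preserve insertion order, so groups appear in the order their
    # first query arrived, and queries keep arrival order within a group.
    ordered = []
    for query_ids in groups.values():
        ordered.extend(query_ids)
    return ordered


if __name__ == "__main__":
    batch = [
        ("q1", ["logs"]),
        ("q2", ["users"]),
        ("q3", ["logs"]),
        ("q4", ["users", "logs"]),
    ]
    print(order_queries_by_dataset(batch))
```

In this ordering, queries over the same inputs would reach the compiler back to back, which is the condition the abstract relies on to reduce the number of matches against the repository. A real scheme would likely also need to handle partially overlapping dataset sets.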
Authors
石霖
牛保宁
张锦文
SHI Lin; NIU Bao-ning; ZHANG Jin-wen (Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, China)
Source
《科学技术与工程》
Peking University Core Journal
2018, Issue 8, pp. 220-227 (8 pages)
Science Technology and Engineering
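Funding
National Natural Science Foundation of China (61572345)
National High Technology Research and Development Program of China ("863" Program) (2015BAH37F01)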