
Calculation Results Characteristics Extract and Reuse Strategy Based on Hive (基于Hive的计算结果特征提取与重用策略)

Cited by: 4
Abstract In existing MapReduce workflows, each job must materialize its results to HDFS (Hadoop Distributed File System) before the next job can read them, and the resulting disk I/O lowers efficiency. Building on Hive, a representative existing system, this paper extracts and stores the data characteristics of the results produced by MapReduce workflows and proposes a strategy for matching and reusing those results. First, structures such as the join graph (JoinGraph) and the Join-Object are defined from the query conditions and used to match reusable results. Based on these structures and the abstract syntax tree produced by the HiveQL (Hive Query Language) parser, an algorithm is proposed to generate the Join-Object of a query. A further method traverses the list of candidate Join-Objects to produce the best reuse plan, covering both single Join-Object and multiple Join-Object reuse. To raise the probability of reuse, three additional methods are proposed: multi-key selection, deferred arithmetic operations, and semantic understanding. Finally, experiments on the TPC-H and SSB data warehouse benchmarks verify that reusing computation results speeds up data processing. On TPC-H, efficiency improves by 28%-52% when a single Join-Object is reused and by up to 75% when multiple Join-Objects are reused, with an average improvement of 15.7% across all 22 queries. On SSB, efficiency improves by 40%-76%, 55% on average.
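The abstract does not give the formal definitions of the join graph or the Join-Object, so the following is only a minimal sketch of the general matching idea, under stated assumptions. It treats a stored result as reusable when its tables, join edges, and filters are all subsets of those in the new query, and it picks non-overlapping candidates greedily. The `JoinObject` fields, `is_reusable`, `plan_reuse`, and the textual comparison of predicates are hypothetical choices made for illustration, not the paper's algorithm. Multi-key selection, deferred arithmetic, and semantic understanding are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinObject:
    """Hypothetical summary of a query or of a materialized result."""
    name: str
    tables: frozenset   # table names involved
    joins: frozenset    # join edges; each edge is a frozenset of two columns
    filters: frozenset  # filter predicates, compared as plain strings


def edge(a, b):
    """An undirected equi-join edge between two 'table.column' names."""
    return frozenset((a, b))


def is_reusable(stored, query):
    """Assumed rule: the stored result must not join or filter more than the query."""
    return (stored.tables <= query.tables
            and stored.joins <= query.joins
            and stored.filters <= query.filters)


def plan_reuse(query, candidates):
    """Greedy sketch: prefer candidates covering more joins, keep tables disjoint."""
    chosen, covered = [], set()
    usable = [c for c in candidates if is_reusable(c, query)]
    for c in sorted(usable, key=lambda c: (-len(c.joins), c.name)):
        if covered.isdisjoint(c.tables):
            chosen.append(c)
            covered |= c.tables
    return chosen


# Illustrative query over TPC-H table names (the join keys are standard TPC-H columns).
lo = edge("lineitem.l_orderkey", "orders.o_orderkey")
oc = edge("orders.o_custkey", "customer.c_custkey")
cn = edge("customer.c_nationkey", "nation.n_nationkey")

query = JoinObject(
    "q",
    frozenset({"lineitem", "orders", "customer", "nation"}),
    frozenset({lo, oc, cn}),
    frozenset({"o_orderdate < '1995-03-15'"}),
)

filtered = JoinObject(
    "orders_customer_building",
    frozenset({"orders", "customer"}),
    frozenset({oc}),
    frozenset({"c_mktsegment = 'BUILDING'"}),  # filter absent from the query
)

candidates = [
    JoinObject("lineitem_orders", frozenset({"lineitem", "orders"}),
               frozenset({lo}), frozenset()),
    JoinObject("customer_nation", frozenset({"customer", "nation"}),
               frozenset({cn}), frozenset()),
    JoinObject("orders_customer", frozenset({"orders", "customer"}),
               frozenset({oc}), frozenset()),
    filtered,
]

plan = plan_reuse(query, candidates)
print([c.name for c in plan])
```

In this sketch the candidate with the extra `c_mktsegment` filter is rejected because it may have discarded rows the query needs, while two stored results with disjoint tables are combined, loosely mirroring the distinction between single and multiple Join-Object reuse. A real matcher would also have to check that the stored result keeps every column the query still needs.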
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2015, No. 9, pp. 2014-2024 (11 pages)
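Funding National Natural Science Foundation of China (61103046); Fundamental Research Funds for the Central Universities; Donghua University "Lizhi Program" (励志计划) (B201312)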
Keywords MapReduce; Hive; calculation results reuse; Join-Object; data management


