ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries

Cited by: 3
Abstract: Load imbalance ranks first among the factors that degrade the performance of large-scale MapReduce clusters, and Hive join queries trigger it very easily. The common solution is to design a key partitioning algorithm that balances reducer load using the frequency distribution (histogram) of the keys in the intermediate key-value pairs. Existing work estimates this distribution by monitoring and sampling map output in a distributed way, which incurs heavy communication overhead and notably delays the start of the shuffle. For Hive join queries, this paper proposes a key frequency distribution estimation method based on ORC metadata, together with a corresponding load-balancing key partitioning method. The approach requires only lightweight computation before the job starts, adds little communication overhead, and leaves the existing shuffle mechanism unchanged. Benchmark tests show a large improvement in the efficiency of key frequency distribution estimation, and show that the corresponding key partitioning method improves the performance of Hive join queries.
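To make the idea of histogram-driven key partitioning concrete, the sketch below compares a plain modulo (hash-style) assignment with a greedy "largest key first, least-loaded reducer" assignment. It is an illustration only and is not the partitioning algorithm from the paper: the function names `hash_partition` and `greedy_partition`, the integer join keys, and the sample histogram are all assumptions made for the example, and the histogram is simply given rather than estimated from ORC metadata.

```python
import heapq


def hash_partition(key_freq, num_reducers):
    """Baseline: send each (integer) key to reducer `key % num_reducers`.

    Returns the number of records each reducer would receive.
    """
    loads = [0] * num_reducers
    for key, freq in key_freq.items():
        loads[key % num_reducers] += freq
    return loads


def greedy_partition(key_freq, num_reducers):
    """Greedy heuristic: visit keys from most to least frequent and give
    each one to the reducer that currently has the smallest load.

    Returns (assignment, loads), where assignment maps key -> reducer index.
    """
    # Min-heap of (current_load, reducer_index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    loads = [0] * num_reducers
    # Ties in frequency are broken by key so the result is deterministic.
    for key, freq in sorted(key_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        loads[reducer] = load + freq
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads


# Hypothetical skewed key histogram: join key -> estimated record count.
sample_histogram = {0: 900, 1: 50, 2: 40, 3: 500, 4: 30, 5: 20, 6: 450, 7: 10}

if __name__ == "__main__":
    print("hash loads:  ", hash_partition(sample_histogram, 3))
    print("greedy loads:", greedy_partition(sample_histogram, 3)[1])
```

With this sample histogram, the modulo assignment happens to place the three most frequent keys on the same reducer, while the greedy assignment spreads them out. A single very hot key still bounds the best achievable maximum load, since this sketch never splits one key across reducers.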
Source: Computer Science (《计算机科学》; indexed in CSCD and the Peking University Core Journals list), 2018, No. 3, pp. 158-164 (7 pages)
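Funding: Supported by the National Key Research and Development Program of China: Scientific Big Data Management System (2016YFB1000600) and Cooperative Precise Positioning Technology (2016YFB0501900).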
Keywords: Load balancing; MapReduce; Hive; Join; Reducer; ORC
