Performance Optimization for Short Job Execution in Hadoop MapReduce (cited by: 28)
Abstract Hadoop MapReduce is a widely used parallel computing framework for large-scale data-parallel processing. In recent years, because it handles large-scale data well, Hadoop MapReduce has also been increasingly adopted in query applications. To process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. As a result, Hadoop MapReduce shows clear shortcomings when it runs query applications that demand fast responses from short jobs. To improve the execution efficiency of short jobs, this paper makes three optimizations to the original Hadoop MapReduce: 1) it reduces the time spent preparing the environment at job initialization and cleaning it up at job completion by optimizing how the setup and cleanup tasks are executed; 2) it changes the assignment of the first batch of tasks from the pull model to the push model; 3) it separates the control messages exchanged between the JobTracker and the TaskTrackers during job execution from the existing periodic heartbeat mechanism and delivers them through an instant messaging mechanism. The optimizations are evaluated with BLAST, a typical MapReduce-based parallel query application. Experiments on different types of BLAST jobs show that the optimized Hadoop improves job execution performance by about 23% on average compared with standard Hadoop.
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2014, No. 6, pp. 1270-1280 (11 pages)
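Funding Special Fund Project of the National Natural Science Foundation of China (61223003); National High Technology Research and Development Program of China (863 Program) (2011AA01A202); Intel Labs (USA) University Research Grant Program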
Keywords MapReduce; parallel computing; short job; performance optimization; big data processing
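Optimization 3 in the abstract can be illustrated with a small model of notification latency. The sketch below is not the paper's implementation: the function names are hypothetical, and the 3-second heartbeat interval is an assumed value chosen only for illustration. It compares when a JobTracker would learn of a task event if the event has to wait for the next periodic heartbeat versus being sent immediately.

```python
import math


def heartbeat_notice_time(event_time, interval):
    """Time of the first periodic heartbeat at or after event_time.

    Models an event that can only be reported on the next heartbeat tick
    (ticks are assumed to occur at 0, interval, 2*interval, ...).
    """
    return math.ceil(event_time / interval) * interval


def instant_notice_time(event_time, send_delay=0.0):
    """Time at which an immediately sent message arrives.

    send_delay is a hypothetical network/handling delay.
    """
    return event_time + send_delay


def average_wait(event_times, interval):
    """Mean time events spend waiting for the next heartbeat tick."""
    waits = [heartbeat_notice_time(t, interval) - t for t in event_times]
    return sum(waits) / len(waits)


if __name__ == "__main__":
    INTERVAL = 3.0  # assumed heartbeat interval in seconds (illustrative only)
    events = [0.5, 1.5, 4.5]  # hypothetical task-completion times in seconds

    for t in events:
        hb = heartbeat_notice_time(t, INTERVAL)
        im = instant_notice_time(t)
        print(f"event at {t}s: heartbeat reports at {hb}s, instant message at {im}s")

    print("average wait for a heartbeat:", average_wait(events, INTERVAL), "s")
```

In this model each event waits up to one full interval before it is reported. For a job that runs only a few seconds, waits of that size can be a noticeable share of total run time, which is consistent with the abstract's motivation for moving control messages off the heartbeat.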