Abstract
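MapReduce-based programs are increasingly used in large-scale data analysis applications. Apache Hadoop is one of the most widely used open-source MapReduce implementations. Reducing running time is critical for MapReduce programs, as for all data-processing applications, and accurately estimating the execution time of a MapReduce program is an important step in optimizing it. This paper defines a performance model that accurately estimates the execution time of MapReduce job workloads in Hadoop 2.x. The model comprises a precedence tree model and a queueing network model, which capture, respectively, the dependencies between the different tasks of a MapReduce job and the synchronization constraints within the job. Finally, experiments demonstrate the usability of the model.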
MapReduce-based systems are increasingly being used for large-scale data analysis applications. Apache Hadoop is one of the most common open-source implementations of this paradigm. Minimizing execution time is vital for MapReduce, as for all data-processing applications, and accurate estimation of execution time is essential for optimization. In this study, the author created a MapReduce performance model for Hadoop 2.x that can precisely estimate the execution time of MapReduce workloads. The model combines a precedence tree model, which captures dependencies between the different tasks in one MapReduce job, with a queueing network model, which captures the intra-job synchronization constraints. Such an analytical performance model is a particularly attractive tool because it may provide reasonably accurate job response times at significantly lower cost than simulation experiments of real data analysis systems. Furthermore, a clear understanding of systematic job response time under different circumstances is key to making decisions in MapReduce workload management and resource capacity planning.
Author
吴岳
WU Yue (Forest Industry Planning and Design Institute, National Forestry and Grassland Administration, Beijing 100010, China)
Source
《计算机系统应用》
2021, Issue 2, pp. 219-225 (7 pages)
Computer Systems & Applications