摘要
As data volume grows, many enterprises are considering using MapReduce for its simplicity. However, how to evaluate the performance improvement before deployment is still an issue. Current researches of MapReduce performance are mainly based on monitoring and simulation, and lack mathematical models. In this paper, we present a simple but powerful performance model for the prediction of the execution time of a MapReduce program with limited resources. We study each component of MapReduce framework, and analyze the relation between the overall performance and the number of mappers and reducers based on our model. Two typical MapReduce programs are evaluated in a small cluster with 13 nodes. Experimental results show that the mathematical performance model can estimate the execution time of MapReduce programs reliably. According to our model, number of mappers and reducers can be tuned to form a better execution pipeline and lead to better performance. The model also points out potential bottlenecks of the framework and future improvement.
As data volume grows, many enterprises are considering using MapReduce for its simplicity. However, how to evaluate the performance improvement before deployment is still an issue. Current researches of MapReduce performance are mainly based on monitoring and simulation, and lack mathematical models. In this paper, we present a simple but powerful performance model for the prediction of the execution time of a MapReduce program with limited resources. We study each component of MapReduce framework, and analyze the relation between the overall performance and the number of mappers and reducers based on our model. Two typical MapReduce programs are evaluated in a small cluster with 13 nodes. Experimental results show that the mathematical performance model can estimate the execution time of MapReduce programs reliably. According to our model, number of mappers and reducers can be tuned to form a better execution pipeline and lead to better performance. The model also points out potential bottlenecks of the framework and future improvement.
基金
supported by CHB Project "Unstructured Data Management System" under Grant No.2010ZX01042-002-003