Abstract
MapReduce is a widely used parallel computing framework, and balancing the load across Reduce nodes is an important research direction for improving the execution efficiency of MapReduce programs. Sample-based partitioning is a relatively effective data partitioning method. To get the most out of sampling, this paper studies the quantitative relationship between sampling effectiveness and the main factors that influence it, presents the related theorems together with their proofs, and further verifies them through experiments. Based on these results, for a given MapReduce environment one can analyze the characteristics of the data to find the optimal sample size, and thereby obtain a data partition that meets the balance requirement at the smallest possible sampling cost. Applying the results to an improved Terasort algorithm demonstrates their practical significance on the MapReduce platform.
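As an illustration of the general idea only, the following minimal Python sketch shows sample-based range partitioning: split points are taken from the quantiles of a random sample, every key is routed to a reducer by range, and the resulting load imbalance is measured for several sample sizes. The function names, the synthetic exponential key distribution, and the imbalance metric (largest reducer load divided by the ideal load) are assumptions made for this sketch; they are not taken from the paper and do not reproduce its theorems or its improved Terasort implementation.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 split points at evenly spaced sample quantiles."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def partition_sizes(keys, split_points):
    """Count how many keys each reducer receives under range partitioning."""
    sizes = [0] * (len(split_points) + 1)
    for key in keys:
        # Keys equal to a split point go to the reducer on its right.
        sizes[bisect.bisect_right(split_points, key)] += 1
    return sizes


def imbalance(sizes):
    """Largest reducer load divided by the ideal (perfectly even) load."""
    return max(sizes) / (sum(sizes) / len(sizes))


def demo(num_keys=100_000, num_reducers=10,
         sample_sizes=(50, 500, 5000), seed=0):
    """Measure imbalance for several sample sizes on synthetic skewed keys."""
    rng = random.Random(seed)
    # Hypothetical skewed key distribution; not the paper's data set.
    keys = [rng.expovariate(1.0) for _ in range(num_keys)]
    results = {}
    for n in sample_sizes:
        sample = rng.sample(keys, n)
        splits = choose_split_points(sample, num_reducers)
        results[n] = imbalance(partition_sizes(keys, splits))
    return results


if __name__ == "__main__":
    for n, ratio in demo().items():
        print(f"sample size {n:>5}: max load / ideal load = {ratio:.3f}")
```

A ratio of 1.0 means a perfectly even partition. Running the script with different sample sizes gives a rough feel for the trade-off the abstract describes between sampling cost and partition quality.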
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals)
2013, Issue S2, pp. 77-84 (8 pages)
Journal of Computer Research and Development