
Optimization Study on Sample Based Partition on MapReduce (cited by: 13)
Abstract: MapReduce is a widely used parallel computing framework, and balancing the load across Reduce nodes is an important research direction for improving the execution efficiency of MapReduce programs. Sample-based partitioning is a relatively effective data partitioning method. To get the most benefit from sampling, this paper studies the quantitative relationship between sampling effectiveness and the main factors that influence it, presents the related theorems together with their proofs, and verifies the theorems through experiments. Based on these results, the optimal sample size for a given MapReduce environment can be found by analyzing the characteristics of the data, so that a partition meeting the requirements is obtained at the smallest possible sampling cost. Applying the results to an improved Terasort algorithm demonstrates, by example, their practical significance on the MapReduce platform.
Source: 《计算机研究与发展》 (Journal of Computer Research and Development), indexed in EI, CSCD, and the Peking University Core Journals list; 2013, Issue S2, pp. 77-84 (8 pages)
Keywords: sampling; MapReduce framework; data skew; load balance; dataset division
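
The sample-based partitioning that the abstract refers to can be illustrated with a minimal Python sketch. This is not the authors' algorithm or the Terasort implementation; it only shows the general idea of drawing a sample of keys, taking split points at evenly spaced quantiles of the sample, and routing each key to a reducer by range. All function names, the exponential key distribution, and the choice of 8 reducers are hypothetical, chosen only for illustration.

```python
import bisect
import random
from collections import Counter


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 split points at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def assign_reducer(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def reducer_loads(keys, split_points, num_reducers):
    """Count how many keys each reducer would receive."""
    counts = Counter(assign_reducer(k, split_points) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


def imbalance(loads):
    """Heaviest load divided by the ideal even load (1.0 means perfectly balanced)."""
    return max(loads) * len(loads) / sum(loads)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed skewed key distribution, for illustration only.
    keys = [rng.expovariate(1.0) for _ in range(20_000)]
    for sample_size in (10, 100, 1_000, 10_000):
        sample = rng.sample(keys, sample_size)
        splits = choose_split_points(sample, 8)
        loads = reducer_loads(keys, splits, 8)
        print(sample_size, round(imbalance(loads), 3))
```

Running the script prints the imbalance ratio for several sample sizes, which gives a rough feel for the trade-off between sampling cost and partition evenness that the paper analyzes quantitatively.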
