Abstract
MapReduce is a parallel computing model for processing large-scale data. To address the load imbalance among nodes in the reduce phase of the traditional model, this paper proposes a load-balancing partition algorithm for the reduce phase. The algorithm divides the intermediate data produced by the map phase into more partitions, which reduces the workload of each partition. Each reduce task is assigned one partition at a time and receives a new one after finishing the current partition, until all partitions have been assigned, so the load of the reduce tasks is adjusted dynamically. The paper also modifies the MapReduce communication protocol to support the algorithm and designs a new fault-tolerance mechanism. Finally, the algorithm is implemented by modifying the core of the Hadoop platform and evaluated experimentally; the results indicate that it significantly shortens job processing time without affecting the MapReduce model.
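The scheduling idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' Hadoop implementation: the function names (`partition_loads`, `static_makespan`, `dynamic_makespan`) are hypothetical, `key % num_partitions` stands in for a hash partitioner, the per-key costs are invented, and the modified communication protocol and fault-tolerance mechanism are not modeled.

```python
import heapq


def partition_loads(key_costs, num_partitions):
    """Sum the per-key cost that lands in each partition.

    Assumption: `key % num_partitions` stands in for a hash partitioner.
    """
    loads = [0] * num_partitions
    for key, cost in key_costs.items():
        loads[key % num_partitions] += cost
    return loads


def static_makespan(key_costs, num_reducers):
    """Baseline: exactly one partition per reduce task, fixed in advance."""
    return max(partition_loads(key_costs, num_reducers))


def dynamic_makespan(key_costs, num_reducers, num_partitions):
    """Finer partitions, handed out one at a time to the first idle reduce task."""
    loads = partition_loads(key_costs, num_partitions)
    # Min-heap of (time at which the reduce task becomes idle, task id).
    idle_at = [(0, r) for r in range(num_reducers)]
    heapq.heapify(idle_at)
    for load in loads:  # partitions are handed out in index order
        t, r = heapq.heappop(idle_at)
        heapq.heappush(idle_at, (t + load, r))
    # The job finishes when the last reduce task becomes idle.
    return max(t for t, _ in idle_at)


if __name__ == "__main__":
    # Invented skewed workload: 16 keys of cost 1, except two heavy keys.
    costs = {k: 1 for k in range(16)}
    costs[0] = 8
    costs[4] = 6
    print("static :", static_makespan(costs, 2))
    print("dynamic:", dynamic_makespan(costs, 2, 8))
```

In this toy workload both heavy keys hash to the same reduce task under the one-partition-per-task baseline, whereas splitting the data into eight partitions lets the other task pick up the remaining small partitions while the first is busy.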
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals index (北大核心)
2015, No. 2, pp. 240-243 (4 pages)
Journal of Chinese Computer Systems
Funding
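Supported by the National Natural Science Foundation of China (61300195)
Supported by the Fundamental Research Funds for the Central Universities (N110323009)
Supported by the General Scientific Research Project of the Education Department of Liaoning Province (L2013099)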