Abstract
To address the Reduce-phase input data skew that the Hash partitioning strategy of the MapReduce computing model tends to cause, this paper proposes HVBR-SH (Hash Virtual Balance Repartitioning based Skew Handling), a data skew handling algorithm. In the Map phase, HVBR-SH applies virtual partitioning so that <Key, Value> pairs are stored in a dispersed manner, which gives the subsequent repartitioning step better partition combinations to choose from. In the Reduce phase, HVBR-SH applies a balanced regrouping of contiguous virtual partitions: the virtual partitions collected from the Map phase are repartitioned into as many partitions as there are Reduce tasks, such that the data volume of the largest resulting partition is minimized, which speeds up the whole Reduce phase. Comparative experiments show that HVBR-SH effectively balances the input sizes of the Reduce tasks and keeps the running time under control, thereby mitigating Reduce input skew and improving the execution efficiency of MapReduce jobs.
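The two steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, since the abstract does not give the exact procedure. The function names, the CRC32 hash, and the binary search over the size cap are all assumptions chosen for illustration. The sketch assigns keys to virtual partitions by hash, then splits the ordered virtual partitions into contiguous groups so that the largest group is as small as possible.

```python
import zlib
from typing import List


def virtual_partition(key: str, num_virtual: int) -> int:
    """Map a key to a virtual partition (hypothetical helper; CRC32 is an assumed hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_virtual


def groups_needed(sizes: List[int], cap: int) -> int:
    """Count the contiguous groups needed so that no group's total exceeds cap."""
    groups, current = 1, 0
    for s in sizes:
        if current + s > cap:
            groups += 1
            current = s
        else:
            current += s
    return groups


def balance_repartition(sizes: List[int], num_reducers: int) -> List[List[int]]:
    """Split contiguous virtual partitions into at most num_reducers groups,
    minimizing the total size of the largest group.

    Returns a list of groups, each a list of virtual partition indices.
    """
    if not sizes or num_reducers < 1:
        raise ValueError("need at least one virtual partition and one reducer")
    # Binary search for the smallest feasible cap on a group's total size.
    lo, hi = max(sizes), sum(sizes)
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(sizes, mid) <= num_reducers:
            hi = mid
        else:
            lo = mid + 1
    # Rebuild the grouping greedily under the optimal cap.
    groups: List[List[int]] = []
    current: List[int] = []
    total = 0
    for idx, s in enumerate(sizes):
        if current and total + s > lo:
            groups.append(current)
            current, total = [], 0
        current.append(idx)
        total += s
    groups.append(current)
    return groups


if __name__ == "__main__":
    # Sizes of 8 virtual partitions (made-up numbers), regrouped for 3 reducers.
    sizes = [10, 80, 5, 5, 30, 20, 10, 40]
    for group in balance_repartition(sizes, 3):
        print(group, sum(sizes[i] for i in group))
```

In this sketch the greedy rebuild may return fewer groups than `num_reducers` when the optimal cap is reachable with fewer cuts. A real job would still need to assign every Reduce task a partition, which the sketch leaves out.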
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal
2015, Issue 8, pp. 1706-1710 (5 pages)
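Funding
Supported by the National Natural Science Foundation of China (U1304603)
Supported by the Key Science and Technology Research Project of the Education Department of Henan Province (13A520651)
Supported by the Major Science and Technology Special Project of Zhengzhou (131PZDZX050)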