Abstract
Data skew is one of the factors that seriously degrade MapReduce performance. Existing solutions require users either to supply a partition function tailored to the application type or to write an additional sampling pass for MapReduce, which increases the burden on users. To address this problem, we propose a load-balancing strategy based on pressure statistics. The strategy makes full use of the shuffle stage in MapReduce: statistics are collected while the reducers prepare their data, which yields the global data distribution. Based on this distribution, the system schedules the heavily loaded nodes to balance the load across the whole cluster, without requiring any additional input from the user. In addition, to account for the different types of upper-layer applications, a pressure feedback mechanism is introduced to further improve the performance of the scheduling strategy. Experimental results show that the proposed load-balancing scheduling strategy outperforms the default strategy.
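The idea of balancing reducers from a measured data distribution can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify how pressure is computed or how heavy nodes are rescheduled, and the feedback mechanism is omitted. The sketch assumes that each map task reports per-partition record counts, and it uses a plain greedy largest-first assignment as a stand-in for the scheduler. All function names (`aggregate_partition_sizes`, `balance`, `reducer_loads`) are hypothetical.

```python
from collections import Counter
import heapq


def aggregate_partition_sizes(map_outputs):
    """Sum the per-partition record counts reported by each map task.

    Assumption: every map task exposes a dict {partition_id: record_count}.
    """
    totals = Counter()
    for counts in map_outputs:
        totals.update(counts)  # Counter.update adds counts key by key
    return totals


def balance(partition_sizes, num_reducers):
    """Greedy stand-in scheduler: give the largest remaining partition
    to the reducer that currently has the smallest load."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for part, size in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[part] = reducer
        heapq.heappush(heap, (load + size, reducer))
    return assignment


def reducer_loads(partition_sizes, assignment, num_reducers):
    """Total record count handled by each reducer under an assignment."""
    loads = [0] * num_reducers
    for part, reducer in assignment.items():
        loads[reducer] += partition_sizes[part]
    return loads


if __name__ == "__main__":
    # Two map tasks, four partitions; partition 0 is heavily skewed.
    map_outputs = [{0: 90, 1: 10, 2: 10, 3: 10}, {0: 60, 1: 20, 2: 5, 3: 15}]
    sizes = aggregate_partition_sizes(map_outputs)
    plan = balance(sizes, num_reducers=2)
    print(plan, reducer_loads(sizes, plan, 2))
```

In this toy input, a modulo-style default placement (partition id mod 2) would put 165 records on one reducer, while the greedy plan caps the heaviest reducer at the size of the single skewed partition. Splitting an oversized partition, which a real skew handler may also need, is outside this sketch.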
Source
《计算机科学》
CSCD
PKU Core (Peking University Core Journals)
2015, No. 4, pp. 141-146 (6 pages)
Computer Science
Funding
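Supported by:
National Natural Science Foundation of China (61373015, 61300052, 41301407)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China (20103218110017)
Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)
Fundamental Research Funds for the Central Universities (NP2013307)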