Optimizing load balancing of joins in MapReduce

Cited by: 4

Abstract: Data analysis and processing is an important task in large-scale distributed data processing applications. Because it is simple to use and flexible, the MapReduce programming model has gradually become the core model of large-scale distributed data processing systems (e.g., Hadoop). Because the data being processed may not be uniformly distributed, data skew arises when MapReduce performs join operations, and it severely degrades join efficiency. To address data skew in MapReduce joins, the paper analyzes the causes of the join performance bottleneck, establishes a load-balancing cost model, and proposes a range-partitioning strategy (the rangepartitioner algorithm) that controls data skew during the join to achieve load balancing. Experimental results show that the proposed method markedly improves join efficiency.
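The abstract does not spell out the rangepartitioner algorithm or the cost model, so the following is only a minimal sketch of generic sample-based range partitioning, not the authors' method. It assumes that reducer load is proportional to the number of records a reducer receives, and the names `range_boundaries`, `range_partition`, and `reducer_loads` are hypothetical. The sketch picks split keys at quantiles of a key sample and compares the resulting reducer loads with naive equal-width key ranges.

```python
from bisect import bisect_left
from collections import Counter


def range_boundaries(sample_keys, num_reducers):
    """Pick split keys so each key range holds roughly equal sampled records."""
    ordered = sorted(sample_keys)
    n = len(ordered)
    # Take the key found at each i/num_reducers quantile of the sample.
    cuts = [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]
    # A very frequent key can occupy several quantiles; keep each cut once.
    return sorted(set(cuts))


def range_partition(key, boundaries):
    """Map a key to a reducer index: keys <= boundaries[0] go to reducer 0."""
    return bisect_left(boundaries, key)


def reducer_loads(keys, partition_fn, num_reducers):
    """Count how many records each reducer would receive (assumed load proxy)."""
    counts = Counter(partition_fn(k) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


# Illustrative skewed input (made up for this sketch): key k occurs 100 - k times.
keys = [k for k in range(100) for _ in range(100 - k)]
num_reducers = 4

# Baseline: equal-width key ranges, which ignore how often each key occurs.
equal_width = reducer_loads(keys, lambda k: k // 25, num_reducers)

# Range partitioning with boundaries derived from the key distribution.
boundaries = range_boundaries(keys, num_reducers)
balanced = reducer_loads(
    keys, lambda k: range_partition(k, boundaries), num_reducers
)

print("equal-width loads:", equal_width)
print("range-partition loads:", balanced)
```

For a join, both input relations would have to be partitioned with the same boundaries so that matching keys reach the same reducer. Range boundaries alone also cannot split a single dominant key, which is one limitation of this simplified sketch.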

Authors: Zhai Hongmin (翟红敏), Liu Guohua (刘国华), Zhao Wei (赵威), Liu Yuanyuan (刘源源), Zhai Hongkun (翟红坤)

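Affiliations: School of Computer Science and Technology, Donghua University; Information and Communication Company, State Grid Heilongjiang Electric Power Co., Ltd.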

Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 10, pp. 1860-1865 (6 pages)

Funding: National Natural Science Foundation of China (61070032)

Keywords: MapReduce; join; data skew; range partitioning (rangepartitioner); load balancing

Classification code: TP391.9 [Automation and Computer Technology: Computer Application Technology]
