Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop (cited by: 1)

Abstract: In MapReduce processing, when a Reduce task starts and remotely pulls the output of the Map stage, it consumes a large amount of network bandwidth and can even cause a network bottleneck. This paper proposes a task allocation strategy based on data localization and load balancing. First, the user sets the sample size M, and the first M data blocks are sampled during the Map stage. Second, Reduce tasks are assigned according to the sampling results, taking data locality into account. Third, Reduce tasks are reassigned to balance load across nodes; together these two steps produce a task allocation table. Finally, the Reduce tasks start and the system begins pulling data, with data that were not sampled assigned according to the task allocation table. Experiments show that the strategy reduces the volume of data transferred in the Shuffle stage and the network bandwidth consumed. It also avoids the situation in which some nodes sit idle while others are overloaded or unable to finish their tasks, which improves the overall data processing capacity of the cluster.
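The allocation steps in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the locality score or the balancing rule, so the function name `build_allocation_table`, the `(node, partition, bytes)` sample format, and the `max_load_ratio` threshold are all hypothetical. The sketch places each Reduce partition on the node that holds most of its sampled Map output, then moves partitions off nodes whose load exceeds a limit.

```python
from collections import defaultdict


def build_allocation_table(samples, nodes, max_load_ratio=1.5):
    """Build a partition -> node table from sampled Map output.

    samples: iterable of (node, partition, nbytes) taken from the first
             M sampled blocks (hypothetical format, for illustration).
    max_load_ratio: assumed cap on a node's load relative to the average.
    """
    local = defaultdict(lambda: defaultdict(int))  # partition -> node -> bytes
    size = defaultdict(int)                        # partition -> total bytes
    for node, part, nbytes in samples:
        local[part][node] += nbytes
        size[part] += nbytes

    # Step 1 (locality): put each partition where most of its data already is.
    table = {}
    load = {n: 0 for n in nodes}
    for part in size:
        best = max(nodes, key=lambda n: local[part][n])
        table[part] = best
        load[best] += size[part]

    # Step 2 (load balance): move the smallest partitions off overloaded nodes,
    # preferring the target that already holds the most data for the partition.
    limit = max_load_ratio * sum(size.values()) / len(nodes)
    for part in sorted(table, key=lambda p: size[p]):
        src = table[part]
        if load[src] <= limit:
            continue
        candidates = [n for n in nodes
                      if n != src and load[n] + size[part] <= limit]
        if not candidates:
            continue
        dst = max(candidates, key=lambda n: (local[part][n], -load[n]))
        table[part] = dst
        load[src] -= size[part]
        load[dst] += size[part]
    return table, load


# Made-up sample: node n1 holds most of the output for partitions 0, 1 and 2.
nodes = ["n1", "n2", "n3"]
samples = [
    ("n1", 0, 80), ("n2", 0, 20),
    ("n1", 1, 60), ("n3", 1, 10),
    ("n1", 2, 50), ("n2", 2, 40),
    ("n3", 3, 30),
]
table, load = build_allocation_table(samples, nodes)
```

With this sample, locality alone would place partitions 0, 1 and 2 on `n1`. The balancing pass moves partition 1 to `n3` and partition 2 to `n2`, the nodes that already hold part of their data. Map output produced after sampling would then be routed by a lookup such as `table[partition]`.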

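Author: Wang Hao (王浩)

Affiliation: Information Center, The Second Affiliated Hospital of Chongqing Medical University

Source: Computer and Modernization (《计算机与现代化》), 2016, No. 1, pp. 114-120 (7 pages)

Funding: Chongqing Science and Technology Program project (cstc2013jcsf10034)

Keywords: sampling; MapReduce; localization; task allocation; load balance

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]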


