Handling Data Skew in MapReduce Programming Model Based on Virtual Partitioning Method

Cited by: 5

Abstract: The Hash partitioning strategy of the MapReduce computing model tends to skew the input data of the Reduce phase. To address this, the paper proposes HVBR-SH (Hash Virtual Balance Repartitioning based Skew Handling), a skew-handling algorithm based on balanced repartitioning of hash virtual partitions. In the Map phase, HVBR-SH applies virtual partitioning so that <Key, Value> pairs are stored in a dispersed manner, which gives the subsequent repartitioning step more partition combinations to choose from. In the Reduce phase, HVBR-SH rebalances consecutive virtual partitions: the collected virtual partitions are regrouped into as many partitions as there are Reduce tasks, such that the data volume of the largest resulting partition is minimized, which speeds up the whole Reduce phase. Comparative experiments show that HVBR-SH effectively balances the input size of the Reduce tasks and keeps running time under control, mitigating Reduce input skew and improving the execution efficiency of MapReduce jobs.
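The abstract does not spell out how HVBR-SH performs the balanced regrouping, so the following is only a minimal sketch of the idea under stated assumptions. It hashes keys into more virtual partitions than there are Reduce tasks, then groups consecutive virtual partitions so that the largest group is as small as possible. The binary search with a greedy feasibility check is a standard technique for this contiguous partition problem and is swapped in here for illustration; it is not claimed to be the authors' method. All function names, the use of `zlib.crc32` as the hash, and the sample sizes are hypothetical.

```python
import zlib


def virtual_partition_sizes(keys, num_virtual):
    """Count how many records hash into each virtual partition."""
    sizes = [0] * num_virtual
    for key in keys:
        # Assumption: crc32 stands in for the job's hash function.
        sizes[zlib.crc32(key.encode("utf-8")) % num_virtual] += 1
    return sizes


def groups_needed(sizes, cap):
    """Greedily pack consecutive partitions; return the number of groups used.

    Assumes cap >= max(sizes), so every partition fits in some group.
    """
    groups, current = 1, 0
    for size in sizes:
        if current + size > cap:
            groups += 1
            current = size
        else:
            current += size
    return groups


def balanced_repartition(sizes, num_reducers):
    """Group consecutive virtual partitions into at most num_reducers groups.

    Returns (largest_group_size, groups), where groups is a list of lists of
    virtual-partition indices and largest_group_size is minimal.
    Assumes sizes is non-empty.
    """
    lo, hi = max(sizes), sum(sizes)
    # Binary search for the smallest cap that needs no more than num_reducers groups.
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(sizes, mid) <= num_reducers:
            hi = mid
        else:
            lo = mid + 1
    # Rebuild the actual grouping for the optimal cap.
    groups, current, total = [], [], 0
    for index, size in enumerate(sizes):
        if current and total + size > lo:
            groups.append(current)
            current, total = [], 0
        current.append(index)
        total += size
    groups.append(current)
    return lo, groups


if __name__ == "__main__":
    # Hypothetical record counts for 8 virtual partitions, 3 Reduce tasks.
    cap, groups = balanced_repartition([9, 1, 1, 1, 8, 2, 2, 6], 3)
    print(cap, groups)
```

With the hypothetical sizes above, the three groups each hold 10 records, whereas assigning fixed blocks of virtual partitions without looking at their sizes could leave one Reduce task with a much larger share. Note that the greedy rebuild may return fewer groups than Reduce tasks when the data is very uneven; how HVBR-SH handles that case is not stated in the abstract.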

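Authors: Gao Yufei (高宇飞), Cao Yangjie (曹仰杰), Tao Yongcai (陶永才), Shi Lei (石磊)

Affiliations: School of Information Engineering, Zhengzhou University; School of Software, Zhengzhou University

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 8, pp. 1706-1710 (5 pages)

Funding: National Natural Science Foundation of China (U1304603); Key Science and Technology Research Project of the Education Department of Henan Province (13A520651); Zhengzhou Major Science and Technology Special Project (131PZDZX050)

Keywords: MapReduce; data skew; virtual partitioning

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]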
