
Handling Data Skew in MapReduce Programming Model Based on Virtual Partitioning Method (cited by: 5)
Abstract: The Hash partitioning strategy of the MapReduce computing model tends to skew the input data of the Reduce phase. To address this problem, the paper proposes HVBR-SH (Hash Virtual Balance Repartitioning based Skew Handling), a data skew handling algorithm based on Hash virtual balance repartitioning. In the Map phase, HVBR-SH applies virtual partitioning so that <Key, Value> pairs are stored in a dispersed manner, which gives the subsequent repartitioning step better partition combinations to choose from. In the Reduce phase, HVBR-SH uses balanced regrouping of contiguous virtual partitions to divide the collected virtual partitions into as many partitions as there are Reduce tasks, while ensuring that the data volume of the largest resulting partition is as small as possible, which speeds up the whole Reduce phase. Comparative experiments show that HVBR-SH effectively balances the input sizes of the Reduce tasks and keeps the running time under control, mitigating Reduce input skew and improving the execution efficiency of MapReduce jobs.
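The abstract states the goal of the Reduce-phase step (regroup contiguous virtual partitions into as many groups as there are Reduce tasks so that the largest group is as small as possible) but does not give the procedure. The Python sketch below shows one standard way to reach that goal, assuming the sizes of the virtual partitions are already known as non-negative integers: a binary search on the maximum group load combined with a greedy feasibility check. It is not taken from the paper, and the names `hash_virtual_partition`, `balance_repartition`, and `_groups_needed` are hypothetical.

```python
from typing import List


def hash_virtual_partition(key: str, num_virtual: int) -> int:
    """Assign a key to one of num_virtual virtual partitions.

    Assumption: a simple deterministic string hash stands in for the real
    partitioner; Python's built-in hash() is salted per process.
    """
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_virtual


def _groups_needed(sizes: List[int], cap: int) -> int:
    """Count the contiguous groups needed if no group may exceed cap."""
    groups, current = 1, 0
    for s in sizes:
        if current + s > cap:
            groups += 1      # close the current group and start a new one
            current = s
        else:
            current += s
    return groups


def balance_repartition(sizes: List[int], num_reducers: int) -> List[List[int]]:
    """Split contiguous virtual partitions into num_reducers groups.

    Returns lists of virtual-partition indices, one list per Reduce task,
    such that the largest group load is minimal among contiguous splits.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    if not sizes:
        return [[] for _ in range(num_reducers)]

    # Binary search for the smallest cap that fits into num_reducers groups.
    lo, hi = max(sizes), sum(sizes)
    while lo < hi:
        mid = (lo + hi) // 2
        if _groups_needed(sizes, mid) <= num_reducers:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the actual grouping greedily under the optimal cap.
    groups: List[List[int]] = [[]]
    current = 0
    for idx, s in enumerate(sizes):
        if groups[-1] and current + s > lo:
            groups.append([])
            current = 0
        groups[-1].append(idx)
        current += s

    # Pad with empty groups if fewer than num_reducers were needed.
    while len(groups) < num_reducers:
        groups.append([])
    return groups
```

For example, calling `balance_repartition([10, 20, 30, 40, 50, 60, 70, 80, 90], 3)` groups the nine virtual partitions into loads of 150, 130, and 170. Using more virtual partitions than Reduce tasks gives the search finer-grained pieces to combine, which is the benefit the abstract attributes to virtual partitioning.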
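Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2015, No. 8, pp. 1706-1710 (5 pages)
Funding: National Natural Science Foundation of China (U1304603); Key Science and Technology Research Project of the Henan Provincial Department of Education (13A520651); Zhengzhou Major Science and Technology Special Project (131PZDZX050)
Keywords: MapReduce; data skew; virtual partitioning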
