期刊文献+

基于增量式分区策略的MapReduce数据均衡方法 被引量:23

An Incremental Partitioning Strategy for Data Balance on MapReduce
下载PDF
导出
摘要 MapReduce以其简洁的编程模型,被广泛应用于大规模和高维度数据集的处理,如日志分析、文档聚类和其他数据分析.开源系统Hadoop很好地实现了MapReduce模型,但由于自身采用一次分区机制,即通过Hash/Range分区函数对数据进行一次划分,导致在处理密集数据时,Reduce端常会出现数据倾斜的问题.虽然系统为用户提供了自定义分区函数方法,但不幸的是在不清楚输入数据分布的情况下,数据倾斜问题很难被避免.为解决数据划分的不均衡,该文提出一种将分区向Reducer指派时按照多轮分配的分区策略.该方法首先在Map端产生多于Reducer个数的细粒度分区,同时在Mapper运行过程中实时统计各细粒度分区的数据量;然后由JobTracker根据全局的分区分布信息筛选出部分未分配的细粒度分区,并用代价评估模型将选中的细粒度分区分配到各Reducer上;依照此方法,经过多轮的筛选、分配,最终在执行Reduce()函数前,将所有细粒度分区分配到Reduce端,以此解决分区后各Reducer接收数据总量均衡的问题.最后在Zipf分布数据集和真实数据集上与现有的分区切分方法Closer进行了对比,增量式分区策略更好地解决了数据划分后的均衡问题. MapReduce has been widely used in processing large data sets in a distributed cluster as a flexible computation model, such as log analysis, document clustering and other forms of data analytics. In the MapReduce open-source platform Hadoop, the default Hash/Range partition scheme usually results in unbalanced data load in the Reduce phase. Even though Hadoop allows users to define a partition function, it is difficult to achieve balanced data load without detailed information on data distribution. In this paper, we propose a novel multiple-round approach to balance data load in the Reduce phase. In our proposal, Mapper produces more fine-grained partitions than the number of Reducer and gathers the statistics on the sizes of fine-grained partitions. And then, JobTracker selects appropriate fine-grained partitions to be allocated to Reducers before running Reduce ( ) function. We introduce a cost model and propose a heuristic assignment algorithm for this task. Finally, we experimentally compare our approach with Closer, which uses a segment partition method, on both synthetic and real datasets. The experimental results show our method achieves more balanced data load.
出处 《计算机学报》 EI CSCD 北大核心 2016年第1期19-35,共17页 Chinese Journal of Computers
基金 国家"九七三"重点基础研究发展规划项目基金(2012CB316203) 国家自然科学基金(61033007 61332006 61472321) 西北工业大学基础研究基金(3102014JSJ0005 3102014JSJ0013)资助
关键词 增量分配 细粒度分区 数据倾斜 均衡分区 MAPREDUCE 大数据 incremental allocation fine-grained partition~ data skew~ balanced partitioning^MapReduce~ big data
  • 相关文献

参考文献20

  • 1Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Operating Systems Design : Implementation, 2004, 51(1) : 147-152.
  • 2Shvachko K, Kuang H, Radia S, et al. The hadoop distributed file system//Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). Nevada, USA, 2010:1-10.
  • 3王珊,王会举,覃雄派,周烜.架构大数据:挑战、现状与展望[J].计算机学报,2011,34(10):1741-1752. 被引量:615
  • 4潘巍,李战怀,伍赛,陈群.基于消息传递机制的MapReduce图算法研究[J].计算机学报,2011,34(10):1768-1784. 被引量:45
  • 5Rasmussen A, Conley M, Kapoor R, et at. Themis: An I/O efficient MapReduce//Proceedings of the ACM Symposium on Cloud Computing (SOCC'12). San Jose, USA, 2012.
  • 6Ren K, Kwon Y, Balazinska M, Howe B. Hadoop's adolescence: A comparative workload analysis from three research clusters. Carnegie Mellon University (CMU), USA: Technical Report CMU-PDL-12-106, 2012.
  • 7Lin J, et al. The curse of Zipf and limits to parallelization: A look at the stragglers problem in MapReduee//Proceedings of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval. Boston, USA, 2009.
  • 8Gufler B, Augsten N, Reiser A, Kemper A. Handing data skew in MapReduce//Proeeedings of the 1st InternationalConference on Cloud Computing and Services Science. Noordwijkerhout, The Netherlands, 2011, 146:574 583.
  • 9Racha S C. Load Balancing Map-Reduce Communications for Efficient Executions of Applications in a Cloud [M]. S. disser tation]. Indian Institute of Science, Bangalore, India, 2012.
  • 10Kwon Y, et al. A study of skew in MapReduce applications. Open Cirrus Summit, Moscow, Russia, 2011.

二级参考文献76

  • 1[OL].<http://hadoop.apache.org.>.
  • 2WinterCorp: 2005 TopTen Program Summary. http:// www. wintercorp, com/WhitePapers/WC TopTenWP. pdf.
  • 3TDWI Checklist Report: Big Data Analytics. http://tdwi. org/research/2010/08/Big-Data-Analytics, aspx.
  • 4Chaudhuri S, Dayal U. An overview of data warehousing and OLAP technology. SIGMOD Rec, 1997,26(1): 65-74.
  • 5Madden S, DeWitt D J, Stonebraker M. Database parallelism choices greatly impact scalability. DatabaseColumn Blog. http://www, databasecolumn, com/2007/10/database-parallelism-choices, html.
  • 6Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters//Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI ' 04). San Francisco, California, USA, 2004: 137-150.
  • 7DeWitt D J, Gerber R H, Graefe G, Heytens M L, Kumar K B, Muralikrishna M. GAMMA--A high performance dataflow database machine//Proceedings of the 12th International Conference on Very Large Data Bases (VLDB' 86). Kyoto, Japan, 1986:228-237.
  • 8Fushimi S, Kitsuregawa M, Tanaka H. An overview of the system software of a parallel relational database machine// Proceedings of the 12th International Conference on Very Large DataBases(VLDB'86). Kyoto, Japan, 1986:209-219.
  • 9Brewer E A. Towards robust distributed systems//Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (PODC' 00). Portland, Oregon, USA, 2000:7.
  • 10http: //www. dbms2, com/2008/08/26/known-applications of mapreduce/.

共引文献664

同被引文献99

引证文献23

二级引证文献108

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部