
Research on Handling Data Skew in MapReduce (MapReduce中数据倾斜解决方法的研究)

Cited by: 3
Abstract: With the rapid development of the mobile Internet and the Internet of Things, data volumes are growing explosively, and we have entered the era of big data. MapReduce is a distributed computing framework capable of processing massive data sets and has become a research focus in the big data field. However, MapReduce performance depends heavily on how the data is distributed. When the data is skewed, the default hash partitioning cannot guarantee balanced load across nodes in the Reduce phase, and the heavily loaded nodes delay the final completion time of the job. To address this problem, the paper uses sampling. Before the user job executes, a separate MapReduce job samples the input in parallel. Once the frequency distribution of the keys has been obtained, it is combined with data locality to produce a load-balanced data assignment strategy. An experimental platform was built and the WordCount example was tested on it. The results show that the sampling-based partitioning strategy outperforms MapReduce's default hash partitioning, and that sampling-based partitioning combined with data locality outperforms sampling-based partitioning that does not consider data locality.
Authors: 王刚 (Wang Gang), 李盛恩 (Li Sheng'en)
Source: Computer Technology and Development (《计算机技术与发展》), 2016, No. 9, pp. 201-204 (4 pages)
Funding: National Natural Science Foundation of China (61170052)
Keywords: big data; MapReduce; load balancing; sampling
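
The abstract describes the approach only at a high level: sample the keys, estimate their frequency distribution, then assign keys to reducers so that load is balanced. The Python sketch below illustrates that general idea and is not the authors' algorithm. It assumes a simple Bernoulli sample, uses a greedy "heaviest key to lightest reducer" assignment chosen purely for illustration, and ignores data locality, which the paper also takes into account. All function names are hypothetical, and CRC32 stands in for the default hash partitioner.

```python
import heapq
import random
import zlib
from collections import Counter


def sample_key_counts(records, rate, seed=0):
    """Estimate key frequencies from a Bernoulli sample of the input keys."""
    rng = random.Random(seed)
    return Counter(key for key in records if rng.random() < rate)


def hash_partition(key, num_reducers):
    """Stand-in for default hash partitioning (deterministic CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def balanced_assignment(key_counts, num_reducers):
    """Greedy assignment: heaviest keys first, each to the currently lightest reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (estimated load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def make_partitioner(assignment, num_reducers):
    """Use the sampled assignment; keys missed by the sample fall back to hashing."""
    def partition(key):
        return assignment.get(key, hash_partition(key, num_reducers))
    return partition


def reducer_loads(records, partition, num_reducers):
    """Count how many records each reducer would receive."""
    loads = [0] * num_reducers
    for key in records:
        loads[partition(key)] += 1
    return loads


# A small skewed WordCount-style input: key "a" occurs 6 times, "f" only once.
records = [w for w, n in zip("abcdef", (6, 5, 4, 3, 2, 1)) for _ in range(n)]
num_reducers = 2

counts = sample_key_counts(records, rate=1.0)  # rate=1.0 keeps every record
sampled = make_partitioner(balanced_assignment(counts, num_reducers), num_reducers)

hash_loads = reducer_loads(records, lambda k: hash_partition(k, num_reducers), num_reducers)
sampled_loads = reducer_loads(records, sampled, num_reducers)
print("hash partitioning loads:   ", hash_loads)
print("sampled partitioning loads:", sampled_loads)
```

With the full sample used here, the greedy assignment splits the 21 records into reducer loads of 11 and 10, which is the best possible split for two reducers. In practice the sampling rate would be far below 1.0, so the estimated counts, and therefore the balance achieved, would only approximate this.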
