Research on Data Skew Join Algorithm Based on MapReduce Model (数据倾斜情况下基于MapReduce模型的连接算法研究)
Cited by: 1
Abstract: Join algorithms based on MapReduce are an important topic in massive-data research. However, most existing optimization work assumes that the data are evenly distributed, whereas in practical applications the data to be processed are often skewed. Against this background, this paper proposes Skew Control Join, a join algorithm built on the MapReduce programming model and suited to severely skewed data. The algorithm obtains the overall distribution of the data set by sampling, then splits the data set with a total (global) partitioner so that the processing of the skewed data is spread evenly across all Reduce tasks. Experimental results show that the proposed algorithm performs well when the processed data are skewed, meeting the research goal.
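The two steps named in the abstract, sampling to estimate the key distribution and then partitioning on that estimate, can be illustrated with a small single-process sketch. This is not the paper's implementation: the function names (`sample_split_points`, `candidate_partitions`, `skew_aware_join`) are hypothetical, the reducers are simulated with in-memory lists, and the choice to scatter the skewed table's records while replicating the matching records of the other table is an assumption, since the abstract does not say how matching records are routed.

```python
import bisect
import random
from collections import Counter


def sample_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sorted sample approximate the distribution.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def candidate_partitions(key, split_points):
    """Return the partitions a key may be sent to.

    A heavily skewed key appears as several equal split points, so it is
    allowed to occupy several partitions; most other keys get exactly one.
    """
    lo = bisect.bisect_left(split_points, key)
    hi = bisect.bisect_right(split_points, key)
    return range(lo, hi + 1)


def skew_aware_join(left, right, num_reducers, sample_size=100, seed=0):
    """Join two lists of (key, value); `left` is assumed to be the skewed side."""
    split_points = sample_split_points(
        [k for k, _ in left], num_reducers, sample_size, seed
    )
    left_parts = [[] for _ in range(num_reducers)]
    right_parts = [[] for _ in range(num_reducers)]

    # Assumption: scatter records of the skewed side round-robin over the
    # partitions their key may occupy.
    next_slot = Counter()
    for key, value in left:
        parts = candidate_partitions(key, split_points)
        target = parts[next_slot[key] % len(parts)]
        next_slot[key] += 1
        left_parts[target].append((key, value))

    # Assumption: replicate the other side to every partition its key may
    # occupy, so each scattered record still meets all of its matches.
    for key, value in right:
        for target in candidate_partitions(key, split_points):
            right_parts[target].append((key, value))

    # Each simulated reducer performs a local hash join.
    output = []
    for lp, rp in zip(left_parts, right_parts):
        index = {}
        for key, value in rp:
            index.setdefault(key, []).append(value)
        for key, lv in lp:
            for rv in index.get(key, []):
                output.append((key, lv, rv))
    loads = [len(p) for p in left_parts]
    return output, loads
```

Calling `skew_aware_join(left, right, 4)` returns the joined tuples together with the number of `left` records each simulated reducer received, which makes it easy to compare the load balance against plain hash partitioning on the join key.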
Source: Computer and Modernization (《计算机与现代化》), 2013, No. 5, pp. 22-27 (6 pages)
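Funding: National Basic Research Program of China (973 Program) (2012CB316203); Key Program of the National Natural Science Foundation of China (61033007); National High Technology Research and Development Program of China (863 Program) (2012AA011004)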
Keywords: join algorithm; data skew; total partition; sampling


