
Load Balancing Strategy of MapReduce Clustering Based on Index Shift (cited by: 6)
Abstract: MapReduce is a distributed programming model widely used to process large-scale, high-dimensional datasets. It partitions data with a plain hash function, so data skew often arises when the data distribution is non-uniform. MapReduce-based clustering algorithms require many iterations, and the distribution of the Reduce input at each stage is not known in advance, so existing methods for handling data skew do not apply. To address unbalanced data partitioning, this paper proposes a strategy that changes the index of the remaining partitions when data skew is present. While the Map tasks run, the method counts the amount of data to be assigned to each reducer; the JobTracker monitors the global partition information and dynamically modifies the original partition function according to a data skew model. In the subsequent partitioning, the Partitioner re-indexes the partitions that would cause skew onto other, more lightly loaded reducers, so that the load across nodes is balanced. The proposed algorithm is compared with existing skew-handling methods on Zipf-distributed datasets and real datasets. The results show that the strategy resolves data skew in MapReduce clustering and outperforms both hash partitioning and sampling-based dynamic partitioning in stability and execution time.
Authors: ZHOU Hua-ping; LIU Guang-zong; ZHANG Bei-bei (College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, Anhui 232000, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2018, No. 5, pp. 303-309 (7 pages)
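Funding: National Natural Science Foundation of China (51174257); Bidding Project of the Research Center for Safety Management of Mining Enterprises, Anhui University of Science and Technology (SK2015A084); Anhui Provincial Support Program for Outstanding Young Talents in Universities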
Keywords: MapReduce; data skew; load balancing; distributed clustering; index shift
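
The re-indexing idea summarized in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the class `IndexShiftPartitioner`, the `tolerance` parameter, the "fair share" skew test, and the `simulate` helper are all hypothetical names and rules chosen for illustration, and the abstract does not specify the paper's actual data skew model. The sketch also assumes that records with the same key may be split across reducers, which requires a mergeable reduce step (for example, partial sums for cluster centroids).

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_reducers: int) -> int:
    """Baseline partitioning: stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


class IndexShiftPartitioner:
    """Hypothetical partitioner that shifts a record's partition index
    to the least-loaded reducer when the hashed target is overloaded."""

    def __init__(self, num_reducers: int, tolerance: float = 0.2):
        self.num_reducers = num_reducers
        self.tolerance = tolerance          # assumed slack above the fair share
        self.loads = [0] * num_reducers     # records assigned so far per reducer

    def assign(self, key: str) -> int:
        idx = hash_partition(key, self.num_reducers)
        # Fair share per reducer once this record has been placed.
        fair_share = (sum(self.loads) + 1) / self.num_reducers
        # Assumed skew test: target would exceed the fair share by more than tolerance.
        if self.loads[idx] + 1 > (1 + self.tolerance) * fair_share:
            # Shift the index to the currently least-loaded reducer.
            idx = min(range(self.num_reducers), key=self.loads.__getitem__)
        self.loads[idx] += 1
        return idx


def simulate(keys, num_reducers, tolerance=0.2):
    """Return per-reducer loads for plain hashing and for index shifting."""
    counts = Counter(hash_partition(k, num_reducers) for k in keys)
    hash_loads = [counts[i] for i in range(num_reducers)]
    partitioner = IndexShiftPartitioner(num_reducers, tolerance)
    for k in keys:
        partitioner.assign(k)
    return hash_loads, partitioner.loads


if __name__ == "__main__":
    # Skewed input: one hot key dominates, the rest is spread over 20 keys.
    keys = ["hot"] * 800 + [f"k{i % 20}" for i in range(200)]
    hash_loads, shifted_loads = simulate(keys, 4)
    print("hash partition loads: ", hash_loads)
    print("index shift loads:    ", shifted_loads)
```

With the skewed input above, plain hashing sends all 800 hot-key records to one reducer, while the shifting rule keeps every reducer within the assumed tolerance of the fair share. In a real Hadoop job the load counters would have to be collected globally (the abstract assigns this role to the JobTracker) rather than held in one process.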
