Comparison Research on Mahout Clustering Algorithms under Hadoop Platform
Cited by: 11
Abstract: Clustering is an important data mining technique that partitions a set of physical or abstract objects into classes of similar objects. How to apply traditional clustering algorithms to large-scale data is an active research question in the big data field. This paper compares the principles and MapReduce implementations of three clustering algorithms (Canopy, standard K-means, and fuzzy K-means) in Mahout, the open-source machine learning library for the Hadoop cloud computing platform. The three algorithms were run on clusters with different numbers of nodes and on data sets of different sizes, and were compared in terms of speedup, scalability, and sizeup (scale growth). The experimental results show that, in a parallel environment: Canopy runs fastest, K-means second, and fuzzy K-means slowest; all three algorithms achieve good speedup, with Canopy the best, while the speedup of fuzzy K-means rises sharply once the data volume and node count reach a certain scale; all three algorithms show good scalability and sizeup, both of which improve as the data size grows, with Canopy having the best scalability and fuzzy K-means showing the largest improvement in scalability and sizeup.
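The abstract names three evaluation measures but does not give their formulas. As a rough illustration only, the sketch below uses the definitions commonly found in the parallel clustering literature; whether the paper uses exactly these is an assumption. The function names and all timing values are hypothetical and are not results from the paper.

```python
# A minimal sketch of three common parallel-performance measures.
# Assumption: these are the textbook definitions, not necessarily the
# exact formulas used in the paper. All numbers below are made up.

def speedup(time_one_node: float, time_m_nodes: float) -> float:
    """Same data set: run time on 1 node divided by run time on m nodes.
    Ideal (linear) speedup equals m."""
    return time_one_node / time_m_nodes


def scaleup(time_base: float, time_scaled: float) -> float:
    """Run time for data D on 1 node divided by run time for m*D on m nodes.
    Ideal value is 1.0 (work and resources grow together)."""
    return time_base / time_scaled


def sizeup(time_base_data: float, time_scaled_data: float) -> float:
    """Fixed node count: run time for m*D divided by run time for D.
    A value near or below m means the algorithm handles growth well."""
    return time_scaled_data / time_base_data


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(speedup(120.0, 40.0))   # 1 node vs. m nodes, same data
    print(scaleup(100.0, 125.0))  # D on 1 node vs. m*D on m nodes
    print(sizeup(50.0, 150.0))    # D vs. m*D on the same nodes
```

Under these definitions, the abstract's "scalability" corresponds to scaleup and "scale growth" to sizeup.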
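Authors: Niu Yihan (牛怡晗), Hai Mo (海沫)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2015, Issue S1, pp. 465-469 (5 pages)
Funding: Supported by the Beijing Higher Education Young Elite Teacher Project (YETP0988)
Keywords: clustering, Hadoop, Mahout, K-means, fuzzy K-means, Canopy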

