Optimization of K-means Clustering Algorithm Based on MapReduce (cited by: 5)
Abstract: Traditional K-means clustering depends heavily on the choice of initial centers, tends to converge to a local rather than a global optimum, and struggles to handle massive data sets. To address these shortcomings, an improved K-means clustering algorithm based on MapReduce is proposed. The algorithm first uses systematic sampling to obtain a representative sample set that stands in for the full data set. It then applies a density method and the Max-Min distance method to select optimized initial cluster centers, and uses the Canopy algorithm to produce a coarse clustering that reduces the computational scale. Finally, the algorithm is parallelized by chaining MapReduce jobs sequentially, so that it can make full use of the computing and storage capacity of a cluster and scale to massive data. The improved algorithm is compared with traditional clustering algorithms, and the results show that it performs better: it reduces the dependence on the initial cluster centers, improves clustering accuracy, lowers the number of iterations and the clustering time, and shows a larger performance advantage when processing massive data.
Source: Computer Measurement & Control (《计算机测量与控制》), 2016, No. 7, pp. 272-275, 279 (5 pages)
Funding: National Natural Science Foundation of China (11271057, 51176016); Natural Science Foundation of Jiangsu Province (BK2009535)
Keywords: K-means clustering algorithm; sampling; Canopy algorithm; Max-Min distance method
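
The abstract names the building blocks but not their details, so the following is only a minimal single-machine sketch of three of them: systematic sampling, Max-Min distance seeding, and one K-means iteration split into a map step and a reduce step. All function names are hypothetical, the first seed is arbitrarily taken to be the first sample point, and the density method and the Canopy pre-clustering mentioned in the abstract are left out. It should not be read as the authors' implementation.

```python
import math
from collections import defaultdict


def systematic_sample(points, step, start=0):
    """Systematic sampling: keep every `step`-th point, beginning at `start`."""
    return points[start::step]


def max_min_centers(points, k):
    """Max-Min distance seeding (assumed variant).

    The first seed is points[0]; each further seed is the point whose
    distance to its nearest already-chosen seed is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def kmeans_reduce(members):
    """Reduce step: average all points assigned to one center."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def kmeans_iteration(points, centers):
    """One K-means iteration, simulating the shuffle between map and reduce."""
    groups = defaultdict(list)
    for p in points:
        idx, pt = kmeans_map(p, centers)
        groups[idx].append(pt)
    # A center that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = max_min_centers(data, 2)
    print(kmeans_iteration(data, seeds))
```

In a real MapReduce deployment the grouping done here with a dictionary would be performed by the framework's shuffle phase, and each iteration would run as one job in a sequential chain, with the updated centers passed to the next job.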