HKM Clustering Algorithm Design and Research Based on Hadoop Platform
Cited by: 9
Abstract: To address the scalability problem that the traditional K-means clustering algorithm faces on large-scale data sets, a Hadoop K-means (HKM) clustering algorithm is proposed. The algorithm first removes isolated points and noise points from the data set according to sample density; it then selects the K initial centers with the maximum-minimum-distance criterion, so that the initial cluster centers are close to optimal; finally, the algorithm is parallelized with the MapReduce programming model on the Hadoop cloud computing platform. Experimental results show that the algorithm not only achieves high accuracy and stability in its clustering results, but also effectively resolves the scalability problem that traditional clustering algorithms encounter when processing large-scale data.
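The abstract only outlines the three steps and gives no implementation details. The sketch below is a minimal, single-machine Python illustration of those ideas: density-based noise filtering, maximum-minimum-distance selection of initial centers, and the per-iteration map/reduce logic that a Hadoop job would distribute. The function names, the radius/min_neighbors parameters, and the choice of the first center are illustrative assumptions, not the authors' code.

```python
import numpy as np

def density_filter(points, radius, min_neighbors):
    """Drop points whose neighborhood within `radius` contains fewer than
    `min_neighbors` other samples; such low-density points are treated as
    isolated/noise points before center selection (thresholds are assumed)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    counts = (dists <= radius).sum(axis=1) - 1  # exclude the point itself
    return points[counts >= min_neighbors]

def max_min_init(points, k):
    """Maximum-minimum-distance initialization: start from one point (here
    simply the first filtered point, an assumption) and repeatedly add the
    point whose minimum distance to the already chosen centers is largest."""
    centers = [points[0]]
    for _ in range(1, k):
        chosen = np.array(centers)                                  # (m, d)
        d = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=-1)
        centers.append(points[np.argmax(d.min(axis=1))])
    return np.array(centers)

def map_point(point, centers):
    """Map step of one K-means iteration: emit (nearest-center-id, (point, 1)).
    On Hadoop this would run inside a Mapper over a split of the data set."""
    idx = int(np.argmin(np.linalg.norm(centers - point, axis=1)))
    return idx, (point, 1)

def reduce_center(values):
    """Reduce step: average all points assigned to one center to obtain the
    updated center for the next iteration."""
    total = sum(p for p, _ in values)
    count = sum(c for _, c in values)
    return total / count

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 2))
    filtered = density_filter(data, radius=0.5, min_neighbors=3)
    print(max_min_init(filtered, k=3))
```

In the parallel version described by the abstract, map_point and reduce_center correspond to the Mapper and Reducer of one MapReduce job per iteration, with the current centers broadcast to all map tasks.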
Authors: ZHANG Shu-fen, DONG Yan-yan, CHEN Xue-bin (College of Science, North China University of Science and Technology, Tangshan 063009, Hebei Province, China; Hebei Key Laboratory of Data Science & Application, Tangshan 063009, Hebei Province, China)
Source: Journal of Applied Sciences (《应用科学学报》; CAS, CSCD, Peking University core journal), 2018, No. 3, pp. 524-534 (11 pages)
Keywords: K-means algorithm; sample density; maximum-minimum distance; Hadoop platform; parallel computing
