Abstract
To effectively solve the scalability problem that the traditional K-means clustering algorithm faces when processing large-scale data sets, a Hadoop K-means (HKM) clustering algorithm is proposed. Firstly, the algorithm removes the influence of isolated points and noise points in the data set based on sample density. Secondly, it selects K initial centers guided by the maximin-distance (maximizing the minimum distance) principle, so that the initial cluster centers are optimized. Finally, the MapReduce programming model of the Hadoop cloud computing platform is used to parallelize the algorithm. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also solves the scalability problem that traditional clustering algorithms encounter when dealing with large-scale data.
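The two preprocessing steps summarized in the abstract — density-based removal of isolated/noise points, then maximin-distance selection of the K initial centers — can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, a simple neighbour-count density rule, and taking the first surviving point as the initial seed; the function names (`filter_outliers`, `maximin_centers`) and parameters (`eps`, `min_pts`) are illustrative and not taken from the paper.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_outliers(points, eps, min_pts):
    """Keep only points with at least min_pts neighbours within radius eps.

    This neighbour-count rule is one simple density criterion; the paper's
    exact density measure is an assumption here.
    """
    return [p for p in points
            if sum(1 for q in points if q is not p and dist(p, q) <= eps) >= min_pts]

def maximin_centers(points, k):
    """Pick k initial centers by the maximin-distance rule: each new center
    is the point whose nearest already-chosen center is farthest away.

    The choice of the first point as the initial seed is a simplification.
    """
    centers = [points[0]]
    while len(centers) < k:
        next_pt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(next_pt)
    return centers
```

In the parallel version described in the abstract, the subsequent K-means assignment step would run as a MapReduce job: mappers assign each point to its nearest center, and reducers recompute the centers from the assigned points.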
Authors
ZHANG Shu-fen (张淑芬)
DONG Yan-yan (董岩岩)
CHEN Xue-bin (陈学斌)
(College of Science, North China University of Science and Technology, Tangshan 063009, Hebei Province, China; Hebei Key Laboratory of Data Science & Application, Tangshan 063009, Hebei Province, China)
Source
《应用科学学报》
CAS
CSCD
PKU Core (北大核心)
2018, No. 3, pp. 524-534 (11 pages)
Journal of Applied Sciences