Abstract
The classic distributed k-means clustering algorithm selects its initial cluster centers at random and then iterates many times, which tends to cause low clustering efficiency, heavy network traffic, and unstable clustering results. To address these problems, an improved distributed k-means clustering algorithm is proposed. The algorithm partitions the data set and takes the k data blocks with the densest attributes as the initial cluster centers, so that the centers are representative; this reduces the number of iterations and improves clustering efficiency. Experiments on the Hadoop distributed platform show that the improved algorithm reduces both the number of iterations and the convergence time.
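The abstract does not specify how the data set is partitioned or how block density is measured, so the following is only a minimal single-machine sketch of the general idea. It assumes that a "block" is an equal-width grid cell, that density is the number of points in the cell, and that each chosen block contributes its mean as an initial center. The function name `densest_block_centers` and the parameter `bins` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def densest_block_centers(points, k, bins=4):
    """Return up to k initial centers: the means of the k most populated grid cells.

    Assumption (not from the paper): blocks are equal-width grid cells and
    density is the point count per cell.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        # Map a point to the index of the grid cell that contains it.
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            if span == 0:
                idx.append(0)
            else:
                i = int((p[d] - lows[d]) / span * bins)
                idx.append(min(i, bins - 1))  # the maximum value falls in the last cell
        return tuple(idx)

    # Group points by grid cell.
    blocks = defaultdict(list)
    for p in points:
        blocks[cell_of(p)].append(p)

    # Rank cells by point count (descending); ties are broken by cell index.
    ranked = sorted(blocks.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    centers = []
    for _, members in ranked[:k]:
        n = len(members)
        centers.append(tuple(sum(m[d] for m in members) / n for d in range(dims)))
    return centers


sample = [(0, 0), (0.1, 0.1), (0.2, 0.0), (10, 10), (9.9, 9.8), (5, 5)]
centers = densest_block_centers(sample, k=2)
```

If fewer than k cells are non-empty, the sketch returns fewer than k centers. In a MapReduce setting the per-cell counting could plausibly be done in the map phase and the top-k selection in a reduce phase, but that split is an assumption here rather than the published design.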
Source
《广西大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2014, Issue 5, pp. 1060-1065 (6 pages)
Journal of Guangxi University (Natural Science Edition)
Funding
Guangxi Natural Science Foundation (Grant No. 2013GXNSFAA253003)