Abstract
To address the instability and low accuracy that result when the traditional K-means algorithm selects its initial cluster centers at random, a new method for determining the initial cluster centers is proposed based on the Quasi-Monte Carlo (QMC) method. The low-discrepancy property of QMC sequences is used to sample and estimate the distribution of the data over the whole sample space. Each QMC point is weighted according to the mean k-nearest-neighbor distance (k-distance), and weighted K-means clustering of the QMC points yields the initial cluster centers. The computational complexity of the algorithm is O(max(d, n) log n), where d and n denote the dimensionality and the number of samples, respectively. Simulation experiments on artificial and real-world data sets indicate that the algorithm selects better initial cluster centers, closer to the true ones, which reduces the number of K-means iterations and improves clustering accuracy, robustness, and convergence speed.
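The pipeline described in the abstract (QMC sampling, k-NN-distance weighting, weighted K-means on the QMC points) can be illustrated with a minimal pure-Python sketch. Several choices here are assumptions, not details taken from the paper: the Halton sequence stands in for the unspecified QMC sequence, the weight is taken as the inverse of the mean k-NN distance raised to the power d (in the spirit of a k-NN density estimate), and the weighted K-means is started greedily from the heaviest points. All function names are hypothetical. The brute-force neighbor search is for clarity only and does not reach the O(max(d, n) log n) complexity stated above.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # one base per dimension (d <= 10 here)

def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def halton_points(m, lo, hi):
    """First m Halton points, scaled into the bounding box [lo, hi]."""
    d = len(lo)
    return [[lo[j] + (hi[j] - lo[j]) * radical_inverse(i, PRIMES[j])
             for j in range(d)] for i in range(1, m + 1)]

def knn_weight(q, data, k, eps=1e-12):
    """Assumed weighting: (mean distance from q to its k nearest samples) ** -d."""
    nearest = sorted(math.dist(q, x) for x in data)[:k]
    mean_dist = sum(nearest) / len(nearest)
    return 1.0 / (mean_dist + eps) ** len(q)

def weighted_kmeans(points, weights, K, iters=50):
    """Lloyd iterations in which each point pulls its center in proportion to its weight."""
    idx = range(len(points))
    # Assumed deterministic start: heaviest point, then greedily heavy and well separated.
    centers = [points[max(idx, key=lambda i: weights[i])]]
    while len(centers) < K:
        best = max(idx, key=lambda i: weights[i] *
                   min(math.dist(points[i], c) for c in centers))
        centers.append(points[best])
    d = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * d for _ in range(K)]
        total = [0.0] * K
        for p, w in zip(points, weights):
            c = min(range(K), key=lambda j: math.dist(p, centers[j]))
            total[c] += w
            for j in range(d):
                sums[c][j] += w * p[j]
        # Keep the old center if a cluster received no weight.
        new = [[s / total[c] for s in sums[c]] if total[c] > 0 else centers[c]
               for c in range(K)]
        if new == centers:
            break
        centers = new
    return centers

def qmc_initial_centers(data, K, m=256, k=5):
    """Initial K-means centers from weighted clustering of m QMC points."""
    d = len(data[0])
    lo = [min(x[j] for x in data) for j in range(d)]
    hi = [max(x[j] for x in data) for j in range(d)]
    pts = halton_points(m, lo, hi)
    w = [knn_weight(q, data, k) for q in pts]
    return weighted_kmeans(pts, w, K)
```

The returned centers would then be passed as the starting points of an ordinary K-means run on the full data set. A k-d tree or similar index for the neighbor search would be needed to approach the complexity reported in the abstract.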
Source
《济南大学学报(自然科学版)》
Peking University Core Journal (北大核心)
2017, Issue 1, pp. 35-41 (7 pages)
Journal of University of Jinan (Science and Technology)
Funding
Zhejiang Provincial Natural Science Foundation project (LY14F030020)