Abstract
K-means clustering requires the number of clusters K to be fixed in advance. To address this problem, a new algorithm for determining the optimal number of clusters is proposed, based on granular computing and the improved global K-means clustering algorithm. Granular computing is introduced into the sample similarity function to define a new similarity between samples, and this similarity is used with fuzzy equivalence clustering to determine the largest possible number of clusters, Kmax, for a dataset. With Kmax as the upper bound of the search, the improved global K-means algorithm is combined with the BWP (Between-Within Proportion) cluster validity index: the clustering results obtained for different numbers of clusters are scored with BWP, and the optimal number of clusters is chosen according to these scores. The algorithm is tested on datasets from the UCI Machine Learning Repository and on randomly generated synthetic datasets containing noisy data, and is compared with existing methods for choosing the number of clusters for K-means. The experimental results show that the algorithm determines the optimal number of clusters effectively and is suitable for large datasets, but that it is sensitive to noisy points.
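The search described in the abstract (score each candidate K up to Kmax with BWP and keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the BWP formula, so the sketch assumes the commonly cited form, (b - w) / (b + w) per sample, with w the mean squared distance to the sample's own cluster and b the smallest mean squared distance to any other cluster. Plain K-means with farthest-first initialization stands in for the improved global K-means, `k_max` is passed in rather than derived by the granular-computing and fuzzy-equivalence step, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding: repeatedly add the point farthest from its nearest center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means; a stand-in, NOT the improved global K-means of the paper."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the old center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def avg_bwp(points, labels):
    """Average BWP over all samples, using the ASSUMED definition (b - w) / (b + w)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return float("-inf")  # BWP needs at least two clusters
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # assumption: a singleton cluster contributes a score of 0
        # p itself adds 0 to the sum, so dividing by len(own) - 1 averages over the others
        w = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(sq_dist(p, q) for q in other) / len(other)
                for c, other in clusters.items() if c != l)
        if b + w > 0:
            total += (b - w) / (b + w)
    return total / len(points)

def best_k(points, k_max):
    """Score K = 2..k_max with average BWP and return (best K, all scores)."""
    scores = {k: avg_bwp(points, kmeans(points, k)) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores
```

Calling `best_k(points, k_max)` on a list of coordinate tuples returns the K with the highest average BWP together with the score for every candidate. A tighter `k_max` shortens the search, which is the role Kmax plays in the abstract.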
Source
Journal of Shaanxi Normal University (Natural Science Edition)
CAS
CSCD
PKU Core (Peking University Core Journals)
2012, No. 1, pp. 13-18 (6 pages)
Funding
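Natural Science Foundation of Shaanxi Province (2010JM3004)
Key Projects of the Fundamental Research Funds for the Central Universities (GK200901006, GK201001003)
Innovation Fund for Graduate Education of Shaanxi Normal University (2011CX029)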