
A new algorithm to determine the optimal number of clusters (cited by: 11)
Abstract: The K-means clustering algorithm requires the number of clusters K to be fixed in advance. To address this, the paper introduces granular computing into the sample similarity function, defines a new similarity between samples, and uses fuzzy equivalence clustering to determine Kmax, the largest number of clusters the dataset may contain. With Kmax as the upper bound of the search, an improved global K-means clustering algorithm is run and each clustering result is scored with the BWP (Between-Within Proportion) cluster validity index; the optimal number of clusters is then chosen according to these BWP scores. The method is tested on datasets from the UCI machine learning repository and on randomly generated synthetic datasets containing noisy data, and is compared with existing methods for choosing the number of clusters for K-means. The experiments show that the algorithm determines the optimal number of clusters effectively and is suitable for large datasets. Its disadvantage is that it is sensitive to noisy data.
Source: Journal of Shaanxi Normal University: Natural Science Edition (陕西师范大学学报(自然科学版)); indexed in CAS, CSCD, and the Peking University Core Journals list; 2012, No. 1, pp. 13-18 (6 pages)
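Funding: Natural Science Foundation of Shaanxi Province (2010JM3004); Fundamental Research Funds for the Central Universities, key projects (GK200901006, GK201001003); Shaanxi Normal University Graduate Training Innovation Fund (2011CX029)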
Keywords: information granularity; K-means; global K-means; fuzzy similarity; clustering criterion BWP
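
The search described in the abstract (cluster for each K up to Kmax, score each result with BWP, keep the best K) can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation. The record does not state the BWP formula, so the sketch assumes the commonly cited form (b - w) / (b + w), where w is a sample's mean squared distance to the other members of its own cluster and b is its smallest mean squared distance to any other cluster. Plain Lloyd K-means with farthest-point seeding stands in for the improved global K-means, and `k_max` is passed in directly rather than derived from fuzzy equivalence clustering. The names `sq_dist`, `kmeans`, `avg_bwp`, and `best_k` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain Lloyd K-means with deterministic farthest-point seeding.

    Stand-in only: the paper uses an improved global K-means instead.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def avg_bwp(points, labels, k):
    """Average BWP over all samples (assumed form: (b - w) / (b + w))."""
    clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(k)]
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        others = [c for j, c in enumerate(clusters) if j != l and c]
        if len(own) < 2 or not others:
            continue  # w or b undefined: this sample contributes 0
        w = sum(sq_dist(p, q) for q in own) / (len(own) - 1)  # self term is 0
        b = min(sum(sq_dist(p, q) for q in c) / len(c) for c in others)
        if b + w > 0:
            total += (b - w) / (b + w)
    return total / len(points)


def best_k(points, k_max):
    """Score every K in 2..k_max and return the K with the highest average BWP."""
    scores = {k: avg_bwp(points, kmeans(points, k), k) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
k, scores = best_k(points, k_max=5)
print(k, scores)
```

On this toy data the sketch selects K = 3. Bounding the loop by `k_max` is what keeps the search cheap: every candidate K costs one full clustering run plus a pairwise-distance pass for the BWP score.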
