Research on the best clustering number based on K-means algorithm (cited by: 14)
Abstract: Clustering algorithms usually require the final number of clusters to be set in advance. To address this, the paper proposes a new cluster validity index based on the intra-class compactness and inter-class dispersion of all samples in a class; the index can effectively determine the optimal number of clusters for a data set. The K-means algorithm is used while searching for the optimal number of clusters. To overcome the weakness that K-means selects its initial cluster centers at random, the paper measures sample similarity by Euclidean distance and, based on sample variance, selects the K samples with the smallest variance as the initial cluster centers. This keeps noise points from becoming initial centers and places the initial centers in dense regions of the sample set, so the K-means results are stable and effective. The optimized K-means algorithm and the new validity index are then used together to determine the number of clusters in a data set. Tests on UCI data sets and artificially simulated data sets show that, for spherical sample sets with few noise points, the proposed algorithm finds the optimal number of classes effectively and runs fast.
Authors: WANG Yan'e; LIANG Yan; SI Haifeng; DING Xin'an (School of Technology, Xi'an Siyuan University, Xi'an 710038, China)
Source: Electronic Design Engineering (《电子设计工程》), 2020, No. 24, pp. 52-56 (5 pages)
Funding: Scientific Research Program of the Shaanxi Provincial Department of Education (18JK1100); Shaanxi Higher Education Scientific Research Project (XGH19236)
Keywords: K-means; cluster number; validity index; cluster analysis
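
The abstract describes two steps: seeding K-means with the K lowest-variance samples, and scanning candidate values of K with a validity index. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not define a sample's "variance", so the code assumes it is the variance of that sample's Euclidean distances to all samples. The paper's validity index is also not given here, so `stand_in_index` is a hypothetical substitute (mean distance to the assigned center divided by the smallest gap between centers, lower is better). All function names are invented for this example.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distance from every row of A to every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def min_variance_centers(X, k):
    """Return the k samples with the smallest 'variance'.

    Assumption (not from the abstract): a sample's variance is the
    variance of its Euclidean distances to all samples in X.
    """
    X = np.asarray(X, dtype=float)
    variances = pairwise_distances(X, X).var(axis=1)
    order = np.argsort(variances, kind="stable")
    return X[order[:k]].copy()


def kmeans(X, centers, max_iter=100):
    """Plain Lloyd iterations from the given initial centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        labels = pairwise_distances(X, centers).argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = pairwise_distances(X, centers).argmin(axis=1)
    return labels, centers


def stand_in_index(X, labels, centers):
    """Hypothetical index: mean distance to own center / smallest center gap.

    Lower is better. This is NOT the index proposed in the paper.
    """
    X = np.asarray(X, dtype=float)
    within = np.linalg.norm(X - centers[labels], axis=1).mean()
    gaps = pairwise_distances(centers, centers)
    between = gaps[np.triu_indices(len(centers), k=1)].min()
    return within / between


def best_k(X, k_max):
    """Score k = 2..k_max and return the k with the lowest stand-in index."""
    scores = {}
    for k in range(2, k_max + 1):
        labels, centers = kmeans(X, min_variance_centers(X, k))
        scores[k] = stand_in_index(X, labels, centers)
    return min(scores, key=scores.get), scores


# Synthetic spherical blobs (made-up data, for illustration only).
rng = np.random.default_rng(0)
blobs = np.vstack(
    [rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [6, 0], [3, 5])]
)
k, scores = best_k(blobs, k_max=6)
print(k, scores)
```

Because the seeding is deterministic, repeated runs on the same data give the same clustering, which matches the stability the abstract claims. Taking the K lowest-variance samples literally can place several seeds in the same dense region, so the chosen K depends heavily on how the variance and the index are actually defined in the paper.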
