
Visualized determination mode for clustering quantity of high-dimensional data
Abstract: The classical K-means clustering algorithm requires users to know the number of clusters in advance, and its results are sensitive to how the algorithm is initialized. To address both problems, a comprehensive scheme is proposed that improves the random initial partitioning of K-means and determines the number of clusters visually. First, the data are standardized so that they follow a normal distribution, and principal component analysis (PCA) extracts the most important features to reduce the dimensionality of high-dimensional data. Then, farthest centroid selection and a min-max distance rule replace the random initialization of K-means, which avoids empty clusters and ensures that the data are separable. On this basis, a statistical empirical rule estimates the plausible range for the number of clusters, and the best number within that range is estimated by locating the elbow of the sum-of-squared-error (SSE) curve. Finally, the silhouette coefficients of the individual clusters are computed and compared to evaluate clustering quality, which determines the number of clusters inherent in the data set. Simulation results show that the scheme can visually determine the potential number of clusters in a data set and provides an effective method for high-dimensional data analysis in the era of big data.
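The steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The seeding below is a farthest-first (maximin) scheme, which is only one plausible reading of "farthest centroid selection" and the "min-max distance rule". Standardization and PCA are omitted for brevity, and the candidate range of k is supplied by hand instead of being derived from the paper's statistical empirical rule. All function names and the toy data are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centroids(points, k):
    """Assumed seeding: start from the point farthest from the data mean, then
    repeatedly add the point whose distance to its nearest seed is largest."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centroids = [max(points, key=lambda p: sq_dist(p, mean))]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=100):
    """Lloyd iterations from the deterministic seeds; returns labels and SSE."""
    centroids = farthest_first_centroids(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, sse

def mean_silhouette(points, labels, k):
    """Mean silhouette coefficient over all points (requires k >= 2)."""
    scores = []
    for i, p in enumerate(points):
        dists = [[] for _ in range(k)]
        for j, q in enumerate(points):
            if i != j:
                dists[labels[j]].append(math.sqrt(sq_dist(p, q)))
        own = dists[labels[i]]
        if not own:  # singleton cluster: silhouette is taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d)  # mean distance to the nearest other cluster
                for c, d in enumerate(dists) if c != labels[i] and d)
        m = max(a, b)
        scores.append((b - a) / m if m > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): three well-separated groups of five points each.
offsets = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]

results = {}
for k in range(2, 6):  # candidate range chosen by hand for this sketch
    labels, sse = kmeans(points, k)
    results[k] = (sse, mean_silhouette(points, labels, k))
    print(k, round(sse, 3), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
```

Plotting the SSE values against k would give the elbow curve that the abstract inspects visually. Here the silhouette score is used only as a numeric cross-check on the candidate values of k.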
Authors: HE Xuansen; HE Fan; FAN Yueping; CHEN Hongjun (School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China; College of Information Science and Engineering, Hunan University, Changsha 410082, China; School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China)
Source: Journal of Shenyang Aerospace University, 2024, No. 3, pp. 71-84 (14 pages)
Funding: Special Project in Key Fields for Regular Universities of Guangdong Province (Project No. 2021ZDZX1035).
Keywords: K-means clustering algorithm; principal component analysis; farthest centroid selection; min-max distance rule; statistical empirical rule; elbow method; silhouette analysis
