Abstract
Traditional clustering internal indexes rely on distance-based similarity measures, so their performance degrades as data dimensionality increases. To address this problem, a clustering internal index based on Shared Nearest-Neighbor (SNN) similarity is proposed. First, SNN similarity and the k-Nearest Neighbor (kNN) method are used to estimate the density of the data points and to construct a density-involved SNN similarity graph. Then, on this graph, a maximum flow algorithm is used to compute intra-cluster compactness and inter-cluster separation, and the two are combined into the clustering internal index. Experiments on artificial and real datasets show that, compared with nine traditional distance-based clustering internal indexes, the proposed index identifies the optimal partition of a dataset and predicts the optimal number of clusters more accurately. The proposed index therefore handles complex cluster structures and high-dimensional data better than the other internal indexes in the comparison.
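The first step named in the abstract, building an SNN similarity over kNN lists and deriving a density from it, can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: SNN similarity is taken as the count of shared k-nearest neighbors, and density as the sum of a point's SNN similarity to its own neighbors. The function names `knn_sets`, `snn_similarity`, and `snn_density` are hypothetical, and the maximum flow step is not covered.

```python
from math import dist  # Euclidean distance, Python 3.8+


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors):
    """Assumed SNN similarity: number of k-nearest neighbors two points share."""
    n = len(neighbors)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = len(neighbors[i] & neighbors[j])
    return sim


def snn_density(sim, neighbors):
    """Assumed density: sum of SNN similarity between a point and its own kNN."""
    return [sum(sim[i][j] for j in neighbors[i]) for i in range(len(sim))]


# Two well-separated square clusters of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
neighbors = knn_sets(points, k=3)
sim = snn_similarity(neighbors)
density = snn_density(sim, neighbors)
```

In this toy data, points in the same cluster share neighbors while points in different clusters share none, so the similarity depends on neighborhood overlap rather than on raw distances, which is the motivation the abstract gives for high-dimensional data.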
Authors
ZHANG Longyi (张龙义)
ZHONG Caiming (钟才明)
ZHANG Longyi; ZHONG Caiming (College of Information Science and Engineering, Ningbo University, Ningbo Zhejiang 315210, China; College of Science and Technology, Ningbo University, Ningbo Zhejiang 315210, China)
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2021, Issue S01, pp. 93-100 (8 pages)
Funding
General Program of the National Natural Science Foundation of China (61976134).
Keywords
clustering internal index
clustering
Shared Nearest-Neighbor (SNN) similarity
curse of dimensionality
validity index