Abstract
Clustering is an important data mining technique. Its goal is to partition a data set into classes, or clusters, according to some criterion, so that objects within a cluster are as similar as possible and objects in different clusters are as dissimilar as possible. Similarity measurement is therefore a key step in cluster analysis. Traditional clustering algorithms that measure similarity with Euclidean distance do not reflect the global consistency of non-convex data sets well. To address this, the paper proposes a distance measure, built on Euclidean distance, that uses density and nearest neighbors to construct nearest-neighbor chains and computes the distance between two points along the manifold. For data sets with non-convex (manifold) structure, the measure reflects both local and global consistency. To verify its effectiveness, clustering results obtained with different distance measures are compared on two-dimensional and three-dimensional data sets using the K-medoids and Affinity Propagation clustering algorithms, and good experimental results are reported. Finally, some limitations of the method and plans for follow-up research are summarized.
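The gap between Euclidean distance and distance along a manifold can be illustrated with a small sketch. The abstract does not specify the paper's density-based neighbor-chain construction, so the code below swaps in a simpler, well-known stand-in: a plain k-nearest-neighbor graph with shortest-path (Isomap-style) distances. The function names `knn_graph` and `graph_distances`, the semicircle test data, and the choice `k=2` are all assumptions made for illustration only.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph


def graph_distances(graph, source):
    """Dijkstra's algorithm: shortest path length from source to every node."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Illustrative non-convex data: 21 points on a unit semicircle (an assumption,
# not a data set from the paper).
points = [
    (math.cos(math.pi * i / 20), math.sin(math.pi * i / 20)) for i in range(21)
]

euclid = math.dist(points[0], points[-1])  # straight line across the gap
geodesic = graph_distances(knn_graph(points, k=2), 0)[20]  # along the arc

print(f"Euclidean: {euclid:.3f}, along neighbor graph: {geodesic:.3f}")
```

For the two endpoints of the arc, the straight-line distance cuts across the empty interior, while the path through neighboring points follows the curve and is therefore longer. A pairwise matrix of such path distances could then be passed to a clustering algorithm that accepts precomputed dissimilarities, such as K-medoids.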
Authors
LIU Jia-wei (刘佳伟)
TANG Jin-ping (唐锦萍)
School of Data Science and Technology, Heilongjiang University, Harbin, Heilongjiang 150080, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal (北大核心)
2023, Issue 8, pp. 382-388, 420 (8 pages in total)
Funding
National Natural Science Foundation of China (11701159).
Keywords
Clustering
Distance
Density
Manifold
Non-convex dataset
Neighbors