Abstract
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) struggles to reflect the differing contributions of individual dimensions to clustering, and its accuracy depends on manually set distance thresholds. To address these problems, this paper proposes SMP-SDBSCAN, a semi-supervised DBSCAN algorithm based on self-learning of metric parameters. A distance-parameter training method based on the logistic regression model uses a small amount of labeled data to train the clustering contribution weight of each dimension. A mechanism for computing cluster parameters then sets the average inter-cluster distance and neighborhood density of the labeled data clusters as the clustering parameters, improving the adaptability of density clustering to the data set. Experimental results show that the proposed method selects reasonable clustering parameters and effectively improves the clustering accuracy of density-based clustering.
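The two steps named in the abstract (learning per-dimension weights with logistic regression, then deriving the density parameters from the labeled clusters) can be illustrated with a minimal standard-library sketch. The abstract gives no formulas, so everything here is an assumption rather than the authors' method: the pairwise "different cluster" target, the clipping and normalization of the weights, the choice of the mean same-label distance as the radius, and the names `pair_features`, `fit_logistic`, `learn_weights`, and `estimate_params` are all hypothetical.

```python
import math
from itertools import combinations

def pair_features(X, y):
    """Per-dimension squared differences for each labeled pair; target 1 = different clusters."""
    feats, targets = [], []
    for i, j in combinations(range(len(X)), 2):
        feats.append([(a - b) ** 2 for a, b in zip(X[i], X[j])])
        targets.append(1.0 if y[i] != y[j] else 0.0)
    return feats, targets

def fit_logistic(feats, targets, lr=0.1, epochs=2000):
    """Logistic regression fitted by plain batch gradient descent (no regularization)."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for f, t in zip(feats, targets):
            z = b + sum(wk * fk for wk, fk in zip(w, f))
            err = 1.0 / (1.0 + math.exp(-z)) - t
            for k in range(dim):
                gw[k] += err * f[k]
            gb += err
        w = [wk - lr * g / n for wk, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def learn_weights(X, y):
    """Assumed post-processing: clip negative coefficients to zero, then normalize to sum 1."""
    feats, targets = pair_features(X, y)
    w, _ = fit_logistic(feats, targets)
    clipped = [max(wk, 0.0) for wk in w]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(w)] * len(w)  # fall back to uniform weights
    return [wk / total for wk in clipped]

def weighted_dist(a, b, w):
    """Weighted Euclidean distance between two points."""
    return math.sqrt(sum(wk * (ak - bk) ** 2 for wk, ak, bk in zip(w, a, b)))

def estimate_params(X, y, w):
    """Assumed rule: eps = mean distance between same-label points,
    min_pts = mean number of labeled points (self included) within eps."""
    same = [weighted_dist(X[i], X[j], w)
            for i, j in combinations(range(len(X)), 2) if y[i] == y[j]]
    eps = sum(same) / len(same)
    counts = [sum(1 for q in X if weighted_dist(p, q, w) <= eps) for p in X]
    min_pts = max(1, round(sum(counts) / len(counts)))
    return eps, min_pts

# Toy labeled sample: dimension 0 separates the two clusters, dimension 1 is noise.
X = [(0.0, 0.3), (0.1, 0.9), (0.05, 0.1), (1.0, 0.2), (0.9, 0.8), (0.95, 0.5)]
y = [0, 0, 0, 1, 1, 1]

weights = learn_weights(X, y)
eps, min_pts = estimate_params(X, y, weights)
print("weights:", weights)
print("eps:", eps, "min_pts:", min_pts)
```

Under these assumptions, the learned weights and the pair (`eps`, `min_pts`) could be handed to any DBSCAN implementation: scaling each feature by the square root of its weight makes the ordinary Euclidean distance equal to the weighted distance used here.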
Authors
袁国泉
赵新建
张颂
陈石
徐晨维
YUAN Guo-quan; ZHAO Xin-jian; ZHANG Song; CHEN Shi; XU Chen-wei (Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210000, China)
Source
《信息技术》
2024, No. 11, pp. 77-83, 91 (8 pages)
Information Technology
Funding
Science and Technology Project of State Grid Jiangsu Electric Power Co., Ltd. (J2022109).
Keywords
density clustering
distance measurement
logistic regression
semi-supervised learning
self-learning