Abstract
To cluster data quickly while making effective use of known information, a Grid-based Semi-supervised Density Peak Clustering (GS-DPC) algorithm is proposed. The algorithm partitions the dataset with a statistical information grid, takes the number of data points falling in each grid cell as that cell's local density, and computes a representative point for each cell. Cluster centers are then determined from the local density and relative distance values, and a pairwise constraint set guides the clustering process to produce the final result. Experimental results show that the average running time of GS-DPC is 32 percentage points lower than that of the Density Peak Clustering (DPC) algorithm. Across six datasets, GS-DPC achieves an average accuracy (ACC) of about 0.84, an average Adjusted Mutual Information (AMI) of about 0.68, and an average Adjusted Rand Index (ARI) of about 0.67, indicating that it can cluster data quickly and effectively with good results.
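The grid-density and relative-distance steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `grid_density_peaks`, the parameters `cells_per_dim` and `n_centers`, the use of the cell mean as the representative point, and the product of density and relative distance as the center-selection score are all assumptions. The pairwise-constraint step is omitted because the abstract does not say how the constraints are applied.

```python
import numpy as np

def grid_density_peaks(X, cells_per_dim=10, n_centers=2):
    """Illustrative grid-based density peak clustering (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    # Map every point to an integer grid cell index in each dimension.
    idx = np.minimum(((X - lo) / span * cells_per_dim).astype(int),
                     cells_per_dim - 1)
    # Non-empty cells, each point's cell, and the point count per cell.
    cells, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    # Assumption: the representative point of a cell is the mean of its points.
    reps = np.zeros((len(cells), X.shape[1]))
    np.add.at(reps, inverse, X)
    reps /= counts[:, None]
    rho = counts.astype(float)  # local density = number of points in the cell
    dist = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=2)
    # Relative distance: distance to the nearest cell of higher density.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(len(cells))
    parent = np.full(len(cells), -1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    # Assumption: centers are the cells with the largest rho * delta score.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest cell is always kept as a center
    centers = np.argsort(-gamma, kind="stable")[:n_centers]
    cell_labels = np.full(len(cells), -1)
    cell_labels[centers] = np.arange(len(centers))
    # Remaining cells inherit the label of their nearest denser cell.
    for i in order:
        if cell_labels[i] == -1:
            cell_labels[i] = cell_labels[parent[i]]
    return cell_labels[inverse], reps[centers]
```

Because density and distance are computed per non-empty cell rather than per point, the pairwise distance matrix shrinks from the number of points to the number of occupied cells, which is the plausible source of the time savings the abstract reports.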
Authors
YANG Jinrui (杨金瑞); LIU Ji (刘继)
School of Statistics & Data Science, Xinjiang University of Finance & Economics, Urumqi 830012, China; Xinjiang Social & Economic Statistics & Big Data Application Research Center, Xinjiang University of Finance & Economics, Urumqi 830012, China
Source
Software Engineering (《软件工程》)
2024, No. 5, pp. 1-6 (6 pages)
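Funding
National Natural Science Foundation of China (Intelligent Governance of Online Public Opinion in the Big Data Context: Community Building, Co-evolution, and Guidance Mechanisms, No. 72164034).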