Abstract
To improve processing performance on large-scale data sets, a big data clustering method based on an improved K-means++ algorithm and the density-based spatial clustering of applications with noise (DBSCAN) algorithm is proposed. First, K-means++ is combined with a local search strategy to produce an initial partitioning of the data set, and DBSCAN is then run independently within each partition. The improved K-means++ algorithm raises the quality of data preprocessing, while partitioning the data and clustering the partitions in parallel significantly reduces the computational burden of DBSCAN and speeds up processing. Finally, a two-stage pruning strategy is used to merge border clusters efficiently. Experimental results show that the proposed method greatly reduces the execution time of DBSCAN while yielding clustering quality very close to that of the original DBSCAN algorithm. On the Bitcoin data set from the UCI repository, its clustering efficiency is more than 10 times higher than that of the other compared methods, achieving an optimal balance between processing time and clustering quality.
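The partition, cluster, and merge pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain K-means++ seeding with Lloyd iterations in place of the improved K-means++ with local search, a textbook DBSCAN, and a brute-force merge of clusters whose core points lie within eps of each other across partitions in place of the two-stage pruning strategy. All function and parameter names (`kmeanspp_partition`, `dbscan`, `partitioned_dbscan`, `iters`, `seed`) are hypothetical, and partitions are processed sequentially here rather than in parallel.

```python
import math
import random


def kmeanspp_partition(points, k, iters=10, seed=0):
    """Plain k-means++ seeding plus Lloyd iterations (assumption: no local search).
    Returns one partition label per point."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Pick the next center with probability proportional to squared distance.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2)[0])
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def dbscan(points, idx, eps, min_pts):
    """Textbook DBSCAN restricted to the point indices in idx.
    Returns ({index: local cluster id, -1 for noise}, set of core indices)."""
    neigh = {i: [j for j in idx if math.dist(points[i], points[j]) <= eps] for i in idx}
    core = {i for i in idx if len(neigh[i]) >= min_pts}
    labels = {i: -1 for i in idx}
    cid = 0
    for i in idx:
        if i not in core or labels[i] != -1:
            continue
        labels[i] = cid
        stack = [i]  # only core points are ever pushed, so only they expand
        while stack:
            q = stack.pop()
            for j in neigh[q]:
                if labels[j] == -1:
                    labels[j] = cid
                    if j in core:
                        stack.append(j)
        cid += 1
    return labels, core


def partitioned_dbscan(points, k, eps, min_pts):
    """Partition with k-means++, run DBSCAN per partition, then merge border clusters."""
    part = kmeanspp_partition(points, k)
    final = [-1] * len(points)
    cores = []  # (point index, partition id) for every core point
    next_id = 0
    for j in range(k):
        idx = [i for i, p in enumerate(part) if p == j]
        labels, core = dbscan(points, idx, eps, min_pts)
        for i, l in labels.items():
            if l != -1:
                final[i] = next_id + l  # make cluster ids globally unique
        cores += [(i, j) for i in core]
        next_id += max(labels.values(), default=-1) + 1

    # Naive merge (assumption: no pruning): union clusters from different
    # partitions when two of their core points are within eps of each other.
    parent = list(range(next_id))

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, pa in cores:
        for b, pb in cores:
            if pa < pb and math.dist(points[a], points[b]) <= eps:
                parent[find(final[a])] = find(final[b])
    return [find(c) if c != -1 else -1 for c in final]
```

Calling `partitioned_dbscan(points, k, eps, min_pts)` on a list of coordinate tuples returns one cluster id per point, with -1 marking noise. Because each partition only sees its own points, a point near a partition boundary may lose core status that it would have under global DBSCAN, which is one reason results can differ slightly from the original algorithm.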
Authors
张玉琴
梁莉
张建亮
冯向东
Zhang Yuqin; Liang Li; Zhang Jianliang; Feng Xiangdong (The Engineering & Technical College of Chengdu University of Technology, Leshan 614000, China; School of Mathematics and Physics, Chengdu University of Technology, Chengdu 610059, China)
Source
《国外电子测量技术》
Peking University Core Journal
2022, No. 9, pp. 40-46 (7 pages)
Foreign Electronic Measurement Technology
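Funding
Sichuan Province Natural Science Key Project (18ZA0075, 18ZA0073)
Leshan Science and Technology Bureau Key Research Project (21GZD015)
Fund of the Engineering & Technical College of Chengdu University of Technology (C122019027).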