Abstract
DBSCAN is a density-based clustering algorithm that can find clusters of arbitrary shape in data sets containing noise points, and it is widely used. However, it has several shortcomings: large-scale disk I/O slows computation, and both non-uniform cluster density and manually chosen thresholds cause clustering deviations. To address these, the paper proposes SDKB-DBSCAN (Spark Density Division Kernel Density Estimation Boundary Strategy-Density-based Spatial Clustering of Applications with Noise), an improved algorithm parallelized with Spark in-memory iteration. The design combines Spark's caching mechanism with irregular dynamic partitioning, boundary merging, and parallelized kernel density estimation. Experiments show that the improved algorithm is generally applicable to clusters of different shapes and to larger-scale data, with some improvement in accuracy and computation speed.
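For readers unfamiliar with the baseline, the following is a minimal sketch of classic single-machine DBSCAN, not the SDKB-DBSCAN method the paper proposes. It assumes Euclidean distance and a brute-force neighborhood search. The names `dbscan`, `eps`, `min_pts`, and `NOISE` are illustrative and do not come from the paper. The two manually chosen thresholds, `eps` and `min_pts`, are the parameters the abstract says require manual intervention.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # illustrative label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Baseline DBSCAN only: no Spark, no partitioning, and no kernel
    density estimation. eps and min_pts are supplied by hand.
    """
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force eps-neighborhood query; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels
```

Calling `dbscan(points, eps, min_pts)` on a list of coordinate tuples returns one label per point. Every neighborhood query scans the full data set, so the sketch runs in quadratic time, which is one reason large inputs motivate partitioned, parallel variants.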
Authors
SHI Aiwu (史爱武)
YIN Jie (尹杰)
FAN Ping (范平)
Affiliations: School of Mathematics and Computer, Wuhan Textile University, Wuhan 430000; School of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437000
Source
Modern Computer (《现代计算机》)
2021, No. 14, pp. 14-20, 37 (8 pages in total)
Funding
Hubei Provincial Natural Science Foundation, Youth Project (No. 2018CFB109).