Spark Parallelization Improved SDKB-DBSCAN Clustering Algorithm
Abstract DBSCAN is a density-based clustering algorithm that can find clusters of arbitrary shape in data sets containing noise points, and it is widely used. However, it has several shortcomings: large-scale disk I/O slows computation, and clusters of uneven density, together with thresholds chosen by manual intervention, lead to clustering deviations. To address these, this paper proposes SDKB-DBSCAN (Spark Density Division Kernel Density Estimation Boundary Strategy-Density-based Spatial Clustering of Applications with Noise), an improved algorithm parallelized with Spark in-memory iteration. The design combines the Spark cache mechanism with irregular dynamic partitioning, boundary merging, and parallelized kernel density estimation. Experiments show that the improved algorithm is generally applicable to clusters of different shapes and to larger-scale data, with some improvement in accuracy and computation speed.
Authors SHI Aiwu (史爱武); YIN Jie (尹杰); FAN Ping (范平) (School of Mathematics and Computer, Wuhan Textile University, Wuhan 430000; School of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437000)
Source Modern Computer (《现代计算机》), 2021, No. 14, pp. 14-20, 37 (8 pages)
Funding Natural Science Foundation of Hubei Province, Youth Project (No. 2018CFB109).
Keywords DBSCAN algorithm; Spark parallelization; dynamic partitioning; kernel density estimation; caching mechanism
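
For readers unfamiliar with the baseline, the following is a minimal single-machine sketch of plain DBSCAN, not the SDKB-DBSCAN algorithm of the paper and without any Spark parallelization, partitioning, or kernel density estimation. The function names, the sample points, and the parameter values are illustrative assumptions. It is included only to show the two thresholds (`eps` and `min_pts`) that the abstract says are normally set by manual intervention, and how strongly the result can depend on them.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one label per point, NOISE (-1) for noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Illustrative data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]

print(dbscan(points, eps=1.5, min_pts=3))  # two clusters plus one noise point
print(dbscan(points, eps=0.5, min_pts=3))  # eps too small: every point is noise
```

With the same data, changing only `eps` turns two clean clusters into all noise, which is the kind of threshold sensitivity the abstract cites as motivation for estimating parameters from the data instead.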
