Abstract
DBSCAN is a density-based clustering algorithm that can cluster data of arbitrary shape and identify noise points. To reduce manual intervention in setting the input parameters Eps (neighborhood radius) and MinPts (minimum number of points), a new method for computing the Eps parameter is proposed. In addition, to address the performance problems of the traditional single-machine DBSCAN algorithm in big-data environments, DBSCAN is parallelized on the Spark framework. Experimental results show that the improved DBSCAN algorithm achieves high accuracy and stability, and that the parallel implementation has good parallel performance and is suitable for clustering massive data sets.
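For readers unfamiliar with the two parameters named in the abstract, the following is a minimal sketch of the classic single-machine DBSCAN procedure. It is not the paper's improved Eps computation and not its Spark implementation; the function names `region_query` and `dbscan`, the Euclidean distance, and the sample points are assumptions made here purely for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Classic DBSCAN: returns one label per point (cluster id >= 0, or NOISE)."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may later be relabeled as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but does not expand
                continue
            if labels[j] is not UNVISITED:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
        cluster_id += 1
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this sketch, Eps decides which points count as neighbors and MinPts decides how many neighbors make a point a core point, which is why both values normally have to be tuned by hand. Each `region_query` call scans every point, so the whole run costs quadratic time on a single machine.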
Authors
SONG Dongfei (宋董飞)
XU Hua (徐华)
School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2018, Issue 24, pp. 52-56, 122 (6 pages in total)
Funding
Natural Science Foundation of Jiangsu Province (No. BK20140165)
Ministry of Education - New H3C Group "Cloud-Data Fusion" Fund (No. 2017A13055)