摘要
面向位置大数据聚类,提出了一种简单但高效的快速密度聚类算法CBSCAN,以快速发现位置大数据中任意形状的聚类簇模式和噪声.首先,定义了Cell网格概念,并提出了基于Cell的距离分析理论,利用该距离分析,无需距离计算,可快速确定高密度区域的核心点和密度相连关系;其次,给出了网格簇定义,将基于位置点的密度簇映射成基于网格的密度簇,利用排他网格与相邻网格的密度关系,可快速确定网格簇的包含网格;第三,利用基于Cell的距离分析理论和网格簇概念,实现了一个快速密度聚类算法,将DBSCAN基于数据点的密度扩展聚类转换成基于Cell的密度扩展聚类,极大地减少高密度区域的距离计算,利用位置数据的内在特性提高了聚类效率;最后,在基准测试数据上验证了所提算法的聚类效果,在位置大数据上的实验结果统计显示,与DBSCAN、PR-Tree索引和Grid索引优化的DBSCAN相比,CBSCAN分别平均提升了525倍、30倍和11倍效率.
This paper proposes a simple but efficient density-based clustering, named CBSCAN, to fast discover cluster patterns with arbitrary shapes and noises from location big data effectively. Firstly, the notion of Cell is defined and a distance analysis principle based on Cell is proposed to quickly find core points in high density areas and density relationships with other points without distance computing. Secondly, a Cell-based cluster that maps point-based density cluster to grid-based density cluster is presented. By leveraging exclusion grids and relationships with their adjacent grids, all inclusion grids of Cell-based cluster can be rapidly determined. Furthermore a fast density-based algorithm based on the distance analysis principle and Cell-base cluster is implemented to transform DBSCAN of point-based expansion to Cell-based expansion clustering. The proposed algorithm improves clustering efficiency significantly by using inherent property of location data to reduce huge number of distance calculations. Finally, comprehensive experiments on benchmark datasets demonstrate the clustering effectiveness of the proposed algorithm. Experimental results on massive-scale real and synthetic location datasets show that CBSCAN improves 525 fold, 30 fold and 11 fold of efficiency compared with DBSCAN, DBSCAN with PR-Tree and Grid index optimization respectively.
作者
于彦伟
贾召飞
曹磊
赵金东
刘兆伟
刘惊雷
YU Yan-Wei;JIA Zhao-Fei;CAO Lei;ZHAO Jin-Dong;LIU Zhao-Wei;LIU Jing-Lei(School of Computer and Control Engineering,Yantai University,Yantai 264005,China;Computer Science and Artificial Intelligence Laboratory,Massachusetts institute of Technology,Cambridge 02139,USA)
出处
《软件学报》
EI
CSCD
北大核心
2018年第8期2470-2484,共15页
Journal of Software
基金
国家自然科学基金(61403328
61773331
61572419
61502410)
山东省重点研发计划(2015GSF115009)
山东省自然科学基金(ZR2013FM011
ZR2013FQ023
ZR2014FQ016)~~
关键词
聚类分析
密度聚类
位置大数据
Cell网格
网格簇
clustering analysis
density-based clustering
location big data
Cell grid
cell-based cluster