
Density Based Clustering on Large Scale Spatial Data Using Resilient Distributed Dataset
Abstract: To quickly mine the aggregation characteristics of large-scale spatial data, this paper proposes PClusterdp, a parallel density-based clustering method that builds on the cluster_dp density clustering algorithm and on resilient distributed datasets (RDDs). First, an RDD partitioning method is designed to balance the workload: based on how the data are distributed in space, it automatically divides the space into grid cells and assigns data to them so that each cell holds a roughly equal number of data objects, which balances the load across compute nodes. Second, a definition of local density suited to parallel computation is proposed, and the way cluster centers are computed is improved, removing the original algorithm's need to identify cluster centers by drawing a decision graph. Finally, optimization strategies such as merging clusters within and across grid cells combine the partial results into the final clustering and make fast processing of large-scale spatial data possible. The algorithm is implemented on the Spark data processing platform. Experimental results show that the method clusters large-scale spatial data quickly and effectively, with higher accuracy and better system performance than traditional density clustering methods.
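The abstract does not give PClusterdp's formulas, so the following is only a minimal serial sketch of the density-peaks quantities that cluster_dp is generally understood to rely on: a cutoff-based local density `rho` and, for each point, the distance `delta` to its nearest denser point. The function name `cluster_dp`, the cutoff `d_c`, the parameter `k`, and the choice of the `k` largest `rho * delta` products as centers are all assumptions made for illustration. They are a common stand-in for reading a decision graph, not the paper's own center rule, and nothing here reproduces its grid partitioning or cluster merging.

```python
import math

def cluster_dp(points, d_c, k):
    """Serial density-peaks sketch (illustrative; not the paper's PClusterdp).

    points: list of coordinate tuples; d_c: cutoff distance (assumed parameter);
    k: number of centers to pick (assumed parameter).
    Returns (rho, delta, labels).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points strictly closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor: use its largest distance.
            delta[i] = max(dist[i])
        else:
            # Nearest point among those that come earlier in the density order.
            j = min(order[:rank], key=lambda c: dist[i][c])
            delta[i], parent[i] = dist[i][j], j

    # Assumed heuristic: the k largest rho * delta products are the centers.
    rank_of = {i: r for r, i in enumerate(order)}
    centers = sorted(range(n),
                     key=lambda i: (-rho[i] * delta[i], rank_of[i]))[:k]

    labels = [-1] * n
    for label, c in enumerate(centers):
        labels[c] = label
    # Every other point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return rho, delta, labels
```

In a partitioned setting, a routine like this would run inside each grid cell, with a separate step reconciling clusters that span cell boundaries; how the paper does that is not described in the abstract.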
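Source: Journal of Hunan University: Natural Sciences (《湖南大学学报(自然科学版)》), 2015, No. 8, pp. 116-124 (9 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (61304199); Development Fund of the Hunan Provincial Key Laboratory of Special Road Engineering, Changsha University of Science and Technology.
Keywords: spatial data; clustering algorithm; resilient distributed dataset; Spark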

