Scalable Varied Density Clustering Algorithm for Large Datasets

Abstract: Finding clusters in data is a challenging problem, especially when the clusters vary widely in shape, size, and density. Herein, a new scalable clustering technique that addresses all of these issues is proposed. In data mining, the purpose of data clustering is to identify useful patterns in the underlying dataset. Within the last several years, many clustering algorithms have been proposed in this area of research. Among these methods, density-based clustering methods are the most important because of their ability to detect arbitrarily shaped clusters. Moreover, these methods often show good noise-handling capabilities, since clusters are defined as regions of typical density separated by regions of low or no density. In this paper, we aim to enhance the well-known DBSCAN algorithm to make it scalable and able to discover clusters in uneven datasets, in which clusters are regions of homogeneous density. We achieve scalability by using the k-means algorithm to obtain an initial partition of the dataset, applying the enhanced DBSCAN to each partition, and then using a merging process to recover the natural number of clusters in the underlying dataset. The proposed algorithm therefore consists of three stages. Experimental results on synthetic datasets show that the proposed clustering algorithm is faster and more scalable than its enhanced DBSCAN counterpart.
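The three stages named in the abstract (k-means partitioning, density clustering per partition, merging) can be illustrated with a minimal sketch. This is not the authors' implementation: textbook DBSCAN stands in for the enhanced DBSCAN (EDBSCAN), whose details are not given here, and the merge rule (join two local clusters when any pair of their points lies within eps) is an assumption made only for illustration. The function names and the parameters k, eps, and min_pts are likewise hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Stage 1: plain Lloyd's k-means; returns the non-empty partitions."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        parts = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            parts[nearest].append(p)
        # Move each center to the mean of its partition (keep it if empty).
        centers = [
            tuple(sum(x) / len(part) for x in zip(*part)) if part else centers[i]
            for i, part in enumerate(parts)
        ]
    return [part for part in parts if part]


def dbscan(points, eps, min_pts):
    """Stage 2 stand-in: textbook DBSCAN (not EDBSCAN); noise is dropped."""
    labels = {}  # point index -> cluster id, or -1 for noise

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        labels[i] = cid
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels.get(j, -1) != -1:
                continue  # already assigned to a cluster
            was_noise = j in labels
            labels[j] = cid
            if was_noise:
                continue  # border point: joins the cluster, is not expanded
            nb = neighbors(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point, so keep expanding
        cid += 1
    clusters = [[] for _ in range(cid)]
    for i, c in labels.items():
        if c >= 0:
            clusters[c].append(points[i])
    return clusters


def merge(clusters, eps):
    """Stage 3 (assumed rule): union clusters that come within eps of each other."""
    parent = list(range(len(clusters)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            if any(math.dist(p, q) <= eps
                   for p in clusters[a] for q in clusters[b]):
                parent[find(a)] = find(b)
    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), []).extend(c)
    return list(merged.values())


def scalable_cluster(points, k, eps, min_pts):
    """Run the three stages in order and return the merged clusters."""
    local = []
    for part in kmeans(points, k):
        local.extend(dbscan(part, eps, min_pts))
    return merge(local, eps)
```

A call such as scalable_cluster(points, k=4, eps=1.5, min_pts=3) takes a list of coordinate tuples and returns a list of clusters. The point of the design is that each DBSCAN run only scans its own partition, so the quadratic neighbor search is paid on smaller subsets. The brute-force merge shown here is quadratic again and would need a cheaper boundary test in practice.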

Authors: Ahmed Fahim, Abd-Elbadeeh Salem, Fawzy Torkey, Mohamed Ramadan, Gunter Saake

Affiliation: not specified

Source: Journal of Software Engineering and Applications, 2010, No. 6, pp. 593-602 (10 pages)

Keywords: EDBSCAN; Data Clustering; Varied Density Clustering; Cluster Analysis

Classification code: R73 [Medicine and Health: Oncology]

