期刊文献+

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data 被引量:18

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data
原文传递
导出
摘要 DBSCAN (density-based spatial clustering of ap- plications with noise) is an important spatial clustering tech- nique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel process- ing of complex data analysis such as DBSCAN becomes in- dispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to prop- erly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these al- gorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily de- signed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the crit- ical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential process- ing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation us- ing real large datasets with up to 1.2 billion points. The ex- periment results well confirm the efficiency and scalability of MR-DBSCAN. DBSCAN (density-based spatial clustering of ap- plications with noise) is an important spatial clustering tech- nique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel process- ing of complex data analysis such as DBSCAN becomes in- dispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to prop- erly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these al- gorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily de- signed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the crit- ical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential process- ing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation us- ing real large datasets with up to 1.2 billion points. The ex- periment results well confirm the efficiency and scalability of MR-DBSCAN.
出处 《Frontiers of Computer Science》 SCIE EI CSCD 2014年第1期83-99,共17页 中国计算机科学前沿(英文版)
关键词 data clustering parallel algorithm data mining data clustering, parallel algorithm, data mining,
  • 相关文献

参考文献21

  • 1Ester M, Kriegel H P, Sander J, Xu X. A densitybased algorithm for dis?covering clusters in large spatial databases. Data Mining and Knowl?edge Discovery, 1996,96: 226-231.
  • 2MacQueen J B. Some methods for classification and analysis of multi?variate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967,281-297.
  • 3Zhang T, Ramakrishnan R, Livny M. Birch: an efficient data cluster?ing method for very large databases. In: Proceedings of 1996 the ACM SIGMOD Conference on Managemnet of Data. 1996, lO3-114.
  • 4Dempster A P, Laird N M, Rubin D B. Maximum likelihood from in?complete data via the EM algorithm. Journal of the Royal Statisticai Societ, 1977,39(1): 1-38.
  • 5Wang W, Yang J, Muntz R R. Sting: A statistical information grid ap?proach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, 186-195.
  • 6Microsoft Academic Search. Top publications in data mining. http://academic.research.microsoft.com/CSDirectory/papeccategory_ 7.html. 2013.
  • 7Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. 2008, lO7-113.
  • 8White T. Hadoop: The Definitive Guide, 1st edition. O'Reilly Media, Inc., 2009.
  • 9Berger M, Bokhari S. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, 1987,36: 570- 580.
  • 10Dai B R, Lin I C. Efficient map/reduce-based dbscan algorithm with optimized data partition. In: Proceedings of the 5th IEEE International Conference on Cloud Computing. 2012, 59--66.

同被引文献97

引证文献18

二级引证文献315

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部