DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. Because datasets are now extremely large, parallel processing of complex data analysis such as DBSCAN has become indispensable. However, existing parallel DBSCAN algorithms have three major drawbacks. First, they fail to balance the load properly among parallel tasks, especially when the data are heavily skewed. Second, their scalability is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We conduct our evaluation using large real-world datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
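The idea of partitioning by estimated computation cost rather than by point count can be illustrated with a minimal Python sketch. This is not the partitioning method of MR-DBSCAN, which the abstract does not detail. The cost model (the sum of squared point counts per grid cell, a rough stand-in for neighbor-query work) and the names `estimate_cost` and `split_balanced` are assumptions made only for illustration.

```python
from collections import Counter


def estimate_cost(points, cell=1.0):
    """Hypothetical cost model: sum of n^2 over occupied grid cells.

    Assumes the clustering work in a cell grows roughly quadratically
    with the number of points that fall into it.
    """
    counts = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return sum(n * n for n in counts.values())


def split_balanced(points, cell=1.0):
    """Split points along x at the cut that best balances estimated cost."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i][0] == pts[i - 1][0]:
            continue  # never cut between points that share an x coordinate
        left, right = pts[:i], pts[i:]
        diff = abs(estimate_cost(left, cell) - estimate_cost(right, cell))
        if best is None or diff < best[0]:
            best = (diff, left, right)
    if best is None:
        return pts, []  # no valid cut position exists
    return best[1], best[2]


# Skewed toy data: 8 points packed into one cell, 4 points spread out.
points = [(i / 10, 0.0) for i in range(8)] + [(float(x), 0.0) for x in (5, 6, 7, 8)]
left, right = split_balanced(points)
print(len(left), estimate_cost(left))    # 4 16
print(len(right), estimate_cost(right))  # 8 20
```

On this toy input, an equal-count split (6 and 6 points) would give estimated costs of 36 and 8, whereas the cost-based cut gives 16 and 20. Under the assumed cost model, this shows why balancing by cost can matter for skewed data.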
This paper presents xGo, a cloud-based multiple-route recommendation system that enables smartphone users to choose suitable routes based on knowledge discovered in real taxi trajectories. In modern cities, GPS-equipped taxicabs report their locations regularly, which generates a huge volume of trajectory data every day. Optimized routes can be learned by mining these massive repositories of spatio-temporal information. We propose a system that can store and manage GPS log files in a cloud-based platform, probe traffic conditions, take advantage of taxi drivers' route-selection intelligence, and recommend an optimal path or multiple candidates to meet customized requirements. Specifically, we leverage a Hadoop-based distributed route clustering algorithm to distinguish different routes and predict traffic conditions through the latent traffic rhythm. We evaluate our system using a real-world dataset (>100 GB) generated by about 20,000 taxis over a 2-month period in Shenzhen, China. Our experiments reveal that our service can provide appropriate routes in real time and estimate traffic conditions accurately.