Abstract
Big data analytics, the process of organizing and analyzing data to extract useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As data volumes grow at a tremendous rate, processing them in a single datacenter becomes less efficient from a performance point of view. Large cloud service providers are deploying datacenters at geographically distributed locations around the world for better performance and availability. A widely used approach for analytics of geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters to a central datacenter. However, it has been observed that this approach consumes a significant amount of bandwidth, leading to worse performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we present a survey of the representative mechanisms proposed in the literature for wide-area analytics. We discuss basic ideas, present proposed architectures and mechanisms, and discuss several examples to illustrate existing work. We point out the limitations of these mechanisms, give comparisons, and conclude with our thoughts on future research directions.