Apache Spark provides a well-known MapReduce-style computing framework that aims to process big data analytics quickly in a data-parallel manner. On this platform, large input data are divided into data partitions, and each partition is processed concurrently by multiple computation tasks. The outputs of these tasks are transferred among multiple computers via the network. However, such a distributed computing framework suffers from system overheads, inevitably caused by communication and disk I/O operations, and these overheads take up a large proportion of the Job Completion Time (JCT). We observed that allocating excessive computational resources incurs considerable system overheads and prolongs the JCT. Over-allocating resources to an individual job not only prolongs its own JCT, but also likely leaves other jobs under-allocated, so the average JCT becomes suboptimal as well. To address this problem, we propose a prediction model that estimates how the JCT of a single Spark job changes with its resource allocation. Supported by this prediction model, we design a heuristic algorithm that balances the resource allocation of multiple Spark jobs, aiming to minimize the average JCT in multi-job scenarios. We implemented the prediction model and the resource allocation method in ReB, a Resource Balancer built on Apache Spark. Experimental results show that ReB significantly outperforms the traditional max-min fairness and shortest-job-optimal methods, decreasing the average JCT by around 10%–30% compared to existing solutions.
Funding: Supported in part by the National Key R&D Program of China (No. 2018YFB2101100), the National Natural Science Foundation of China (Nos. 61932001 and 61872376), and the Hunan Provincial Innovation Foundation for Postgraduates.
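As a rough illustration only (the abstract contains no code), the following Scala sketch shows the kind of greedy, prediction-driven allocation described above: a predicted-JCT function whose value rises again once a job is over-allocated, and a heuristic that hands out cores to whichever job benefits most. The names (Job, predictJct, balance) and the cost model are assumptions made for this sketch, not ReB's actual implementation or API.

object ReBSketch extends App {
  // Hypothetical job descriptor; a real system would track stages, shuffle sizes, etc.
  case class Job(id: String, inputPartitions: Int)

  // Assumed JCT predictor: parallel compute time shrinks with more cores, while
  // communication/disk-I/O overhead grows with them, so predicted JCT is U-shaped.
  def predictJct(job: Job, cores: Int): Double =
    if (cores <= 0) Double.PositiveInfinity
    else {
      val compute  = job.inputPartitions * 2.0 / cores            // data-parallel work
      val overhead = 0.05 * cores + 0.5 * math.log(cores + 1.0)   // shuffle / disk I/O cost
      compute + overhead
    }

  // Greedy heuristic: repeatedly give one more core to the job with the largest
  // predicted JCT reduction, and stop once no job benefits from an extra core.
  def balance(jobs: Seq[Job], totalCores: Int): Map[String, Int] = {
    val alloc = scala.collection.mutable.Map.empty[String, Int]
    jobs.foreach(j => alloc(j.id) = 1)                  // every job gets at least one core
    var remaining = totalCores - jobs.size
    var improving = true
    while (remaining > 0 && improving) {
      val best = jobs.maxBy(j => predictJct(j, alloc(j.id)) - predictJct(j, alloc(j.id) + 1))
      val gain = predictJct(best, alloc(best.id)) - predictJct(best, alloc(best.id) + 1)
      if (gain <= 0) improving = false                  // extra cores would only add overhead
      else { alloc(best.id) += 1; remaining -= 1 }
    }
    alloc.toMap
  }

  val jobs = Seq(Job("j1", 200), Job("j2", 50), Job("j3", 400))
  println(balance(jobs, totalCores = 48))               // e.g. Map(j1 -> ..., j2 -> ..., j3 -> ...)
}

With such a U-shaped cost model, the loop naturally stops adding cores once the marginal overhead outweighs the parallelism gain, which mirrors the over-allocation effect the abstract highlights; the actual ReB predictor and balancing policy in the paper may differ.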