Method of Predicting Spark SQL Job Execution Time by Decision Tree Model
Abstract: Spark SQL implements high-speed computing and complex data mining, but it has usability problems on very large clusters and datasets: the choice of the optimal Catalyst execution plan and the Shuffle Partition configuration both have a large impact on performance, and data skew often degrades cluster performance. To predict execution time accurately before a job runs, make fuller use of runtime data, and select the optimal execution plan, this paper proposes predicting job execution time with regression models based on decision trees and their ensemble algorithms. Cross-validation is used to tune the model hyperparameters, pruning and ensemble algorithms are used to mitigate overfitting, and relevant metrics are selected to evaluate the accuracy of the machine learning models' predictions. Experiments show that the gradient boosted tree regression model predicts job execution time with an R² above 0.8 and meets the real-time requirements of online prediction. The evaluation metrics reach the expected level and show some advantage over those of a linear regression model.
Author: Wu Enci (吴恩慈), Shanghai Qiyu Information Technology Co., Ltd., Shanghai 200120, China
Source: Computer Applications and Software (《计算机应用与软件》), PKU Core journal, 2021, No. 4, pp. 24-31, 123 (9 pages)
Keywords: task scheduling; calculation engine; job characteristics; execution time; prediction model; decision tree
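
The workflow summarized in the abstract (a boosted tree regressor, hyperparameters chosen by cross-validation, accuracy reported as R²) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration, not taken from the paper: the two job features (`input_gb`, `shuffle_partitions`), the synthetic execution times, the use of depth-1 trees (stumps) as weak learners, and the candidate round counts. The abstract does not say which features, library, or settings the author used.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_stump(X, residuals):
    """Depth-1 regression tree: pick the (feature, threshold) with least squared error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted(set(row[j] for row in X)):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:  # no valid split: fall back to a constant leaf
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_gbt(X, y, n_rounds, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        preds = [p + learning_rate * (lm if row[j] <= thr else rm)
                 for row, p in zip(X, preds)]
    return base, learning_rate, stumps

def predict_gbt(model, X):
    base, lr, stumps = model
    return [base + lr * sum((lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps)
            for row in X]

def cross_val_r2(X, y, n_rounds, k=3):
    """Mean held-out R² over k interleaved folds."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(y), k))
        X_tr = [x for i, x in enumerate(X) if i not in test_idx]
        y_tr = [t for i, t in enumerate(y) if i not in test_idx]
        X_te = [x for i, x in enumerate(X) if i in test_idx]
        y_te = [t for i, t in enumerate(y) if i in test_idx]
        model = fit_gbt(X_tr, y_tr, n_rounds)
        scores.append(r2_score(y_te, predict_gbt(model, X_te)))
    return sum(scores) / k

# Synthetic jobs (hypothetical features): [input_gb, shuffle_partitions] -> seconds.
rng = random.Random(0)
X = [[rng.uniform(1, 100), rng.choice([50, 100, 200, 400])] for _ in range(90)]
y = [0.8 * gb + 2000.0 / parts + rng.gauss(0, 3) for gb, parts in X]

# Choose the number of boosting rounds by cross-validated R².
cv_scores = {n: cross_val_r2(X, y, n) for n in (5, 20, 80)}
best_rounds = max(cv_scores, key=cv_scores.get)
best_score = cv_scores[best_rounds]
print(best_rounds, round(best_score, 3))
```

In practice one would more likely rely on an existing implementation such as Spark MLlib's `GBTRegressor` or scikit-learn's `GradientBoostingRegressor` rather than hand-written stumps. The sketch only shows how cross-validated R² can drive the choice of a boosting hyperparameter.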