Abstract
As supercomputers grow in scale and scientific applications grow in complexity, many jobs on supercomputers fail. Failed jobs waste resources and prolong the waiting time of queued jobs, which seriously degrades the system's execution efficiency. If job failures can be predicted in advance, measures can be taken to improve resource utilization and execution efficiency, which is crucial for future exascale supercomputers. To this end, this paper studies the prediction of job failures from known traditional features and constructed features, and identifies features, together with methods for processing them, that reflect users' work behavior patterns and submission behavior patterns. By combining behavior features with traditional features, a comprehensive framework based on tree-structured models is proposed to predict job failure. Experimental results show that the comprehensive framework outperforms single-model prediction, and comparative experiments show that it outperforms other related methods.
Authors
唐阳坤 (TANG Yang-kun)
鲜港 (XIAN Gang)
杨文祥 (YANG Wen-xiang)
喻杰 (YU Jie)
张晓蓉 (ZHANG Xiao-rong)
王耀彬 (WANG Yao-bin)
Affiliations: School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010; Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang 621050; College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core Journal (北大核心)
2022, Issue 10, pp. 1753-1761 (9 pages)
Funding
National Natural Science Foundation of China (61872304, 61802320)
Foundation of the State Key Laboratory of Aerodynamics (SKLA20200203)
Keywords
system execution efficiency
job log analysis
user behavior
job failure prediction
machine learning