Abstract
[Background] In high-performance computing systems, jobs may fail or exit abnormally after running for some time, so that computing resources are consumed without producing satisfactory results. [Objective] Detecting abnormal job states and issuing early warnings lets users and administrators intervene sooner, reduces wasted resources, and makes it possible to trace and analyze the causes of anomalies earlier and more effectively. [Methods] Using real monitoring data from large supercomputing clusters, this paper applies the XGBoost algorithm to the running states and characteristics of jobs in order to detect anomalies in the running state of each job type and to predict whether a job will fail. [Results] A comparison and analysis of algorithms shows that XGBoost predicts job failure with relatively high accuracy. [Conclusions] This work opens up a new line of research on anomaly detection and early warning for high-performance computing jobs, and it can help users make more efficient use of expensive supercomputing resources.
Authors
纪鹏
牛铁
危婷
彭亮
JI Peng; NIU Tie; WEI Ting; PENG Liang (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China; School of Computer Science and Technology, Chinese Academy of Sciences, Beijing 100049, China)
Funding
Special Project for Cybersecurity and Informatization of the Chinese Academy of Sciences (CAS-WX2022GC-0103).