Abstract
Job scheduling on high-performance clusters is usually handled by a job scheduling system, and accurately specified job running times can substantially improve scheduling efficiency. Existing studies typically predict running times with machine learning, but both prediction accuracy and practicality leave room for improvement. To further improve the accuracy of cluster job running time prediction, the cluster job logs are first clustered and the resulting job category information is added to the job features; the job log data is then modeled and predicted with NR-Transformer, an attention-based network. For data processing, 7-dimensional features are selected from the historical log dataset according to their correlation with the prediction target, feature completeness, and data validity. The dataset is divided into multiple job sets by job running time, and each job set is trained and predicted separately. Experimental results show that, compared with traditional machine learning and BP neural networks, time-series neural network architectures achieve better prediction performance, and NR-Transformer performs well on every job set.
Author
CHEN Feng-xian (陈奉贤), Office of Network Security and Information, Lanzhou University, Lanzhou 730000, China
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core (北大核心)
2022, No. 7, pp. 1181-1190 (10 pages)
Keywords
high performance computing
parallel job scheduling
user clustering
time-series neural network
attention mechanism