A Cluster System Failure Prediction Approach Based on Deep Learning
(Original title: 基于深度学习的集群系统故障预测方法)
Abstract In cluster system failure prediction, long time-series forecasting suffers from vanishing or exploding gradients caused by the loss of key feature information, which degrades the accuracy of failure prediction models. To address this, a new deep-learning-based failure prediction method for cluster systems is proposed. The method uses a bidirectional gated recurrent unit (BiGRU) to capture local temporal features and a Transformer to strengthen global feature extraction. Bidirectional information flow in the BiGRU layer tracks how the temporal features of the cluster system logs change over time, capturing the latent causal relationships and local temporal features of cluster events. The Transformer layer then processes the sequence output by the BiGRU layer in parallel to obtain global temporal dependencies, and a fully connected neural network layer produces the prediction. The method is validated on a public dataset built from real logs generated by the Blue Gene/L system. The results show that the proposed method outperforms the comparison methods, with a best accuracy and F1 score of 91.69% and 92.74%, respectively.
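The pipeline described in the abstract (BiGRU for local temporal features, a Transformer layer for global dependencies, then a fully connected layer) can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the class name, the embedding of log events as integer IDs, all layer sizes, the mean pooling step, and the omission of an explicit positional encoding are assumptions made here for the sake of a runnable example.

```python
import torch
from torch import nn


class BiGRUTransformerClassifier(nn.Module):
    """Hypothetical sketch: BiGRU -> Transformer encoder -> fully connected head.

    All hyperparameters below are illustrative assumptions, not values
    taken from the paper.
    """

    def __init__(self, num_event_types=100, embed_dim=32, hidden_dim=32,
                 num_heads=4, num_layers=1, num_classes=2):
        super().__init__()
        # Assumption: each log event is encoded as an integer event-type ID.
        self.embed = nn.Embedding(num_event_types, embed_dim)
        # Bidirectional GRU reads the sequence in both directions
        # to capture local temporal features.
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        d_model = 2 * hidden_dim  # forward and backward states concatenated
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        # Self-attention processes all time steps in parallel
        # to model global temporal dependencies.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Fully connected layer maps the pooled features to class logits
        # (failure vs. no failure in this sketch).
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, event_ids):
        x = self.embed(event_ids)          # (batch, seq_len, embed_dim)
        local, _ = self.bigru(x)           # (batch, seq_len, 2 * hidden_dim)
        global_feats = self.encoder(local) # (batch, seq_len, 2 * hidden_dim)
        pooled = global_feats.mean(dim=1)  # assumption: mean pooling over time
        return self.head(pooled)           # (batch, num_classes)
```

Feeding a batch of integer-encoded log windows, for example `torch.randint(0, 100, (4, 20))`, yields one row of class logits per window. Because the recurrent layer's outputs already depend on event order, this sketch leaves out a positional encoding; whether the published model includes one is not stated in the abstract.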
Authors JI Lixia (姬莉霞); ZHANG Qingkai (张庆开); ZHOU Hongxin (周洪鑫); DANG Yiping (党依萍); ZHANG Han (张晗) (School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China; College of Computer Science, Sichuan University, Chengdu 610065, China)
Source Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》), indexed in CAS and the Peking University Core Journals list, 2024, No. 5, pp. 71-79 (9 pages)
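Funding National Natural Science Foundation of China (52179144); Henan Province Major Science and Technology Project (201300210500); Zhengzhou Major Science and Technology Innovation Project (2020CXZX0053).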
Keywords failure prediction; cluster system; feature extraction; recurrent neural network; Transformer; deep learning