Research on Maintenance Fault Diagnosis System for E-class High-Performance Computer

Original title: E级高性能计算机的维护故障诊断系统研究
Cited by: 6
Abstract E-class (exascale) computer systems are extremely large, so the total number of faults and anomalies grows with them and diagnosis becomes harder. There is therefore an urgent need for a more accurate and efficient real-time maintenance fault diagnosis system that performs comprehensive real-time detection of anomaly and fault information, fault diagnosis, and fault prediction for the hardware system. When diagnosing systems with tens of thousands of nodes, traditional fault diagnosis systems suffer from low execution efficiency and a high false-positive rate in anomaly detection, and their anomaly detection and fault diagnosis coverage is insufficient. This study examines techniques for anomaly and fault detection, fault diagnosis, and fault prediction, analyzes their principles and applicability, and, in light of the actual engineering requirements of E-class high-performance computers, designs a maintenance fault diagnosis system that meets those requirements. A scalable edge diagnosis architecture is designed based on the structural composition of the maintenance system; high-performance computer system knowledge and expert knowledge are combined with mathematical statistics and machine learning to produce fault detection, diagnosis, and prediction algorithms; and prediction models are established for specific scenarios. Experimental results show that the system scales well and can complete fault diagnosis of a system with 100,000 nodes within 10 s. Compared with the traditional fault diagnosis system, the false-positive rate of one specific anomaly detection indicator falls from 3.3% to almost zero, hardware fault detection coverage rises from 90.2% to more than 96%, and hardware fault diagnosis coverage rises from 71% to approximately 94%. The system can also predict faults fairly accurately in several important application scenarios.
Authors JIAN Lantao (建澜涛); REN Xiujiang (任秀江); ZHANG Zhen (张祯); SHI Song (石嵩); HUANG Yiming (黄益明); ZHANG Chunlin (张春林). Affiliations: Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
Source Computer Engineering (《计算机工程》), 2022, No. 12, pp. 24-37 (14 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list (北大核心).
Funding National Key Research and Development Program of China, 14th Five-Year Plan period (2021YFB0300900, 2021YFB0301000).
Keywords high-performance computing; maintenance system; anomaly detection; fault diagnosis; fault prediction
