A Monitoring and Diagnosis System for CNGrid
(Original Chinese title: 国家高性能计算环境运行状态诊断系统)
Abstract: [Objective] This paper presents a method for building an operation status diagnosis system for a large-scale distributed computing environment. [Context] To keep a high-performance computing environment running stably, analyzing logs and other environment data is an important way to profile the state of the environment and detect anomalies. However, the analysis results usually take the form of text and numbers, which give operations staff little intuitive sense of the situation and are slow to interpret. [Methods] We built an operation status diagnosis system for the national high-performance computing environment (CNGrid). The system quantifies and visualizes the operation status of the target computing environment: it collects and organizes information from the environment and analyzes it item by item from several angles. [Results] The individual analysis results are combined into a unified operation status score for the environment and presented visually, so that operations staff can grasp the state of the environment directly and locate problems quickly. [Conclusions] Most of the processing and analysis is performed automatically by the system, which greatly reduces manual effort and effectively supports operation and maintenance work.
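The abstract does not say how the per-aspect results are merged into one score. As a rough illustration only, the sketch below assumes a weighted mean over aspect scores on a 0 to 100 scale. The aspect names, the weights, the scale, and the helper names `AspectScore`, `overall_score`, and `worst_aspect` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AspectScore:
    name: str      # hypothetical aspect label, e.g. "log_anomalies"
    score: float   # 0 (worst) to 100 (best); this scale is an assumption
    weight: float  # relative importance; assumed, not from the paper


def overall_score(aspects):
    """Weighted mean of the per-aspect scores (assumed aggregation rule)."""
    total_weight = sum(a.weight for a in aspects)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(a.score * a.weight for a in aspects) / total_weight


def worst_aspect(aspects):
    """Aspect with the lowest score, a first hint for locating a problem."""
    return min(aspects, key=lambda a: a.score)


if __name__ == "__main__":
    # Illustrative values only; not data from the monitored environment.
    aspects = [
        AspectScore("node_availability", 95.0, 2.0),
        AspectScore("job_success_rate", 80.0, 1.0),
        AspectScore("log_anomalies", 50.0, 1.0),
    ]
    print(round(overall_score(aspects), 1))  # 80.0
    print(worst_aspect(aspects).name)        # log_anomalies
```

A weighted mean keeps the overall figure on the same scale as its inputs. Reporting the lowest-scoring aspect next to it gives operators a starting point when the overall score drops.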
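Authors: ZHAO Yining (赵一宁); XIAO Haili (肖海力) (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China)
Source: Frontiers of Data & Computing (《数据与计算发展前沿》), CSCD, 2024, No. 1, pp. 57-67 (11 pages)
Funding: National Key Research and Development Program of China, project "Research on Service-Oriented Mechanisms and Support Systems for the National High-Performance Computing Environment (Phase II)" (2018YFB0204000).
Keywords: system diagnosing; data processing; quantification; visualization; HPC environment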