Abstract
High-performance computers are an important research infrastructure that supports development across many industries. Keeping them running stably and efficiently is both the aim and the responsibility of system administrators. This paper introduces an operation and maintenance management platform developed for the "Magic Cube-3" (魔方-3) high-performance computer, covering the platform's architecture design, its underlying data collection interfaces and methods, and the functions it implements, including system monitoring, automatic inspection, and data analysis. With this platform, system administrators can see the machine's operating status clearly and intuitively and can detect and handle faults promptly. By mining and analyzing data from multiple perspectives, they can identify the bottlenecks that limit current operating efficiency, which provides a sound basis for decisions on subsequent software and hardware optimization and upgrades.
Author
ZHAO Qi-qi (赵奇奇), Shanghai Supercomputer Center, Shanghai 201203, China
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core (北大核心)
2020, No. 10, pp. 1807-1814 (8 pages)
Keywords
high-performance computer (高性能计算机)
maintenance and management (运维管理)
system monitoring (系统监控)
data analysis (数据分析)