Abstract
A Distributed Monitoring System (NGOP) that will scale to the anticipated requirements of Run II computing has been under development at Fermilab. NGOP [1] provides a framework for creating Monitoring Agents that monitor the overall state of computers and the software running on them. Several Monitoring Agents are available within NGOP; they can analyze log files and check the existence of system daemons, CPU and memory utilization, etc. NGOP also provides customizable, hierarchical graphical representations of the monitored systems. NGOP can generate events when serious problems have occurred and raise alarms when potential problems have been detected. NGOP allows corrective actions to be performed or notifications to be sent, and it provides persistent storage for collected events, alarms, and actions. A first implementation of NGOP was recently deployed at Fermilab. This is a fully functional prototype that satisfies most of the existing requirements. The NGOP prototype currently monitors 512 nodes. During its first few months of operation, NGOP has proved to be a useful tool: multiple problems, such as node resets, offline CPUs, and dead system daemons, have been detected. NGOP has provided system administrators with the information required for better system tuning and configuration. The current state of deployment and future steps to improve the prototype and implement new features will be presented.