Conference Report: 2007 LISA: Nagios

In the second session, I attended and scribed Carson Gaspar's invited talk, "Deploying Nagios in a Large Enterprise Environment." Carson discussed how a project went from skunk-works to production and how monitoring was explicitly delayed until after an incident. Their Nagios (version 1.x) installation had several initial problems:

Performance problems: By default, Nagios (pre-3.x) performs active checks, can't exceed about 3 checks per second, and does a fork()/exec() for every statistical sample. Also, the web UI takes a long time to display large or complex configurations (fixed in 2.x).
Configuration: Configuration files are verbose, even with templates, and it's too easy to make typos in them. Keeping up with a high churn rate in monitored servers was very expensive.
Availability: Hardware and software failures, building power-downs, and patches and upgrades can all take the monitoring system down, which raises the question of who monitors the monitoring system.
Integration and automation: Alarms needed to integrate with the existing alerting and escalation systems and to be suppressed in certain situations (e.g., "building is intentionally powered down"). Provisioning needed to be automatic and integrated with the existing provisioning system.
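
To make the suppression requirement concrete, here is a minimal sketch of a maintenance-window check. It is not Carson's implementation (the talk did not show code); the MaintenanceWindow class, the should_suppress function, and the idea of matching alarms by a "scope" string such as a building name are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned outage for one scope (hypothetical model, e.g. a building)."""
    scope: str
    start: datetime
    end: datetime


def should_suppress(alarm_scope: str, when: datetime,
                    windows: list[MaintenanceWindow]) -> bool:
    """Return True if the alarm falls inside a known window for its scope."""
    return any(
        w.scope == alarm_scope and w.start <= when < w.end
        for w in windows
    )


# Example: a building that is intentionally powered down overnight.
windows = [
    MaintenanceWindow("building-7",
                      datetime(2007, 11, 17, 22, 0),
                      datetime(2007, 11, 18, 6, 0)),
]
```

A notification back end could call should_suppress before forwarding an alarm, so that hosts in a powered-down building do not page anyone.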

They solved or worked around these problems by switching from active to passive checks (which took them from 3 to 1800 possible checks per second), splitting the configuration to allow multiple instances of Nagios to run on the same server, deploying highly available Nagios servers (to reduce single points of failure), and generating the configuration files from the canonical data sources (so that, for example, any new server is automatically monitored). They also created a custom notification back end to integrate with their Netcool infrastructure and to intelligently suppress alarms (such as during known maintenance windows or during scheduled building-wide power-downs).
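
As a rough illustration of generating configuration from a canonical data source, the sketch below renders Nagios host object definitions from an inventory list. The inventory record layout (dictionaries with "name" and "address" keys), the function names, and the "generic-host" template name are assumptions; the talk did not describe the actual generator or data source.

```python
def render_host(name: str, address: str, template: str = "generic-host") -> str:
    """Render one Nagios host object definition as text."""
    return (
        "define host{\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_config(inventory: list[dict]) -> str:
    """Render a definition for every record, sorted by name for stable diffs."""
    hosts = sorted(inventory, key=lambda h: h["name"])
    return "\n".join(render_host(h["name"], h["address"]) for h in hosts)


# Hypothetical inventory, standing in for the canonical data source.
inventory = [
    {"name": "web02", "address": "10.0.0.2"},
    {"name": "web01", "address": "10.0.0.1"},
]
config_text = render_config(inventory)
```

Because the files are regenerated rather than hand-edited, a server added to the inventory shows up in monitoring on the next run, and typos in hand-written definitions stop being a failure mode.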

The monitoring system design criteria specified that it had to be lightweight, have additional agents that are easy to write and deploy, avoid the expensive fork()/exec() calls as much as possible, support callbacks to avoid blocking, support proxy agents to monitor other devices (such as NetApps, where the Nagios agent can't run), evaluate all thresholds locally, and batch the server updates.
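
The last two criteria (evaluate thresholds locally, batch the server updates) can be sketched as follows. This is an illustration under assumptions, not the agent from the talk: the evaluate, format_result, and Batcher names are hypothetical, thresholds are assumed to be "higher is worse," and the transport is left as a caller-supplied send function because the talk did not detail how results reached the server. The line format is the standard Nagios external command for passive service check results.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def evaluate(value: float, warn: float, crit: float) -> int:
    """Map a sample to a Nagios state, assuming higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_result(host, service, state, output, timestamp=None):
    """Format one passive result as a Nagios external command line."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}"


class Batcher:
    """Collect result lines and pass them to `send` in groups of `size`."""

    def __init__(self, send, size=50):
        self.send = send          # hypothetical transport callback
        self.size = size
        self.pending = []

    def add(self, line):
        self.pending.append(line)
        if len(self.pending) >= self.size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send("\n".join(self.pending) + "\n")
            self.pending = []
```

The agent decides the state in-process, with no fork()/exec() per sample, and the server only has to accept finished results, which is what makes the passive-check throughput so much higher than active polling.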

The clients evolved over time; added features included multiple agent instances, agent instance-to-server mapping, automatic reloading of configuration and modules on update, automatic re-exec of the Nagios agent on update, collection of statistics instead of just alarms, and SASL authentication to monqueue. The servers evolved as well, adding instances split off by administrative domain (such as production application groups versus developers), high availability, SASL authentication and authorization, and service dependencies.

This effort started for one project with fewer than 200 hosts and was eventually adopted for large sections of the environment. Documentation and internal consultancy are critical for user acceptance. The lesson is to architect from the start for eventual production adoption across the enterprise. As an example of the resulting scale, one HP DL385G1 (2x2.6GHz with 4GB RAM) is running 11 instances with 27000 service checks on 6600 hosts, and it's using no more than 10% CPU and 500MB RAM.

Last update Feb01/20 by Josh Simon (<jss@clock.org>).