In the second session, I attended and scribed Carson Gaspar's invited talk, "Deploying Nagios in a Large Enterprise Environment." Carson discussed how a project went from skunk-works to production and how monitoring was explicitly delayed until after an incident. Their Nagios (version 1.x) installation had several initial problems:
They solved or worked around these problems by switching from active to passive checks (which took them from 3 to 1800 possible checks per second), splitting the configuration to allow multiple instances of Nagios to run on the same server, deploying highly available Nagios servers (to reduce single points of failure), and generating the configuration files from the canonical data sources (so that, for example, any new server is automatically monitored). They also created a custom notification back end to integrate with their Netcool infrastructure and to intelligently suppress alarms (such as during known maintenance windows or scheduled building-wide power-downs).
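To make the "generate the configuration from canonical data sources" idea concrete, here is a minimal sketch in Python. It is not Carson's generator, which the talk did not show in detail. The inventory records, the `generic-host` and `passive-service` template names, and the function names are all assumptions made for illustration. The output uses the standard Nagios object-definition syntax, with services marked as passive-only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryHost:
    """One record from a hypothetical canonical inventory source."""
    name: str
    address: str
    services: tuple = ()


def render_host(host):
    """Render one Nagios host object ('generic-host' is an assumed template)."""
    return "\n".join([
        "define host {",
        "    use        generic-host",
        f"    host_name  {host.name}",
        f"    address    {host.address}",
        "}",
    ])


def render_service(host, service):
    """Render one passive-only service ('passive-service' is an assumed template)."""
    return "\n".join([
        "define service {",
        "    use                     passive-service",
        f"    host_name               {host.name}",
        f"    service_description     {service}",
        "    active_checks_enabled   0",
        "    passive_checks_enabled  1",
        "}",
    ])


def render_config(inventory):
    """Render every host and its services, sorted by name for stable diffs."""
    blocks = []
    for host in sorted(inventory, key=lambda h: h.name):
        blocks.append(render_host(host))
        for service in host.services:
            blocks.append(render_service(host, service))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    inventory = [
        InventoryHost("web02", "10.0.0.2", ("load",)),
        InventoryHost("web01", "10.0.0.1", ("load", "disk")),
    ]
    print(render_config(inventory))
```

Because the file is rebuilt from the inventory on every run, a host added to the inventory shows up in monitoring without anyone editing the Nagios configuration by hand.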
The monitoring system design criteria specified that it had to be lightweight, make additional agents easy to write and easy to deploy, avoid the expensive fork()/exec() calls as much as possible, support callbacks to avoid blocking, support proxy agents to monitor other devices (such as NetApps, where the Nagios agent can't run), evaluate all thresholds locally, and batch the server updates.
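The last two criteria, local threshold evaluation and batched updates, can be sketched as follows. This is an illustrative Python sketch, not the agent described in the talk: the class names, the `send` callback, and the batch size are assumptions. The state codes follow the usual Nagios plugin convention (0 OK, 1 warning, 2 critical), and `format_passive_result` produces the standard `PROCESS_SERVICE_CHECK_RESULT` external command line that Nagios accepts for passive checks.

```python
from dataclasses import dataclass

# Nagios plugin return codes
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class Threshold:
    """Upper-bound thresholds, evaluated on the client rather than the server."""
    warning: float
    critical: float

    def evaluate(self, value):
        if value >= self.critical:
            return CRITICAL
        if value >= self.warning:
            return WARNING
        return OK


def format_passive_result(timestamp, host, service, state, output):
    """Format one result as a Nagios external command line."""
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{state};{output}")


class BatchingAgent:
    """Hypothetical agent: evaluates locally, sends results in batches."""

    def __init__(self, host, send, batch_size=3):
        self.host = host
        self.send = send            # callback that receives a list of results
        self.batch_size = batch_size
        self.pending = []

    def record(self, service, value, threshold):
        """Evaluate one measurement locally and queue the resulting state."""
        state = threshold.evaluate(value)
        self.pending.append((self.host, service, state, f"value={value}"))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send all queued results in one update, if there are any."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.send(batch)


if __name__ == "__main__":
    agent = BatchingAgent("web01", print, batch_size=2)
    agent.record("load", 1.0, Threshold(warning=5, critical=10))
    agent.record("disk", 95, Threshold(warning=80, critical=90))
```

Evaluating on the client means the server only ever receives a state and a short message, and batching means one network round trip can carry many results.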
The clients evolved over time; added features included multiple agent instances, agent instance-to-server mapping, automatic reloading of configuration and modules on update, automatic re-exec of the Nagios agent on update, collection of statistics instead of just alarms, and SASL authentication to monqueue. The servers evolved as well, gaining instances split off by administrative domain (such as production application groups versus developers), high availability, SASL authentication and authorization, and service dependencies.
This effort started for one project with fewer than 200 hosts and was eventually used for large sections of the environment. Documentation and internal consultancy are critical for user acceptance. The lesson is to architect from the start for eventual production adoption across the enterprise. As an example of the scale reached, one HP DL385G1 (2x2.6GHz with 4GB RAM) is running 11 instances with 27,000 service checks on 6,600 hosts while using no more than 10% CPU and 500MB RAM.