8/8/2019 Gaspar Talk
1/28
Deploying Nagios in a Large Enterprise
Carson Gaspar
Goldman Sachs
or...
If you strap enough rockets to a brick, you can make it fly
In the beginning...
New Linux HPC skunkworks project
Catastrophic success
Need monitoring added yesterday
Looking for Solutions
Why not use what we already had?
Stability problems
Resource utilization problems
Custom agents were hard
No support for our new Linux platform
Why Nagios?
Purchasing a new commercial solution was politically difficult
At the time (2003), Nagios was the most mature of the open source solutions
Good community support
The Naive Approach
(and why it didn't work)
Performance Problems
Configuration Management Problems
Availability Problems
Integration / Automation Requirements
Performance Problems
State check performance
  Active checks: ~3 checks / second maximum
Statistics performance
  fork()/exec() for every sample
Web UI performance
  Large configurations take a long time to display (much improved in 2.x)
Configuration Management Problems
Configuration files are very verbose (even with templates)
Syntax errors are easy to introduce
Keeping up with a high churn rate in monitored servers is expensive
Availability Problems
Hardware / software failures
Building power-downs
Patches / upgrades
Who watches the watchers?
Integration / Automation Requirements
Alarms need to be dispatched to our existing alerting and escalation system
Alarms need to be suppressed by existing maintenance tools
Provisioning should flow from our existing provisioning system
Solving the Problems
Move to passive checks
Run multiple Nagios instances
Deploy HA Nagios servers
Use data-driven configuration file generation
Create a custom notification back end
Passive Checks
Move most of the work to the clients
Batch server updates unless a state change occurs
Randomize server update times to avoid spikes
Queue the results on the server
Send statistics to a different back end
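The client-side batching described on this slide can be sketched roughly as follows. This is a minimal illustration and not the actual nagios-agent code: the class name `BatchingReporter`, the `send` callback, and the `interval` and `jitter` parameters are all hypothetical. Results are held locally and sent on a randomized timer, except that a state change triggers an immediate send.

```python
import random


class BatchingReporter:
    """Hold check results locally; send on a jittered timer or on a state change."""

    def __init__(self, send, interval=120.0, jitter=0.25, rng=None):
        self.send = send            # hypothetical callable that ships a list of results
        self.interval = interval    # nominal seconds between batches (assumed value)
        self.jitter = jitter        # fraction of the interval used as random spread
        self.rng = rng or random.Random()
        self.pending = []
        self.last_state = {}
        self.next_flush = None

    def _schedule(self, now):
        # Randomize the next send time so clients do not all report at once.
        spread = self.interval * self.jitter
        self.next_flush = now + self.interval + self.rng.uniform(-spread, spread)

    def report(self, service, state, now):
        if self.next_flush is None:
            self._schedule(now)
        # A service seen for the first time is not treated as a state change.
        changed = service in self.last_state and self.last_state[service] != state
        self.last_state[service] = state
        self.pending.append((service, state, now))
        if changed or now >= self.next_flush:
            self.flush(now)

    def flush(self, now):
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
        self._schedule(now)
```

Passing the current time in as `now` keeps the sketch easy to test; a real agent would more likely read a clock and run its own timer loop.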
Passive Data Flow
[Diagram: Client 1 ... Client n each run nagios-agent; Server 1 ... Server n each run monqueue, stats-catcher, and nagios 1 ... nagios n.]
Multiple Nagios Instances
Run many copies of Nagios on one server
Improve web UI performance
Show each group only their own servers, so the top-level dashboard is more useful
Allow per-group customizations
HA Nagios Servers
Run multiple Nagios servers on different machines in different buildings
All clients update all servers
A heartbeat is published through each server to its counterpart
Notifications are suppressed on slaves if the heartbeat service is OK
Partitioned masters can cause duplicate alerts
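The suppression rule on this slide amounts to a small decision function. The sketch below is an illustration only: the function name `should_notify` and the role strings are hypothetical, and how the heartbeat state is actually read from Nagios is not shown in the talk.

```python
def should_notify(role, heartbeat_state):
    """Decide whether this server should send a notification.

    role            -- "primary" or "secondary" (hypothetical labels)
    heartbeat_state -- state of the counterpart's heartbeat service as seen locally
    """
    if role == "primary":
        return True
    # A secondary stays quiet while the primary's heartbeat is arriving.
    return heartbeat_state != "OK"
```

With a healthy primary, only the primary notifies. If the two servers are partitioned from each other while both still receive client updates, the secondary sees a failed heartbeat and both return `True`, which is the duplicate-alert case the slide mentions.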
HA Data Flow
[Diagram: a client running nagios-agent, a primary server, and a secondary server; each server runs monqueue, nagios, and notify. Labels on the connections: heartbeat, ping.]
Data-driven Configuration File Generation
Leverage our existing host database and provisioning tools
Generate client and server configurations via cfengine based on templates and database lookups
Mostly driven by database data, with some per-server threshold overrides managed in cfengine master files
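The talk generates configuration with cfengine; the idea of rendering Nagios objects from host database rows can be illustrated in Python instead. Everything specific here is an assumption: the row fields `name` and `address`, the function `render_hosts`, and the `generic-host` template name are hypothetical. Only the `define host { ... }` object syntax is standard Nagios.

```python
from string import Template

# Standard Nagios host object syntax, filled in from one database row.
HOST_TEMPLATE = Template(
    "define host {\n"
    "    use        $template\n"
    "    host_name  $name\n"
    "    address    $address\n"
    "}\n"
)


def render_hosts(rows, template="generic-host"):
    """Render one host definition per row, sorted by name for stable diffs."""
    blocks = []
    for row in sorted(rows, key=lambda r: r["name"]):
        blocks.append(
            HOST_TEMPLATE.substitute(
                template=template, name=row["name"], address=row["address"]
            )
        )
    return "\n".join(blocks)


# Hypothetical rows standing in for a host database lookup.
rows = [
    {"name": "hpc02", "address": "192.0.2.12"},
    {"name": "hpc01", "address": "192.0.2.11"},
]
config_text = render_hosts(rows)
```

Generating the files this way avoids hand-edited syntax errors and keeps the monitored host list in step with the provisioning data.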
Custom Notification Back End
Custom code integrates with our Netcool infrastructure
Alarm suppression based on external criteria
Also supports email alerts, optionally batched
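Suppression and optional email batching might look something like the sketch below. It does not show the Netcool integration, and the alarm fields (`host`, `service`, `state`, `contact`), the `in_maintenance` callback, and the function name `route_alarms` are all hypothetical.

```python
from collections import defaultdict


def route_alarms(alarms, in_maintenance, batch_email=True):
    """Split alarms into suppressed ones and outgoing (recipient, body) emails."""
    suppressed = []
    per_contact = defaultdict(list)
    for alarm in alarms:
        # External criteria: here, a caller-supplied maintenance lookup.
        if in_maintenance(alarm["host"]):
            suppressed.append(alarm)
            continue
        line = f'{alarm["host"]}/{alarm["service"]}: {alarm["state"]}'
        per_contact[alarm["contact"]].append(line)

    emails = []
    for contact, lines in per_contact.items():
        if batch_email:
            emails.append((contact, "\n".join(lines)))  # one message per contact
        else:
            emails.extend((contact, line) for line in lines)
    return suppressed, emails
```

Batching per contact trades a little latency for far fewer messages when many services on one host fail together.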
Design Trade-offs
Batch updates mean slow detection of zombie hosts (ping-able, but not running user processes)
Nagios notification escalation doesn't work well without active checks, especially if updates are batched
Requires configuration management
More complexity = more bugs
nagios-agent Design Criteria
Lightweight
Easy to write and deploy additional agents
Avoid fork()/exec() where possible
Support agent callbacks to avoid blocking
Support proxy agents to monitor other devices where we can't deploy nagios-agent
Evaluate all thresholds locally and batch server updates
nagios-agent
Agent
  Class: Ping
  Instance: db
  Categories: appdev, sa
  Args:
    hostlist => /my/dblist,
    latency => [ 50, 100 ],
    losspct => [ 66, 100 ],
    count => 3
Agent
  Class: Ping
  Instance: LDN
  Categories: sa
  Args:
    hostlist => /my/ldnlist,
    latency => [ 100, 200 ],
    losspct => [ 66, 100 ],
    count => 3
Server
  Class: monqueue
  Categories: appdev
  Args:
    auth => GSSAPI
    keytab => mykeytab
    server => myserver,
    queue => 0
Server
  Class: monqueue
  Categories: sa
  Args:
    auth => GSSAPI
    keytab => mykeytab
    server => myserver,
    queue => 1
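Local threshold evaluation for a Ping agent like the ones configured above could be sketched as follows. The slide does not say what the two-element lists mean; this sketch assumes they are [warning, critical] limits, and the function names are hypothetical. The numeric states follow the usual Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def evaluate(value, thresholds):
    """Map a measurement to a state, ASSUMING thresholds = [warning, critical]."""
    warning, critical = thresholds
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def ping_state(latency_ms, loss_pct, args):
    """Overall state is the worse of the latency and packet-loss states."""
    return max(
        evaluate(latency_ms, args["latency"]),
        evaluate(loss_pct, args["losspct"]),
    )


# Thresholds copied from the "db" Ping instance on the slide.
db_args = {"latency": [50, 100], "losspct": [66, 100]}
```

Under these assumptions, with `count => 3` a single lost packet is about 33% loss and stays OK, two lost packets (about 67%) is a warning, and losing all three is critical.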
monqueue Design Criteria
Fast
Secure
Accept data from clients, and dispatch to multiple output queues
Supply heartbeats to Nagios
Supply queue depth stats to stats-catcher
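The dispatch and queue-depth criteria can be illustrated with a small in-memory sketch. This is not the monqueue implementation: the `Dispatcher` class and its methods are hypothetical, and authentication and networking are left out. The talk also does not say how results reach Nagios; `to_external_command` shows one possible route, assuming the standard Nagios external command for passive service results.

```python
from collections import deque


class Dispatcher:
    """Accept results from clients and hold them in numbered output queues."""

    def __init__(self, queue_count):
        self.queues = [deque() for _ in range(queue_count)]

    def accept(self, queue_id, result):
        if not 0 <= queue_id < len(self.queues):
            raise ValueError(f"unknown queue {queue_id}")
        self.queues[queue_id].append(result)

    def drain(self, queue_id, limit):
        """Remove and return up to `limit` results, oldest first."""
        queue = self.queues[queue_id]
        batch = []
        while queue and len(batch) < limit:
            batch.append(queue.popleft())
        return batch

    def depths(self):
        """Queue depth per output queue, e.g. to report as statistics."""
        return {queue_id: len(queue) for queue_id, queue in enumerate(self.queues)}


def to_external_command(timestamp, host, service, return_code, output):
    """Format a result as a Nagios PROCESS_SERVICE_CHECK_RESULT command line."""
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )
```

One output queue per Nagios instance would match the `queue => 0` and `queue => 1` settings in the agent configuration, though that mapping is a guess.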
Client Evolution
nagios-agent slowly grew features as they became required
  multiple agent instances
  agent instance to server mapping
  auto reload of configuration and modules on update
  auto re-exec of nagios-agent on update
  stats collection
  SASL authentication to monqueue
Server Evolution
Started as one monolithic instance
As deployment spread, split into multiple instances based on administrative domain
Added HA
Added SASL authentication and authorization
Added monitoring of monqueue itself, and service dependencies (so a monqueue failure didn't trigger alarms for all services)
Kudzu
Originally for one project, fewer than 200 hosts
Eventually used for large sections of the environment
Documentation and internal consultancy are critical for user acceptance
One of Our Servers
A single HP DL385 G1 (dual 2.6 GHz Opteron, 4 GB RAM), running RHEL4 U4, Nagios 2.9
11 Nagios instances
27,000+ services (mostly 2-minute intervals)
6,600+ hosts
~10% CPU
~500 MB RAM
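A quick calculation shows why this load was out of reach for the active-check approach described earlier in the talk. It is a back-of-the-envelope sketch that treats every service as reporting every 2 minutes (the slide says "mostly") and uses the ~3 checks / second active-check ceiling quoted earlier.

```python
services = 27_000              # from the slide (a lower bound: "27,000+")
interval_seconds = 2 * 60      # assumes all services use the 2-minute interval
active_check_ceiling = 3       # ~3 active checks / second, quoted earlier in the talk

required_rate = services / interval_seconds             # results per second
shortfall_factor = required_rate / active_check_ceiling

print(required_rate)       # 225.0
print(shortfall_factor)    # 75.0
```

So the server absorbs roughly 225 results per second, about 75 times what active checks could sustain.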
Still to Come
Source code release
Encryption of nagios-agent / monqueue communications
Support for pulling status from nagios-agent to better support DMZ environments
Statistical analysis of multiple data samples to determine service status
Yet more agent plugins
nagios-agent support for traditional Nagios plugins