Nagios Core 4 News and improvements
About me
32 years old
Programming since I was seven
Work as “core architect” at op5
Nagios Core co-maintainer since 2009
Will be found at the bar in the evenings
Nagios Core 4
Goals
Algorithm analysis crash course
Bottleneck analysis of Nagios Core 3
Bottleneck solutions in Nagios Core 4
New features
Future possibilities
Nagios Core 4 goals
Stability
low complexity
testing
Scalability
efficient, reusable, well-tested algorithms
efficient resource usage
Simplicity
useful APIs
no “magic” and no bloat
Algorithm analysis – big Oh
n = 100, one operation = 1 microsecond
O(1) = 0.000001 second
O(lg n) = 0.0000046 seconds (natural log; 0.0000066 with base 2)
O(n) = 0.0001 second
O(n * lg n) = 0.00046 seconds (natural log; 0.00066 with base 2)
O(n^2) = 0.01 second
O(2^n) = 4*10^16 years
O(n!) = 2.96*10^144 years
Conclusion:
Good algorithms > beefy hardware
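The non-logarithmic rows of the table can be rechecked with a few lines of C. This is only a sketch: the helper names (run_seconds, run_years, two_to_the, factorial) are made up for illustration, and a year is assumed to be 365.25 days.

```c
#include <assert.h>

#define SECONDS_PER_OP   1e-6                  /* one operation = 1 microsecond */
#define SECONDS_PER_YEAR (365.25 * 24 * 3600)  /* assumption: 365.25-day year */

/* Wall-clock seconds needed to run `ops` operations. */
static double run_seconds(double ops)
{
	assert(ops >= 0.0);
	return ops * SECONDS_PER_OP;
}

/* Same thing, expressed in years. */
static double run_years(double ops)
{
	return run_seconds(ops) / SECONDS_PER_YEAR;
}

/* 2^n as a double (exact for n = 100, since it is a power of two). */
static double two_to_the(unsigned int n)
{
	double v = 1.0;
	unsigned int i;

	for (i = 0; i < n; i++)
		v *= 2.0;
	return v;
}

/* n! as a double; approximate, and overflows above n = 170. */
static double factorial(unsigned int n)
{
	double v = 1.0;
	unsigned int i;

	for (i = 2; i <= n; i++)
		v *= (double)i;
	return v;
}
```

With n = 100, run_seconds(100.0 * 100.0) gives the O(n^2) row, while run_years(two_to_the(100)) and run_years(factorial(100)) give the last two rows.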
Algorithm analysis – big Oh
[Plot: growth of O(lg n), O(n) and O(n^2) for n = 1 to 10; y axis from 0 to 120]
I/O media comparison
HDD seek time: 5.11ms
SSD seek time: 0.24ms
RAM seek time: 0.000013ms (13ns)
SSD is 21.3 times faster than HDD
RAM is 393077 times faster than HDD
RAM is 18461 times faster than SSD
Conclusion
All types of disk access are bad
Bottleneck analysis - Test setup
3000 hosts
200 000 services
5 minute check interval
really (really) stupid plugin: check_aok
Nagios 3 bottlenecks
configuration parsing
event queue insertion
add_event() runs in O(n) time 676 times per second, but the lower bound is O(lg n)
macro resolution
strcmp() ~3700 times/sec to handle checks
job spawning and check reaping
heavy on cache-line fills and disk I/O
insufficient parallelization
Nagios 3 check flowchart
Nagios reads scheduling queue
Nagios fork()s a child
child writes half a checkresult file
child fork()s and runs shell
shell parses commandline
shell fork()s and runs plugin
child reads status and output
child completes checkresult file
child creates an “ok-to-read” file
child exits
Nagios reads spooldir
Nagios finds a checkresult file
“ok-to-read”?
read the file (cache miss)
Nagios parses checkresult
remove result and “ok-to-read”
Nagios 3 check flowchart - hotspots
[Same flowchart as on the previous slide, with the hotspots highlighted]
Config parsing solution
depth-first search for host and service dependencies
O(n^2) -> O(n): 400000000 -> 20000 operations for 20000 dependencies
group members no longer duplicated
Verify exactly once
Effect: Nagios loads configurations really fast
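A depth-first dependency check that touches each object and each dependency once can be sketched as below. This is not the Nagios Core code: struct dep_node, MAX_DEPS and has_dependency_loop() are hypothetical names, and objects are assumed to be referenced by array index.

```c
#include <assert.h>

#define MAX_DEPS 4

enum color { WHITE, GREY, BLACK }; /* unvisited, on current path, finished */

struct dep_node {
	int ndeps;           /* number of entries used in deps[] */
	int deps[MAX_DEPS];  /* indexes of the nodes this one depends on */
	enum color color;
};

/* Returns 1 if a dependency loop is reachable from node v. */
static int visit(struct dep_node *nodes, int n, int v)
{
	int i;

	assert(v >= 0 && v < n);
	if (nodes[v].color == GREY)
		return 1; /* back edge: v is already on the current path */
	if (nodes[v].color == BLACK)
		return 0; /* already fully checked, never visited again */

	nodes[v].color = GREY;
	for (i = 0; i < nodes[v].ndeps; i++) {
		if (visit(nodes, n, nodes[v].deps[i]))
			return 1;
	}
	nodes[v].color = BLACK;
	return 0;
}

/* Each node is finished once and each dependency followed once. */
static int has_dependency_loop(struct dep_node *nodes, int n)
{
	int v;

	for (v = 0; v < n; v++)
		nodes[v].color = WHITE;
	for (v = 0; v < n; v++) {
		if (visit(nodes, n, v))
			return 1;
	}
	return 0;
}
```

The colouring is what keeps the work linear: a node marked BLACK is never descended into again, so shared dependencies are not rechecked for every object that refers to them.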
Event queue solution
Move to priority queue on binary heap
Insertion: O(n) -> O(lg n)
Extract: O(1) -> O(lg n)
43000000 -> 9460 operations per second
Effect: Main nagios process uses (a lot) less CPU
Kudos: libpqueue author Volkan Yazici
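To show where the O(lg n) for both insertion and extraction comes from, here is a minimal array-backed binary min-heap keyed on event run time. It is a sketch, not libpqueue or the Nagios squeue API; struct event_heap, HEAP_MAX, heap_insert() and heap_extract() are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_MAX 1024

struct event_heap {
	long when[HEAP_MAX]; /* event run times, smallest always at index 0 */
	size_t size;
};

static void heap_swap(long *a, long *b)
{
	long t = *a;
	*a = *b;
	*b = t;
}

/* Insert: append at the end, then sift up. At most lg n swaps. */
static int heap_insert(struct event_heap *h, long when)
{
	size_t i;

	if (h->size == HEAP_MAX)
		return -1;
	i = h->size++;
	h->when[i] = when;
	while (i > 0 && h->when[(i - 1) / 2] > h->when[i]) {
		heap_swap(&h->when[(i - 1) / 2], &h->when[i]);
		i = (i - 1) / 2;
	}
	return 0;
}

/* Extract the earliest event: move the last entry to the root, sift down. */
static long heap_extract(struct event_heap *h)
{
	long min;
	size_t i = 0;

	assert(h->size > 0);
	min = h->when[0];
	h->when[0] = h->when[--h->size];
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, s = i;

		if (l < h->size && h->when[l] < h->when[s])
			s = l;
		if (r < h->size && h->when[r] < h->when[s])
			s = r;
		if (s == i)
			break;
		heap_swap(&h->when[i], &h->when[s]);
		i = s;
	}
	return min;
}
```

Extraction gets slower than with a sorted list (O(lg n) instead of O(1)), but insertion no longer walks the whole queue, which is the operation that dominated in Nagios 3.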
Macros solution
Macro names sorted on startup
Lookups: O(n) -> O(lg n)
65360 -> 3010 operations per second
Effect: Main nagios process uses less CPU
Todo: Cache resolved check commands (when configured to)
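Sorting the macro names once and then looking them up by binary search can be sketched with the standard qsort() and bsearch() functions. struct macro_key, macro_table_sort() and macro_lookup() are hypothetical names, and the numeric codes are arbitrary; the real Nagios macro table looks different.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct macro_key {
	const char *name; /* macro name without the surrounding $ signs */
	int code;         /* hypothetical numeric id of the macro */
};

static int macro_cmp(const void *a, const void *b)
{
	const struct macro_key *ka = a, *kb = b;

	return strcmp(ka->name, kb->name);
}

/* Done once at startup: O(n lg n). */
static void macro_table_sort(struct macro_key *table, size_t n)
{
	qsort(table, n, sizeof(*table), macro_cmp);
}

/* Done for every macro in every command line: O(lg n) strcmp() calls. */
static int macro_lookup(const struct macro_key *table, size_t n, const char *name)
{
	struct macro_key key;
	const struct macro_key *hit;

	assert(name != NULL);
	key.name = name;
	key.code = -1;
	hit = bsearch(&key, table, n, sizeof(*table), macro_cmp);
	return hit ? hit->code : -1;
}
```

The table must be sorted with the same comparison function that bsearch() uses, otherwise lookups silently miss entries.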
Checks Solutions
Worker processes run all helper apps (checks, notification, eventhandlers)
fork()s/sec increased (800 with 300MB process, 13900 with 1MB process)
Effects:
Drastically reduced I/O load (100% -> 1%)
Drastically reduced CPU usage
Up to ~300000 checks / 5 minutes
Kudos: Sven Nierlein, William Leibzon & Jean Gabès
Worker processes breakdown
Workers are spawned by Nagios
Chosen in round-robin fashion
Workers communicate with Nagios using libnagios APIs exclusively
Todo:
Special-purpose workers calling in
Zero fork()s
Experimental implementation in op5 labs
Remote workers
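Round-robin worker selection, as mentioned in the breakdown above, needs nothing more than a counter. The sketch below assumes each worker is represented by a socket descriptor; struct worker_pool, MAX_WORKERS and pool_pick() are illustrative names, not the Nagios Core implementation.

```c
#include <assert.h>

#define MAX_WORKERS 64

struct worker_pool {
	int sd[MAX_WORKERS]; /* one socket descriptor per worker process */
	unsigned int count;  /* number of workers actually running */
	unsigned int next;   /* index of the worker that gets the next job */
};

/* Pick the next worker in turn and advance the counter. O(1) per job. */
static int pool_pick(struct worker_pool *p)
{
	int sd;

	assert(p->count > 0 && p->count <= MAX_WORKERS);
	sd = p->sd[p->next];
	p->next = (p->next + 1) % p->count;
	return sd;
}
```

Round-robin ignores how busy each worker is, which is fine when jobs are handed off asynchronously and the workers are equivalent.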
Nagios 4 check flowchart - hotspots
Nagios reads scheduling queue
Nagios tells worker to run check
worker receives command
worker parses commandline
“Simple” commandline?
worker fork()s
shell fork()s
plugin runs
worker reads status and output
worker sends data back to Nagios
Nagios parses check result
With special-purpose workers
Nagios reads scheduling queue
Nagios tells worker to run check
worker receives command
worker parses commandline
Voodoo
worker sends data back to Nagios
Nagios parses check result
Check engine performance comparison
[Bar chart comparing Centreon, Icinga, Nagios 3, gearman, Shinken and Nagios 4; axis from 0 to 350000]
Nagios 4 – New features
Major:
libnagios
Query handler
NERD
Minor:
service parents
hourly_value + minimum_value
$CHECKSOURCE$
libnagios
iobroker – multiplexing library
iocache - bulk reading and writing
kvvec – key value vector handling
dkhash – dual-key hash api
bitmap – set-operations for large sets
squeue – fast scheduling queue
pqueue – priority queue (from Apache)
skiplist – previously in Nagios core only
nsock – simple socket library
libnagios – usage example
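#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <nagios/lib/libnagios.h>

#define QH_SOCKET_PATH "/opt/monitor/var/nagios.qh"

int main(int argc, char **argv)
{
	int sd, r;
	char buf[4096];

	/* connect to the query handler socket; the argument list is kept as on
	 * the slide, so check nsock.h in your libnagios version */
	sd = nsock_unix(QH_SOCKET_PATH, NSOCK_TCP | NSOCK_CONNECT, 0);
	if (sd < 0) {
		printf("Failed to connect to '%s': %s: %m\n",
		       QH_SOCKET_PATH, nsock_strerror(sd));
		exit(1);
	}

	/* queries are nul-terminated; subscribe, then copy all events to stdout */
	if (nsock_printf_nul(sd, "@nerd subscribe opathchecks") > 0) {
		while ((r = read(sd, buf, sizeof(buf))) > 0)
			write(fileno(stdout), buf, r);
	}

	close(sd);
	return 0;
}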
Query handler
General purpose handler for addressable queries in Nagios Core
query: “@<address><SP><query>\0”
“echo” service built in
query_socket=/path/to/nagios.qh in nagios.cfg
Kudos for inspiration: Mathias Kettner
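The query format above, “@<address><SP><query>\0”, is easy to get wrong by forgetting the terminating nul byte. A small helper that builds the buffer and returns the number of bytes to send might look like this; qh_format_query() is a made-up name and not part of libnagios.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "@<address> <query>" plus the terminating nul into buf.
 * Returns the number of bytes to send (nul included),
 * or -1 if the query does not fit in the buffer.
 */
static int qh_format_query(char *buf, size_t size,
                           const char *address, const char *query)
{
	int len;

	assert(buf != NULL && address != NULL && query != NULL);
	len = snprintf(buf, size, "@%s %s", address, query);
	if (len < 0 || (size_t)len >= size)
		return -1; /* truncated or formatting error */
	return len + 1; /* snprintf() already stored the nul; count it */
}
```

For example, qh_format_query(buf, sizeof(buf), "nerd", "subscribe hostchecks") fills buf with "@nerd subscribe hostchecks" and returns the length to pass to write(), nul byte included.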
NERD
Nagios Event Radio Dispatcher
Provides real-time data to outside addons
Can reduce I/O load of current addons
Queried as 'nerd' via query-handler
Example queries:
@nerd subscribe hostchecks
@nerd subscribe servicechecks
Todo: Macro support, 'alerts' channel
demo time :)
Other features
Service parents
servicedependencies made easy
hourly_value + minimum_value
$CHECKSOURCE$
Useful when adding remote checking modules
“make dox” and look in Documentation/html
Easter eggs / micro-features
The /dev/null hack
object_cache_file
status_file
nagios-devel package available
libnagios and Nagios Core headers
Addon status
Works:
mod_gearman
modpnp
livestatus (from http://github.com/ageric/livestatus)
merlin
Known bugs, issues and to-dos
Host latency calculation is messed up
If use_aggressive_host_checks=1, on-demand host checks are still run synchronously
Environment macros are currently not supported
Deprecation notices
Command line
-o (don't verify objects) is removed and will throw an error
-x (don't verify object paths) is deprecated and will produce a warning
Deprecation notices, continued
Object configuration in nagios.cfg is now officially unsupported. Do not rely on it to work
Embedded perl has been removed
Too many reports on memory leaks
Performance improved in workers by removing it, due to smaller memory footprint
Deprecation notices, continued
nagios.cfg:
sleep_time - we now poll until it's time to run the next event
command_check_interval – commands are always handled immediately
last_command_check – as per above
failure_prediction* - never implemented
Everything relating to embedded perl
Deprecation notices, continued
objects
failure_prediction* - this was never implemented
group member exclusions no longer inherited by group-in-group inclusion
group1->members = A,B
group1->group_members = group2
group2->members = !B,C
group1 has A,B,C as members in Nagios 4
group1 had A,C as members in Nagios 3
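The difference between the two behaviours can be modelled with bitmasks. This is only a model of the semantics described above, not the Nagios resolver: members are assumed to be single bits, each group is assumed to include at most one other group, and struct group, resolve_v3() and resolve_v4() are invented names.

```c
#include <assert.h>
#include <stddef.h>

enum { MEMBER_A = 1 << 0, MEMBER_B = 1 << 1, MEMBER_C = 1 << 2 };

struct group {
	unsigned int include;    /* members listed directly */
	unsigned int exclude;    /* members listed with a leading '!' */
	const struct group *sub; /* group pulled in via group_members, or NULL */
};

/* Nagios 4 behaviour: exclusions apply only inside the group that lists them. */
static unsigned int resolve_v4(const struct group *g)
{
	unsigned int members;

	assert(g != NULL);
	members = g->include & ~g->exclude;
	if (g->sub)
		members |= resolve_v4(g->sub);
	return members;
}

/* Gather all inclusions and exclusions along the group-in-group chain. */
static void collect(const struct group *g, unsigned int *inc, unsigned int *exc)
{
	for (; g != NULL; g = g->sub) {
		*inc |= g->include;
		*exc |= g->exclude;
	}
}

/* Nagios 3 behaviour: exclusions are inherited by the including group. */
static unsigned int resolve_v3(const struct group *g)
{
	unsigned int inc = 0, exc = 0;

	assert(g != NULL);
	collect(g, &inc, &exc);
	return inc & ~exc;
}
```

With group2 set to include C and exclude B, and group1 set to include A and B and pull in group2, resolve_v4() on group1 yields A, B and C while resolve_v3() yields A and C.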
Special thanks
Ethan Galstad
Daniel Wittenberg
Armin Wolfermann
Joerg Linge
Sven Nierlein
Mark Frost
Robin Sonefors
William Leibzon
Everyone who sent me configs for testing
Questions?
Look me up between sessions
Check out the 'make dox' thingie
Online resources
http://www.github.com/ageric
http://www.op5.com