Writing Nagios Plugins in Python

Post on 03-Jul-2015

description

I introduced Nagios to an organisation in 2004 to track the availability of various servers and network resources. It has since grown into a system validity tool that takes the stress out of the help desk. Using Python as a scripting language, I have created a suite of additional Nagios plugins that cover:

* real-time entry of market rates
* end of day rate integrity
* common errors in manual spreadsheets
* success of backup processes
* validity conditions in MS SQL databases
* routine tracking of known chronic errors

transcript

Enhancing Nagios with Python Plugins

Maurice Maneschi

Associate Director, Risk Management Systems

Oakvale Capital Limited

Presentation Outline

● Risk Management Systems

● What is Nagios

● Why Python

● What is a plugin

● Specific risks being monitored

● Analysing reports and logs

● Where to next

Risk Management Systems

● A division of five staff

● Supporting three key applications

● Running on eight servers

● Depending on 15+ other boxes spread over 3 LANs

● Five key vendors

Risk Management Systems

● Divisional goals

– Key goal is application management

– Some customer support

– Product innovation

– Project management

– No time for nasty surprises

What is Nagios

● Host, service and network monitoring program

● Open source

● Written in C

● Runs on Linux, with Apache serving the web interface

What is Nagios

● Configured with the hosts of a network

– How the hosts are networked

– What key services are on the hosts

● “PING”, SMTP, HTTP etc.

● Application polls these at specified intervals

– From the results of the polls, determines the state of hosts, services and networks

– Alerts sent by email

– Escalation, reporting, statistics and more

Why Python

● Flexible

● Efficient

● Manageable

● Numerous, diverse libraries

● Cross-platform

● Huge number of code samples across the Internet

What is a plugin

● Executable file

– Takes parameters (preferably)

– Prints a short status message

● Returns an exit status of

– 0 – all OK

– 1 – warning

– 2 – critical

● Stateless

What is a plugin

● Executable Python script

● Code the test

● Print the status line

● Return a status

● Easy!
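The recipe above can be sketched as a minimal plugin. The test used here (free disk space on a path), the thresholds and the names `classify` and `main` are illustrative assumptions rather than anything from the presentation. Exit status 3 (UNKNOWN) is part of the Nagios plugin convention, although the slide lists only 0 to 2.

```python
"""Minimal Nagios plugin sketch: code the test, print one line, return a status."""
import argparse
import shutil

# Exit statuses defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(free_percent, warn, crit):
    """Map a measurement onto a status (less free space is worse)."""
    if free_percent < crit:
        return CRITICAL
    if free_percent < warn:
        return WARNING
    return OK


def main(argv=None):
    """Run the check, print the status line and return the exit status."""
    parser = argparse.ArgumentParser(description="Check free disk space")
    parser.add_argument("--path", default="/")
    parser.add_argument("--warn", type=float, default=20.0,
                        help="warn below this percentage free")
    parser.add_argument("--crit", type=float, default=10.0,
                        help="critical below this percentage free")
    args = parser.parse_args(argv)
    try:
        usage = shutil.disk_usage(args.path)
    except OSError as exc:
        # The test itself could not run, so report UNKNOWN rather than guess.
        print("DISK UNKNOWN - %s" % exc)
        return UNKNOWN
    free_percent = 100.0 * usage.free / usage.total
    status = classify(free_percent, args.warn, args.crit)
    print("DISK %s - %.1f%% free on %s" % (LABELS[status], free_percent, args.path))
    return status
```

Ending the script with `sys.exit(main())` hands the status back to Nagios. Keeping the decision in `classify` means the thresholds can be exercised without touching a real disk.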

Specific risks being monitored

● Customer email to the help desk system has stopped

– Users email issues directly into our help desk system for prioritisation, action and eventually billing

– Spam periodically breaks the import agent

– It's proprietary, so no fix in sight

– Nagios watches the queue using POP3
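One way to watch a mailbox over POP3 is with the standard library's `poplib`. The sketch below assumes that a growing count of waiting messages means the import agent has stopped collecting mail. The thresholds, the `HELPDESK` label and the function names are hypothetical, and the original plugin may work differently.

```python
import poplib

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_queue(message_count, warn=5, crit=20):
    """Waiting mail suggests the import agent is not draining the mailbox."""
    if message_count >= crit:
        return CRITICAL, "HELPDESK CRITICAL - %d messages waiting" % message_count
    if message_count >= warn:
        return WARNING, "HELPDESK WARNING - %d messages waiting" % message_count
    return OK, "HELPDESK OK - %d messages waiting" % message_count


def count_waiting(host, user, password, timeout=10):
    """Log in over POP3 and return how many messages sit in the mailbox."""
    box = poplib.POP3(host, timeout=timeout)
    try:
        box.user(user)
        box.pass_(password)
        count, _size = box.stat()  # stat() returns (message count, mailbox size)
        return count
    finally:
        box.quit()


def check_helpdesk_mailbox(host, user, password, warn=5, crit=20):
    """Full check: UNKNOWN if the mailbox cannot be read, else classify."""
    try:
        count = count_waiting(host, user, password)
    except (poplib.error_proto, OSError) as exc:
        return UNKNOWN, "HELPDESK UNKNOWN - %s" % exc
    return classify_queue(count, warn, crit)
```

A plugin built on this would print the returned message and exit with the returned status.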

Specific risks being monitored

● Ratefeed is missing some rates

– Rates feed into our system from Reuters via MS Excel

– Some rates are critical, and human intervention is required if they are missing

– Other rates are important, but are just tracked when missing

– Nagios watches the MS Excel worksheet that lists the “unreliable rates”
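The split between critical and merely important rates maps naturally onto the CRITICAL and WARNING statuses. The sketch below assumes the names of the missing rates have already been read from the worksheet (the reading step, for example with pyexcelerator, is left out), and the rate names in the usage note are made up for illustration.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_unreliable_rates(missing, critical_rates):
    """Classify the rate names that the worksheet reports as missing.

    missing        -- iterable of rate names read from the worksheet
    critical_rates -- set of names that need human intervention when missing
    """
    missing = sorted(set(missing))
    critical_missing = [name for name in missing if name in critical_rates]
    if critical_missing:
        return CRITICAL, "RATEFEED CRITICAL - missing: %s" % ", ".join(critical_missing)
    if missing:
        # Important but not critical: track it without raising the alarm.
        return WARNING, "RATEFEED WARNING - missing: %s" % ", ".join(missing)
    return OK, "RATEFEED OK - all rates present"
```

For example, `check_unreliable_rates(["NZD/USD"], {"AUD/USD"})` yields a warning, while a missing `"AUD/USD"` yields a critical status.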

Specific risks being monitored

● Rates must be inserted regularly

– Insertion process has numerous dependencies

– Moving target – causes of failure change over time

– Focus on the end point – are the rates in the database?

– Nagios checks the databases and alerts on old or missing rates
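Checking the end point can be as simple as asking the database for its newest rate and comparing that against the clock. In the sketch below, the `rates` table, its `inserted_at` column, the thresholds and the sample row are all hypothetical. An in-memory `sqlite3` database stands in for the MS SQL server so that the example runs anywhere; a real plugin would pass in a connection from a driver such as pymssql.

```python
import sqlite3
from datetime import datetime, timedelta

OK, WARNING, CRITICAL = 0, 1, 2


def check_rate_age(conn, now, warn=timedelta(minutes=15), crit=timedelta(minutes=60)):
    """Alert when the newest row in the (hypothetical) rates table is too old."""
    cur = conn.cursor()
    cur.execute("SELECT MAX(inserted_at) FROM rates")
    latest = cur.fetchone()[0]
    if latest is None:
        return CRITICAL, "RATES CRITICAL - no rates in the database"
    # sqlite3 hands back text; a SQL Server driver would normally return a datetime.
    if not isinstance(latest, datetime):
        latest = datetime.strptime(latest, "%Y-%m-%d %H:%M:%S")
    age = now - latest
    minutes = int(age.total_seconds() // 60)
    if age >= crit:
        return CRITICAL, "RATES CRITICAL - newest rate is %d minutes old" % minutes
    if age >= warn:
        return WARNING, "RATES WARNING - newest rate is %d minutes old" % minutes
    return OK, "RATES OK - newest rate is %d minutes old" % minutes


# Demonstration with made-up data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rates (name TEXT, value REAL, inserted_at TEXT)")
conn.execute("INSERT INTO rates VALUES ('AUD/USD', 0.75, '2015-07-03 09:00:00')")
print(check_rate_age(conn, datetime(2015, 7, 3, 9, 40, 0)))
# prints (1, 'RATES WARNING - newest rate is 40 minutes old')
```

Because the check looks only at the final table, it keeps working when the upstream causes of failure change.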

Specific risks being monitored

● External source of dealing information

– Fed in through the FIX protocol

– Numerous failure points being monitored on a (Windows) server

– Monitor process must check in with Nagios every 10 minutes

– Using passive and active checks
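A passive check reaches Nagios as a `PROCESS_SERVICE_CHECK_RESULT` external command, and the ten-minute rule amounts to a staleness test on the last check-in. The sketch below shows both halves. The host and service names in the usage note are hypothetical, and the slides do not say how results travel from the Windows server to the Nagios host or how the deadline is enforced (Nagios also offers built-in freshness checking), so treat this as one possible arrangement.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2


def format_passive_result(host, service, status, message, timestamp=None):
    """Build a PROCESS_SERVICE_CHECK_RESULT line for the Nagios command file."""
    if timestamp is None:
        timestamp = time.time()
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(timestamp), host, service, status, message)


def check_heartbeat(last_checkin, now, max_silence=600):
    """Active-side safety net: complain if the monitor has gone quiet.

    last_checkin and now are Unix timestamps; max_silence is in seconds
    (600 seconds matches the ten-minute rule on the slide).
    """
    silence = int(now - last_checkin)
    if silence > max_silence:
        return CRITICAL, "FIX CRITICAL - no check-in for %d seconds" % silence
    return OK, "FIX OK - last check-in %d seconds ago" % silence
```

For example, `format_passive_result("fixserver", "FIX feed", OK, "OK - heartbeat")` produces a line that can be appended to the Nagios external command file on the monitoring host.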

Specific risks being monitored

● Quick passive check

Specific risks being monitored

● Successful backups

● Successful scheduled tasks

● Database comparisons

● Common errors

– Password server on web site

– Known failure point on an MS Excel worksheet
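The slides do not say how backup success is verified. One simple approach, sketched here under the assumption that each backup run writes a file to a known location, is to treat a missing, empty or stale file as a failure. The function name, the 26-hour limit and the path argument are hypothetical.

```python
import os
import time

OK, WARNING, CRITICAL = 0, 1, 2


def check_backup(path, max_age_hours=26, min_bytes=1, now=None):
    """Treat a missing, empty or stale backup file as a failed backup."""
    if now is None:
        now = time.time()
    try:
        info = os.stat(path)
    except OSError:
        return CRITICAL, "BACKUP CRITICAL - %s not found" % path
    if info.st_size < min_bytes:
        return CRITICAL, "BACKUP CRITICAL - %s is empty" % path
    age_hours = (now - info.st_mtime) / 3600.0
    if age_hours > max_age_hours:
        return CRITICAL, "BACKUP CRITICAL - %s is %.0f hours old" % (path, age_hours)
    return OK, "BACKUP OK - %s is %.0f hours old" % (path, age_hours)
```

A limit slightly longer than the backup interval (26 hours for a nightly job) leaves room for runs that finish a little later than usual.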

Extra enhancements to Nagios

● High level view of systems health

● Audio alerts and SMSes from UTbox.net

● Status screen on monitor PC

● Syslogd for firewall

● Script reuse for rate checks

● Ad hoc system problems

– Currently tracking WAN failures

Analysing reports and logs

● Screen saver often sufficient

● Summary views

Where to next

● Low spec-ed PC

● Nagios is in several distro repositories

– I compile from the source

● Allow a day at least to configure Nagios

– Don't expect to install and switch it on

● Tuning Nagios is an ongoing job

Further information

● Nagios: http://www.nagios.org

● Python: http://www.python.org

– pyexcelerator, pymssql, freetds from Sourceforge

● Oakvale Capital: http://www.oakvale.com

● Code samples: http://www.redwaratah.com/wiki/index.php?title=Nagios_and_Python

● Maurice Maneschi: MauriceM@oakvale.com