Automating On-Call Duties for Red Hat IT with Ansible and Ansible Tower
Lauren Santiago, Mauricio Teixeira
Agenda
● How you got started with Ansible
● Ansible Tower Infrastructure Setup
● On-Call Process Automation
● Nagios Event Handlers and Ansible Tower
● Q&A
1. How did you get started with Ansible?
2. How long have you been using it?
3. What's your favorite thing to do when you Ansible?
Ansible Tower Infrastructure Setup
Steps that were automated with On-Call Process
Configuration and Infrastructure Management
- Red Hat IT uses a wide range of technologies. In this context, the most notable are:
  - Puppet
  - Nagios
  - Ansible Engine
  - Ansible Tower
- Integration between Nagios and Ansible Tower did not exist, so we had to develop our own (and we open sourced it!)
Operating Principle of Nagios
“Operating principle of Nagios” by Diglinks, Public Domain (Source: https://en.wikipedia.org/wiki/Nagios)
Example of Host Monitored by Nagios
“Example of host and service status details page in Nagios Core 4.0.8” by Jhom526, CC BY-SA 4.0 (Source: https://en.wikipedia.org/wiki/Nagios)
Standard Monitoring Workflow
- Nagios check #1 = OK, queue next check 5 minutes later
- Nagios check #2 = CRITICAL, waits 1 minute to try again
- Nagios check #3 = CRITICAL, waits 1 minute to try again
- Nagios check #4 = CRITICAL, alerts on-call person (using whatever methods have been defined)
- On-call person = ACK, silences the alert
- On-call person = looks for documentation, and performs necessary actions
- Nagios check #N = OK, queue next check 5 minutes later
Monitoring Workflow with Ansible Tower
- Nagios check #1 = OK, queue next check 5 minutes later
- Nagios check #2 = CRITICAL, triggers the event handler, but script skips the run (prevents false positives)
- Nagios check #3 = CRITICAL, the event handler makes a call to the Ansible Tower API that triggers a job:
  - Sets downtime on the service in Nagios
  - Takes the node out of load balancing
  - Takes corrective action and validates
  - Puts the node back into load balancing
  - Creates a blog post about the event
  - Sends a notification to the IT on-call person
- Nagios check #N = OK, queue next check 5 minutes later
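The skip-then-act behaviour in the workflow above can be sketched as a small decision function. This is a minimal sketch, not the real handler's logic: the name `should_launch_job`, the attempt threshold of 2, and the rule that a service already in downtime is left alone are all assumptions inferred from the slide.

```python
def should_launch_job(state, attempt, downtime):
    """Return True when the event handler should call Ansible Tower.

    Assumed policy (inferred from the workflow, not taken from the real
    script): act only on CRITICAL, skip the first failed attempt to
    avoid false positives, and skip services already in downtime.
    """
    if state != "CRITICAL":
        return False  # nothing to repair
    if downtime > 0:
        return False  # assumed: a repair job or an operator is already on it
    return attempt >= 2  # first CRITICAL result is skipped


# First CRITICAL result (check #2 on the slide): handler runs but skips.
first = should_launch_job("CRITICAL", attempt=1, downtime=0)
# Second CRITICAL result (check #3 on the slide): handler launches a job.
second = should_launch_job("CRITICAL", attempt=2, downtime=0)
```

Because the Tower job sets downtime on the service, later CRITICAL results would not launch a second job under this assumed policy.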
Nagios Event Handlers and Ansible Tower
- Developers deploy services, and define their Nagios checks using Puppet modules developed by IT, with very little code (some are already pre-defined)
- Ansible Tower automatically generates a “generic” host inventory from the list of hosts monitored by Nagios
- IT built a set of standard systems repair playbooks that have been loaded in Ansible Tower as Job Templates
- Developers are welcome to build their own repair playbooks and host inventories
- Nagios event handler definition
    define command {
        command_name  tower-handler-self-full
        command_line  $USER1$/extra/eventhandlers/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$' --extra_vars '$ARG3$' --limit '$HOSTADDRESS$'
    }
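On the receiving side, the flags in this command line could be read with `argparse`. The sketch below is an illustration only: the flag names come from the command definition above, but the types, defaults, and the `build_parser` helper are assumptions and may differ from the published `tower_handler.py`.

```python
import argparse


def build_parser():
    """Parser for the flags used in the Nagios command definition."""
    parser = argparse.ArgumentParser(description="Nagios event handler (sketch)")
    parser.add_argument("--state", required=True)      # $SERVICESTATE$
    parser.add_argument("--attempt", type=int, required=True)  # $SERVICEATTEMPT$
    parser.add_argument("--downtime", type=int, default=0)     # $SERVICEDOWNTIME$
    parser.add_argument("--service", required=True)    # $SERVICEDESC$
    parser.add_argument("--hostname", required=True)   # $HOSTADDRESS$
    parser.add_argument("--template", required=True)   # $ARG1$
    parser.add_argument("--inventory", required=True)  # $ARG2$
    parser.add_argument("--extra_vars", default="")    # $ARG3$
    parser.add_argument("--limit", default="")         # $HOSTADDRESS$
    return parser


# Values as Nagios would expand them for the HTTPd example service.
args = build_parser().parse_args([
    "--state", "CRITICAL",
    "--attempt", "2",
    "--downtime", "0",
    "--service", "HTTPd",
    "--hostname", "server01.example.com",
    "--template", "Restart Service",
    "--inventory", "generic",
    "--extra_vars", "service_name: httpd",
    "--limit", "server01.example.com",
])
```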
- Nagios service check definition (with handler)
    define service {
        use                  generic-service
        host_name            server01.example.com
        service_description  HTTPd
        service_groups       prod_webservice
        contact_groups       it-on-call
        check_command        check_nrpe_command!check_proc_httpd
        event_handler        tower-handler-self-full!Restart Service!generic!service_name: httpd
    }
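Nagios splits the `event_handler` value on `!`: the first piece names the command and the rest become `$ARG1$`, `$ARG2$`, and so on. The sketch below mimics that expansion to show which value lands in which handler flag. The function `split_event_handler` is hypothetical, and it ignores escaped `\!` sequences for simplicity.

```python
def split_event_handler(event_handler):
    """Split 'command!arg1!arg2...' into a command name and $ARGn$ macros."""
    name, *arguments = event_handler.split("!")
    macros = {
        "$ARG%d$" % index: value
        for index, value in enumerate(arguments, start=1)
    }
    return name, macros


command_name, macros = split_event_handler(
    "tower-handler-self-full!Restart Service!generic!service_name: httpd"
)
# macros["$ARG1$"] feeds --template, "$ARG2$" feeds --inventory,
# and "$ARG3$" feeds --extra_vars in the command definition.
```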
- Handler call to Ansible Tower (simplified version)
    job_data['extra_vars'] = args.extra_vars
    job_started = client.post('/job_templates/%s/launch/' % template_number, data=job_data)
    if(job_started.json()['id'] and job_started.json()['job']):
        log_run("OK: job started")
    else:
        log_run("ERROR: start job failed")
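To try the launch logic without a Tower server, the call can be exercised against a stand-in client. Everything here is a sketch under assumptions: `FakeClient`, `FakeResponse`, `launch_job`, the template number 42, and the `limit` key in the payload are illustrative and not taken from the real handler.

```python
class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class FakeClient:
    """Hypothetical stand-in for the authenticated Tower API client."""

    def __init__(self, payload):
        self.payload = payload
        self.calls = []

    def post(self, path, data=None):
        self.calls.append((path, data))  # record the request for inspection
        return FakeResponse(self.payload)


def launch_job(client, template_number, extra_vars, limit):
    """Launch a job template and report whether Tower accepted it."""
    job_data = {"extra_vars": extra_vars, "limit": limit}
    response = client.post(
        "/job_templates/%s/launch/" % template_number, data=job_data
    )
    body = response.json()
    if body.get("id") and body.get("job"):
        return "OK: job started"
    return "ERROR: start job failed"


ok_client = FakeClient({"id": 69817, "job": 69817})
ok_message = launch_job(ok_client, 42, "service_name: httpd", "server01.example.com")
failed_message = launch_job(FakeClient({}), 42, "service_name: httpd", "server01.example.com")
```

Passing the client in as a parameter keeps the success check separate from authentication and transport, so it can be checked in isolation.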
- Log of Ansible Tower being called
    /var/log/messages
    Oct 02 14:00:00 nagios01 tower_handler.py[840]: job_number=69817 job_status="STARTED" service="HTTPd" hostname="server01.example.com" service_state="CRITICAL" service_attempt=2 service_downtime=0 template="Restart Service" inventory="generic" extra_vars="service_name: httpd" limit="server01.example.com" handler_message="OK: job started"
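Because the handler logs `key=value` pairs with quoted values, a line like the one above is easy to turn into a dictionary for reporting. The following is a minimal sketch using only the standard library. The helper name `parse_handler_log` is an assumption, and it relies on the syslog prefix ending in `]: ` as in the sample.

```python
import shlex


def parse_handler_log(line):
    """Extract key=value fields from one tower_handler.py log line."""
    # Drop the syslog prefix ("Oct 02 ... tower_handler.py[840]: ").
    _, _, payload = line.partition("]: ")
    fields = {}
    # shlex keeps quoted values such as "Restart Service" in one token.
    for token in shlex.split(payload):
        key, separator, value = token.partition("=")
        if separator:
            fields[key] = value
    return fields


sample = (
    'Oct 02 14:00:00 nagios01 tower_handler.py[840]: job_number=69817 '
    'job_status="STARTED" service="HTTPd" hostname="server01.example.com" '
    'service_state="CRITICAL" service_attempt=2 service_downtime=0 '
    'template="Restart Service" inventory="generic" '
    'extra_vars="service_name: httpd" limit="server01.example.com" '
    'handler_message="OK: job started"'
)
fields = parse_handler_log(sample)
```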
Nagios/Tower handler script
- Integration between Nagios and Tower was developed by Red Hat IT and open sourced so that anyone can use it
- https://github.com/ansible/tower-nagios-integration
- Contributions are welcome
Success with Ansible and Ansible Tower for Red Hat IT
- Over 100 pager playbooks have been automated since February 2018
- 50 of them were converted into a single generic service handler template
- 15% of production alerts are automatically handled by Tower, and never page the on-call person
- Moving more application release processes out of pure Ansible Engine into Tower
- Planning to migrate from Jenkins to Ansible Tower for base OS image builds, and from Ansible Engine to Ansible Tower for VM deployments
- Multiple teams in IT have begun implementing their own automation needs using our Ansible Tower setup
Q&A