
Automating On-Call Duties for Red Hat IT with Ansible and Ansible Tower

Lauren Santiago, Mauricio Teixeira

Agenda

● How you got started with Ansible

● Ansible Tower Infrastructure Setup

● On-Call Process Automation

● Nagios Event Handlers and Ansible Tower

● Q&A

1. How did you get started with Ansible?

2. How long have you been using it?

3. What's your favorite thing to do when you Ansible?

Ansible Tower Infrastructure Setup

Steps that were automated with On-Call Process



Configuration and Infrastructure Management

- Red Hat IT uses a wide range of technologies. In this context, the most notable are:
  - Puppet
  - Nagios
  - Ansible Engine
  - Ansible Tower

- Integration between Nagios and Ansible Tower did not exist, so we had to develop our own (and we open sourced it!)

Operating Principle of Nagios

“Operating principle of Nagios” by Diglinks, Public Domain (Source: https://en.wikipedia.org/wiki/Nagios)

Example of Host Monitored by Nagios

“Example of host and service status details page in Nagios Core 4.0.8” by Jhom526, CC BY-SA 4.0 (Source: https://en.wikipedia.org/wiki/Nagios)

Standard Monitoring Workflow

- Nagios check #1 = OK, queue next check 5 minutes later
- Nagios check #2 = CRITICAL, waits 1 minute to try again
- Nagios check #3 = CRITICAL, waits 1 minute to try again
- Nagios check #4 = CRITICAL, alerts on-call person (using whatever methods have been defined)
- On-call person = ACK, silences the alert
- On-call person = looks for documentation, and performs necessary actions
- Nagios check #N = OK, queue next check 5 minutes later

Monitoring Workflow with Ansible Tower

- Nagios check #1 = OK, queue next check 5 minutes later
- Nagios check #2 = CRITICAL, triggers the event handler, but script skips the run (prevents false positives)
- Nagios check #3 = CRITICAL, the event handler makes a call to the Ansible Tower API that triggers a job:
  - Sets downtime on the service in Nagios
  - Takes the node out of load balancing
  - Takes corrective action and validates
  - Puts the node back into load balancing
  - Creates a blog post about the event
  - Sends a notification to the IT on-call person
- Nagios check #N = OK, queue next check 5 minutes later
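The skip-then-act behavior in the workflow above can be sketched as a small decision function. This is a minimal sketch, not the actual handler: the function name `should_launch_job` and the `min_attempt=2` threshold are assumptions chosen to match the slide (skip the first CRITICAL result, act on the second), and the downtime check assumes the job's first step, setting downtime in Nagios, is what prevents repeated launches.

```python
def should_launch_job(state, attempt, downtime, min_attempt=2):
    """Decide whether the event handler should call Ansible Tower.

    state    -- service state passed by Nagios ($SERVICESTATE$), e.g. "CRITICAL"
    attempt  -- current check attempt ($SERVICEATTEMPT$)
    downtime -- scheduled downtime depth ($SERVICEDOWNTIME$); 0 means none
    min_attempt is a hypothetical threshold, not taken from the real script.
    """
    if state != "CRITICAL":
        return False  # nothing to repair
    if int(downtime) > 0:
        return False  # service is in downtime, so a repair job may already be running
    # Skip the first CRITICAL result to avoid acting on a false positive.
    return int(attempt) >= min_attempt
```

With these assumptions, the first CRITICAL result is ignored, the second launches a job, and later results are ignored once the job has set downtime on the service.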

Nagios Event Handlers and Ansible Tower

- Developers deploy services, and define their Nagios checks using Puppet modules developed by IT, with very little code (some are already pre-defined)

- Ansible Tower automatically generates a “generic” host inventory from the list of hosts monitored by Nagios

- IT built a set of standard systems repair playbooks that have been loaded in Ansible Tower as Job Templates

- Developers are welcome to build their own repair playbooks and host inventories
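As an illustration of the "generic" inventory idea, the sketch below turns a list of monitored host names into the JSON structure that Ansible dynamic inventory scripts emit. How the host list is actually pulled from Nagios is not described in the slides, so the input list, the function name `build_inventory`, and the group name `nagios_monitored` are all assumptions.

```python
import json


def build_inventory(hostnames, group="nagios_monitored"):
    """Build an Ansible dynamic-inventory dictionary from host names.

    The group name is hypothetical; duplicates are dropped and hosts sorted
    so that repeated runs produce identical output.
    """
    hosts = sorted(set(hostnames))
    return {
        group: {"hosts": hosts},
        # An empty hostvars entry per host avoids per-host lookups by Ansible.
        "_meta": {"hostvars": {host: {} for host in hosts}},
    }


if __name__ == "__main__":
    print(json.dumps(build_inventory(["server01.example.com"]), indent=2))
```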


- Nagios event handler definition

define command {
    command_name    tower-handler-self-full
    command_line    $USER1$/extra/eventhandlers/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$' --extra_vars '$ARG3$' --limit '$HOSTADDRESS$'
}
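On the receiving side, the handler script has to read these flags. The following is a minimal sketch of how that could look with `argparse`; the flag names come from the command definition above, while the types, defaults, and the function name `parse_args` are assumptions rather than the real script's code.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the Nagios command definition passes in."""
    parser = argparse.ArgumentParser(
        description="Nagios event handler that launches an Ansible Tower job"
    )
    parser.add_argument("--state", required=True)              # $SERVICESTATE$
    parser.add_argument("--attempt", type=int, required=True)  # $SERVICEATTEMPT$
    parser.add_argument("--downtime", type=int, default=0)     # $SERVICEDOWNTIME$
    parser.add_argument("--service", required=True)            # $SERVICEDESC$
    parser.add_argument("--hostname", required=True)           # $HOSTADDRESS$
    parser.add_argument("--template", required=True)           # $ARG1$
    parser.add_argument("--inventory", required=True)          # $ARG2$
    parser.add_argument("--extra_vars", default="")            # $ARG3$
    parser.add_argument("--limit", default="")                 # $HOSTADDRESS$
    return parser.parse_args(argv)
```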


- Nagios service check definition (with handler)

define service {
    use                     generic-service
    host_name               server01.example.com
    service_description     HTTPd
    service_groups          prod_webservice
    contact_groups          it-on-call
    check_command           check_nrpe_command!check_proc_httpd
    event_handler           tower-handler-self-full!Restart Service!generic!service_name: httpd
}
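Nagios separates a command name from its arguments with `!`, and exposes the arguments to the command definition as `$ARG1$`, `$ARG2$`, and so on. The sketch below mimics that expansion for the `event_handler` value above, to show which value ends up in which handler flag. The function name `expand_command` is hypothetical, and the sketch ignores escaping.

```python
def expand_command(command_spec):
    """Split a Nagios command spec into its name and $ARGn$ macro values."""
    name, *args = command_spec.split("!")
    macros = {"$ARG%d$" % i: value for i, value in enumerate(args, start=1)}
    return name, macros


name, macros = expand_command(
    "tower-handler-self-full!Restart Service!generic!service_name: httpd"
)
# name is "tower-handler-self-full"
# macros["$ARG1$"] is "Restart Service"      (passed as --template)
# macros["$ARG2$"] is "generic"              (passed as --inventory)
# macros["$ARG3$"] is "service_name: httpd"  (passed as --extra_vars)
```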


- Handler call to Ansible Tower (simplified version)

job_data['extra_vars'] = args.extra_vars
job_started = client.post('/job_templates/%s/launch/' % template_number, data=job_data)
if job_started.json()['id'] and job_started.json()['job']:
    log_run("OK: job started")
else:
    log_run("ERROR: start job failed")
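For readers without the `client` object used above, here is a rough, network-free sketch of the two halves of that call: building the launch request and checking the response. The `/api/v2` prefix, the host `tower.example.com`, and both function names are assumptions. Whether Tower honors `extra_vars` and `limit` at launch depends on how the job template is configured.

```python
import json


def build_launch_request(base_url, template_id, extra_vars, limit):
    """Return the URL and JSON body for launching a job template."""
    url = "%s/api/v2/job_templates/%s/launch/" % (base_url.rstrip("/"), template_id)
    payload = {"extra_vars": extra_vars, "limit": limit}
    return url, json.dumps(payload)


def job_was_started(response_body):
    """Mirror the handler's check: both 'id' and 'job' must be present."""
    data = json.loads(response_body)
    # .get() avoids a KeyError when the API returns an error document instead.
    return bool(data.get("id") and data.get("job"))


url, body = build_launch_request(
    "https://tower.example.com/", 42, "service_name: httpd", "server01.example.com"
)
```

The resulting `url` and `body` can then be sent with any HTTP client using the site's own authentication.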


- Log of Ansible Tower being called

/var/log/messages

Oct 02 14:00:00 nagios01 tower_handler.py[840]: job_number=69817 job_status="STARTED" service="HTTPd" hostname="server01.example.com" service_state="CRITICAL" service_attempt=2 service_downtime=0 template="Restart Service" inventory="generic" extra_vars="service_name: httpd" limit="server01.example.com" handler_message="OK: job started"
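Because the handler logs `key=value` pairs with quoted values, entries like the one above are easy to parse for reporting. The following is a small sketch using only the standard library; the function name `parse_handler_log` is hypothetical and the code assumes the syslog prefix ends with `]: ` as in the sample line.

```python
import shlex


def parse_handler_log(line):
    """Turn one tower_handler.py syslog line into a dict of its fields."""
    # Drop the syslog prefix (timestamp, host, program[pid]).
    _, _, message = line.partition("]: ")
    # shlex.split keeps quoted values such as "OK: job started" together.
    tokens = shlex.split(message)
    return dict(token.split("=", 1) for token in tokens if "=" in token)


sample = (
    'Oct 02 14:00:00 nagios01 tower_handler.py[840]: job_number=69817 '
    'job_status="STARTED" service="HTTPd" service_attempt=2 '
    'extra_vars="service_name: httpd" handler_message="OK: job started"'
)
fields = parse_handler_log(sample)
```

All values come back as strings, so numeric fields such as `job_number` need converting before any arithmetic.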

Nagios/Tower handler script

- Integration between Nagios and Tower has been developed by Red Hat IT, and open sourced so the world can use it

- https://github.com/ansible/tower-nagios-integration

- Contributions are welcome


Success with Ansible and Ansible Tower for Red Hat IT

- Over 100 Pager Playbooks have been automated since February 2018
- 50 of them were converted into a single generic service handler template
- 15% of production alerts are automatically handled by Tower, and never page the on-call person
- Moving more application release processes out of pure Ansible Engine into Tower
- Planning to migrate from Jenkins to Ansible Tower for base OS image build, and from Ansible Engine for VM deployments
- Multiple teams in IT began implementing their own automation needs using our Ansible Tower setup

Q&A

Date post:	23-Jun-2020