
Date post: 23-Jun-2020
Automating On-Call Duties for Red Hat IT with Ansible and Ansible Tower Lauren Santiago Mauricio Teixeira

Automating On-Call Duties for Red Hat IT with Ansible and Ansible Tower

Lauren Santiago, Mauricio Teixeira


Agenda

● How you got started with Ansible

● Ansible Tower Infrastructure Setup

● On-Call Process Automation

● Nagios Event Handlers and Ansible Tower

● Q&A


1. How did you get started with Ansible?

2. How long have you been using it?

3. What's your favorite thing to do when you Ansible?


Ansible Tower Infrastructure Setup


Steps that were automated with On-Call Process


Configuration and Infrastructure Management

- Red Hat IT uses a wide range of technologies. In this context, the most notable are:

  - Puppet
  - Nagios
  - Ansible Engine
  - Ansible Tower

- Integration between Nagios and Ansible Tower did not exist, so we had to develop our own (and we open sourced it!)


Operating Principle of Nagios

“Operating principle of Nagios” by Diglinks, Public Domain (Source: https://en.wikipedia.org/wiki/Nagios)


Example of Host Monitored by Nagios

“Example of host and service status details page in Nagios Core 4.0.8” by Jhom526, CC BY-SA 4.0 (Source: https://en.wikipedia.org/wiki/Nagios)


Standard Monitoring Workflow

- Nagios check #1 = OK, queues next check 5 minutes later
- Nagios check #2 = CRITICAL, waits 1 minute to try again
- Nagios check #3 = CRITICAL, waits 1 minute to try again
- Nagios check #4 = CRITICAL, alerts on-call person (using whatever methods have been defined)
- On-call person = ACK, silences the alert
- On-call person = looks for documentation, and performs necessary actions
- Nagios check #N = OK, queues next check 5 minutes later
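The retry schedule above can be sketched as a small decision function. This is only an illustration: the parameter names mirror the Nagios directives `max_check_attempts`, `check_interval` and `retry_interval`, and the values 3, 5 and 1 are inferred from the slide rather than taken from Red Hat IT's actual configuration.

```python
def next_step(state, attempt, max_check_attempts=3,
              check_interval=5, retry_interval=1):
    """Return (action, minutes until the next check) for one check result.

    The default values are assumptions read off the slide, not a real config.
    """
    if state == "OK":
        return ("ok", check_interval)
    if attempt < max_check_attempts:
        # Still a "soft" problem: retry sooner, nobody is paged yet.
        return ("retry", retry_interval)
    # Retries exhausted: the on-call person is alerted.
    return ("notify", check_interval)


# Check #1 is OK, checks #2 to #4 are CRITICAL attempts 1 to 3.
results = [("OK", 1), ("CRITICAL", 1), ("CRITICAL", 2), ("CRITICAL", 3)]
timeline = [next_step(state, attempt) for state, attempt in results]

for (state, attempt), (action, wait) in zip(results, timeline):
    print(f"{state} attempt={attempt}: {action}, next check in {wait} min")
```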


Monitoring Workflow with Ansible Tower

- Nagios check #1 = OK, queues next check 5 minutes later
- Nagios check #2 = CRITICAL, triggers the event handler, but script skips the run (prevents false positives)
- Nagios check #3 = CRITICAL, the event handler makes a call to the Ansible Tower API that triggers a job:
  - Sets downtime on the service in Nagios
  - Takes the node out of load balancing
  - Takes corrective action and validates
  - Puts the node back into load balancing
  - Creates a blog post about the event
  - Sends a notification to the IT on-call person
- Nagios check #N = OK, queues next check 5 minutes later
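The "skip the first CRITICAL, act on the second" behaviour could be expressed as a guard like the one below. It is a hypothetical sketch, not the logic of the real handler script: the function name `should_launch`, the `trigger_attempt` threshold and the rule of skipping services already in downtime are all assumptions.

```python
def should_launch(state, attempt, downtime, trigger_attempt=2):
    """Decide whether the event handler should start a Tower job.

    Hypothetical rule set, assumed for illustration:
    - only act on CRITICAL results,
    - do nothing while the service is already in scheduled downtime,
    - skip the first failed attempt to avoid reacting to false positives.
    """
    if state != "CRITICAL":
        return False
    if int(downtime) > 0:
        return False
    return int(attempt) == trigger_attempt


print(should_launch("CRITICAL", 1, 0))  # first failure: skipped
print(should_launch("CRITICAL", 2, 0))  # second failure: job is launched
```

Comparing with `==` rather than `>=` means the job is started once per incident under these assumptions, instead of on every later retry.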


Nagios Event Handlers and Ansible Tower

- Developers deploy services, and define their Nagios checks using Puppet modules developed by IT, with very little code (some are already pre-defined)

- Ansible Tower automatically generates a “generic” host inventory from the list of hosts monitored by Nagios

- IT built a set of standard systems repair playbooks that have been loaded in Ansible Tower as Job Templates

- Developers are welcome to build their own repair playbooks and host inventories
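The slides do not say how the "generic" inventory is produced. As a rough sketch, a list of hosts known to Nagios could be turned into the JSON structure that Ansible accepts from inventory scripts. The function name `build_inventory` and the sample host names are hypothetical.

```python
import json


def build_inventory(monitored_hosts, group="generic"):
    """Build an Ansible inventory-script style dict from a list of host names."""
    hosts = sorted(set(monitored_hosts))  # de-duplicate and keep output stable
    return {
        group: {"hosts": hosts},
        # An empty hostvars entry per host avoids per-host lookups by Ansible.
        "_meta": {"hostvars": {host: {} for host in hosts}},
    }


# Hypothetical host names, as they might be read from the Nagios configuration.
inventory = build_inventory(
    ["server02.example.com", "server01.example.com", "server01.example.com"]
)
print(json.dumps(inventory, indent=2))
```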


- Nagios event handler definition

    define command {
        command_name    tower-handler-self-full
        command_line    $USER1$/extra/eventhandlers/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$' --extra_vars '$ARG3$' --limit '$HOSTADDRESS$'
    }
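On the receiving side, the script has to accept the flags passed in `command_line`. A minimal sketch with `argparse` follows. The flag names come from the definition above, while the types, defaults and help text are assumptions and may differ from the real `tower_handler.py`.

```python
import argparse


def build_parser():
    """Parser for the flags used in the command definition (types are assumed)."""
    parser = argparse.ArgumentParser(description="Nagios event handler sketch")
    parser.add_argument("--state", required=True, help="$SERVICESTATE$")
    parser.add_argument("--attempt", type=int, required=True, help="$SERVICEATTEMPT$")
    parser.add_argument("--downtime", type=int, default=0, help="$SERVICEDOWNTIME$")
    parser.add_argument("--service", required=True, help="$SERVICEDESC$")
    parser.add_argument("--hostname", required=True, help="$HOSTADDRESS$")
    parser.add_argument("--template", required=True, help="$ARG1$")
    parser.add_argument("--inventory", required=True, help="$ARG2$")
    parser.add_argument("--extra_vars", default="", help="$ARG3$")
    parser.add_argument("--limit", default="", help="$HOSTADDRESS$")
    return parser


# The values Nagios would substitute for the example service used in this deck.
args = build_parser().parse_args([
    "--state", "CRITICAL", "--attempt", "2", "--downtime", "0",
    "--service", "HTTPd", "--hostname", "server01.example.com",
    "--template", "Restart Service", "--inventory", "generic",
    "--extra_vars", "service_name: httpd", "--limit", "server01.example.com",
])
print(args.template, args.attempt, args.extra_vars)
```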


- Nagios service check definition (with handler)

    define service {
        use                    generic-service
        host_name              server01.example.com
        service_description    HTTPd
        service_groups         prod_webservice
        contact_groups         it-on-call
        check_command          check_nrpe_command!check_proc_httpd
        event_handler          tower-handler-self-full!Restart Service!generic!service_name: httpd
    }
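Nagios separates a command name from its arguments with `!` and exposes the arguments as `$ARG1$`, `$ARG2$` and so on. The sketch below imitates that expansion for the `event_handler` value above, to show which value ends up behind `--template`, `--inventory` and `--extra_vars`. The helper `expand_command` is illustrative and not part of Nagios.

```python
def expand_command(value):
    """Split 'name!arg1!arg2...' into the command name and its $ARGn$ macros."""
    name, *arguments = value.split("!")
    macros = {f"$ARG{i}$": arg for i, arg in enumerate(arguments, start=1)}
    return name, macros


name, macros = expand_command(
    "tower-handler-self-full!Restart Service!generic!service_name: httpd"
)
print(name)
for macro, value in macros.items():
    print(macro, "=", value)
```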


- Handler call to Ansible Tower (simplified version)

    job_data['extra_vars'] = args.extra_vars
    job_started = client.post('/job_templates/%s/launch/' % template_number, data=job_data)
    if job_started.json()['id'] and job_started.json()['job']:
        log_run("OK: job started")
    else:
        log_run("ERROR: start job failed")
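To try the launch logic without a Tower server, the call can be wrapped in a function that takes the client as a parameter. In this sketch, `FakeClient` and `FakeResponse` are stand-ins invented for the example, the `limit` key is an assumption based on the `--limit` flag, and `launch_job` returns the message instead of calling `log_run`.

```python
class FakeResponse:
    """Stand-in for an HTTP response object exposing .json()."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class FakeClient:
    """Stand-in for the API client; records calls instead of sending them."""

    def __init__(self, payload):
        self.payload = payload
        self.calls = []

    def post(self, path, data=None):
        self.calls.append((path, data))
        return FakeResponse(self.payload)


def launch_job(client, template_number, extra_vars, limit):
    """Launch a job template and return the handler message."""
    job_data = {"extra_vars": extra_vars, "limit": limit}
    response = client.post("/job_templates/%s/launch/" % template_number, data=job_data)
    body = response.json()
    # .get() avoids a KeyError when the API answers with an error document.
    if body.get("id") and body.get("job"):
        return "OK: job started"
    return "ERROR: start job failed"


client = FakeClient({"id": 69817, "job": 69817})
message = launch_job(client, 42, "service_name: httpd", "server01.example.com")
print(message)
```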


- Log of Ansible Tower being called

/var/log/messages

    Oct 02 14:00:00 nagios01 tower_handler.py[840]: job_number=69817 job_status="STARTED" service="HTTPd" hostname="server01.example.com" service_state="CRITICAL" service_attempt=2 service_downtime=0 template="Restart Service" inventory="generic" extra_vars="service_name: httpd" limit="server01.example.com" handler_message="OK: job started"
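Because the handler logs `key=value` pairs, the entries are easy to post-process, for example to count how many jobs were started per template. The following sketch parses the line above with `shlex`. The function name `parse_handler_log` is hypothetical, and it assumes the message part always follows the first `"]: "`.

```python
import shlex

LINE = (
    'Oct 02 14:00:00 nagios01 tower_handler.py[840]: job_number=69817 '
    'job_status="STARTED" service="HTTPd" hostname="server01.example.com" '
    'service_state="CRITICAL" service_attempt=2 service_downtime=0 '
    'template="Restart Service" inventory="generic" '
    'extra_vars="service_name: httpd" limit="server01.example.com" '
    'handler_message="OK: job started"'
)


def parse_handler_log(line):
    """Return the key=value fields of one handler log line as a dict of strings."""
    _, _, message = line.partition("]: ")  # drop the syslog prefix
    fields = {}
    for token in shlex.split(message):  # shlex keeps quoted values together
        key, separator, value = token.partition("=")
        if separator:
            fields[key] = value
    return fields


fields = parse_handler_log(LINE)
print(fields["job_number"], fields["template"], fields["handler_message"])
```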


Nagios/Tower handler script

- The integration between Nagios and Tower was developed by Red Hat IT, and open sourced so the world can use it

- https://github.com/ansible/tower-nagios-integration

- Contributions are welcome


Success with Ansible and Ansible Tower for Red Hat IT

- Over 100 pager playbooks have been automated since February 2018
- 50 of them were converted into a single generic service handler template
- 15% of production alerts are automatically handled by Tower, and never page the on-call person
- Moving more application release processes out of pure Ansible Engine into Tower
- Planning to migrate from Jenkins to Ansible Tower for base OS image build, and from Ansible Engine for VM deployments
- Multiple teams in IT began implementing their own automation needs using our Ansible Tower setup


Q&A

