Failure Friday: Start Injecting Failure Today!

Date post: 17-Jul-2015
Upload: pagerduty
9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!
Transcript
Page 1: Failure Friday: Start Injecting Failure Today!

9/15/14

@dougbarth

DEVOPSDAYS TORONTO 2014

Failure Friday!

Page 2: Failure Friday: Start Injecting Failure Today!

Dev

Ops

Page 3: Failure Friday: Start Injecting Failure Today!

DevOps Engineer

Page 5: Failure Friday: Start Injecting Failure Today!

How is babby PagerDuty formed?

Page 7: Failure Friday: Start Injecting Failure Today!

Designed for reliability

Downstream providers fail

3 phone providers

3 email providers

6 SMS providers

PagerDuty providers fail

2 cloud providers

3 data centers

Page 8: Failure Friday: Start Injecting Failure Today!

Hung up on details

Bugs in exceptional code paths

Systems not recovering as quickly as expected

What is normal when things are abnormal?

Page 10: Failure Friday: Start Injecting Failure Today!

Simian Army

Chaos Monkey

Latency Monkey

Chaos Gorilla

Chaos Kong

“WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817

Page 11: Failure Friday: Start Injecting Failure Today!

Keep it simple

“KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES

Page 14: Failure Friday: Start Injecting Failure Today!

Schedule

1 hour recurring meeting

Developers & Operations

Prepare a list of attacks and identify the victim

Finish as much as possible

Page 15: Failure Friday: Start Injecting Failure Today!

Before starting

Disable cron jobs & the CM (configuration management) system

Announce the start

Open up relevant dashboards

Leave alarms enabled

Page 16: Failure Friday: Start Injecting Failure Today!

Attacks

Test a single host and then DC

Run each attack for 5 minutes

Return to a working state

Stop if things break
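
The attack loop on this slide (inject, wait about five minutes, return to a working state, bail out if things break) can be sketched as a small shell helper. This is only an illustration: the function name `run_attack`, its argument order, and the 300-second default are assumptions, not something shown in the talk.

```shell
# Hypothetical helper (not from the talk): run one attack for a fixed
# window, then always run the revert command.
# Usage: run_attack "ATTACK_CMD" "REVERT_CMD" [SECONDS]
run_attack() {
  local attack="$1" revert="$2" duration="${3:-300}"

  # If the operator aborts (Ctrl-C) because things broke, revert first.
  # Intended for use inside a script; this replaces any existing INT/TERM traps.
  trap "$revert; exit 130" INT TERM

  echo "$(date -u +%H:%M:%SZ) START  $attack"
  if ! eval "$attack"; then
    echo "attack command failed, reverting" >&2
    eval "$revert"
    trap - INT TERM
    return 1
  fi

  sleep "$duration"

  eval "$revert"
  echo "$(date -u +%H:%M:%SZ) REVERT $revert"
  trap - INT TERM
}
```

Run as root on the single victim host, a call might look like `run_attack "service cassandra stop" "service cassandra start"`. Moving from one host to a whole DC would be a separate, deliberate step once the single-host run looks healthy.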

Page 17: Failure Friday: Start Injecting Failure Today!

Keep a log

Keep track of actions taken

Times are super important

Also track discoveries and TODOs

Share dashboards/metrics

Chat rooms make this easy
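
A chat room already timestamps every message, but a local log file can serve as a backup. The sketch below is an assumption-laden example: the `ff_log` function, the `FF_LOG` variable, and the `ACTION`/`DISCOVERY`/`TODO` labels are hypothetical names chosen for illustration.

```shell
# Hypothetical log helper (not from the talk). Each entry gets a UTC
# timestamp and a kind such as ACTION, DISCOVERY, or TODO.
FF_LOG="${FF_LOG:-failure-friday.log}"

ff_log() {
  local kind="$1"
  shift
  # Print the entry and append it to the log file.
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" | tee -a "$FF_LOG"
}
```

For example, `ff_log ACTION "stopped cassandra on one host"` followed later by `ff_log TODO "add alert for slow recovery"` leaves a timeline that can be pasted into the issue tracker afterwards.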

Page 18: Failure Friday: Start Injecting Failure Today!

Graphs are awesome

Page 19: Failure Friday: Start Injecting Failure Today!

Finishing up

Sound the all clear

Re-enable cron jobs & CM

Move TODOs to issue tracker

Page 20: Failure Friday: Start Injecting Failure Today!

Attack Strategies

“UNICORN ATTACK!” BY SAM HOWZIT

Page 21: Failure Friday: Start Injecting Failure Today!

service cassandra stop

Page 22: Failure Friday: Start Injecting Failure Today!

shutdown -r now

Page 23: Failure Friday: Start Injecting Failure Today!

iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
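
The slide shows only how to apply the DROP rules. Returning to a working state also means removing them. A minimal sketch of both directions follows, assuming the same two ports (9160 and 7000, which are Cassandra's default Thrift client and inter-node ports). The function names and the `FF_APPLY` dry-run switch are hypothetical. By default the commands are only printed.

```shell
# Hypothetical wrapper (not from the talk). Commands are printed, and only
# executed when FF_APPLY=1 is set (which requires root).
CASSANDRA_PORTS="9160 7000"

ff_run() {
  echo "$*"
  if [ "${FF_APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

# Insert DROP rules at the top of the INPUT and OUTPUT chains.
isolate_ports() {
  local port
  for port in $CASSANDRA_PORTS; do
    ff_run iptables -I INPUT 1 -p tcp --dport "$port" -j DROP
    ff_run iptables -I OUTPUT 1 -p tcp --sport "$port" -j DROP
  done
}

# Delete the same rules by specification to end the isolation.
restore_ports() {
  local port
  for port in $CASSANDRA_PORTS; do
    ff_run iptables -D INPUT -p tcp --dport "$port" -j DROP
    ff_run iptables -D OUTPUT -p tcp --sport "$port" -j DROP
  done
}
```

Calling `isolate_ports` without `FF_APPLY=1` prints the four commands for review. Running `FF_APPLY=1 restore_ports` afterwards removes the rules that the isolation step inserted.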

Page 24: Failure Friday: Start Injecting Failure Today!

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
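
In the netem command above, `delay 500ms 100ms` adds 500 ms of latency with 100 ms of jitter, and `loss 5%` drops roughly one packet in twenty. As with the firewall rules, the qdisc has to be removed to end the attack. The sketch below pairs the two steps. The function names and the `FF_APPLY` dry-run switch are assumptions for illustration, and `eth0` is simply the interface named on the slide.

```shell
# Hypothetical wrapper (not from the talk). Commands are printed, and only
# executed when FF_APPLY=1 is set (which requires root).
ff_tc() {
  echo "tc $*"
  if [ "${FF_APPLY:-0}" = "1" ]; then
    tc "$@"
  fi
}

# Add latency, jitter, and packet loss on the given interface.
slow_network() {
  local dev="${1:-eth0}"
  ff_tc qdisc add dev "$dev" root netem delay 500ms 100ms loss 5%
}

# Remove the root qdisc, restoring the interface's default behavior.
restore_network() {
  local dev="${1:-eth0}"
  ff_tc qdisc del dev "$dev" root
}
```

Note that this slows all traffic on the interface, including the SSH session used to run it, so it helps to have the revert command ready before applying the attack.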

Page 26: Failure Friday: Start Injecting Failure Today!

Issues fixed

Aggressive restarts by monit

Large files on ext3 volumes

Failing to restart due to bad /etc/fstab file

High latency from a network-isolated cache

Low capacity with a lost DC

Missing alerts/metrics

Page 27: Failure Friday: Start Injecting Failure Today!

Cultural impact

Knowledge sharing

Highlights untestable systems

Keeps failure handling on everyone’s mind

Page 29: Failure Friday: Start Injecting Failure Today!

Break more things

Start testing whole DC outages

Break multiple services at once

Distribute failure testing to teams

Automate

Page 31: Failure Friday: Start Injecting Failure Today!

Summary

Failures will happen

Proactively test failure handling now

Choose something easy: app server, cache

Automate later

Page 32: Failure Friday: Start Injecting Failure Today!

pagerduty.com/jobs

Thank you.
