9/15/14
@dougbarth
DEVOPSDAYS TORONTO 2014
Failure Friday!
Dev
Ops
DevOps Engineer
“DO NOT FEAR FAILURE” BY TOMASZ STASIUK
How is babby PagerDuty formed?
Designed for reliability
Downstream providers fail
3 phone providers
3 email providers
6 SMS providers
PagerDuty providers fail
2 cloud providers
3 data centers
Hung up on details
Bugs in exceptional code paths
Systems not recovering as quickly as expected
What is normal when things are abnormal?
Simian Army
Chaos Monkey
Latency Monkey
Chaos Gorilla
Chaos Kong
“WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817
Keep it simple
“KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES
Process
“HOW TO DRAW AN OWL” BY CHESTER
Get buy-in
“ANGRY BOSS” BY KAUSHAL KARKHANIS
Schedule
1 hour recurring meeting
Developers & Operations
List the attacks and identify the victim
Finish as much as possible
Before starting
Disable cron jobs & configuration management (CM)
Announce the start
Open up relevant dashboards
Leave alarms enabled
Attacks
Test a single host, then a whole DC
5 minutes per attack
Return to a working state
Stop if things break
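The attack rules above can be sketched as a small shell wrapper. This is a minimal illustration, not the tooling used at PagerDuty: `run_attack` and `ATTACK_SECONDS` are hypothetical names, and the 300-second default simply mirrors the 5-minute window. The revert command runs on normal exit and on Ctrl-C, so stopping early still returns the host to a working state.

```shell
# Hypothetical helper: run one attack for a fixed window, then always revert.
# ATTACK_SECONDS is an assumed knob; it defaults to 300 (5 minutes).
run_attack() (
    start_cmd=$1
    stop_cmd=$2
    duration=${ATTACK_SECONDS:-300}

    # Revert whenever this subshell exits, including after Ctrl-C,
    # so "stop if things break" is a single keypress.
    trap 'sh -c "$stop_cmd"' EXIT
    trap 'exit 130' INT TERM

    echo "$(date -u +%H:%M:%S) starting attack: $start_cmd"
    sh -c "$start_cmd" || exit 1
    sleep "$duration"
    echo "$(date -u +%H:%M:%S) reverting: $stop_cmd"
)
```

For example, `run_attack 'service cassandra stop' 'service cassandra start'` would stop the service on one host for five minutes and then start it again. The start command is an assumption, since the slides only show the stop.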
Keep a log
Keep track of actions taken
Times are super important
Also track discoveries and TODOs
Share dashboards/metrics
Chat rooms make this easy
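If no chat room is available, a timestamped log can be sketched in a few lines of shell. The names `FF_LOG`, `ff_log`, and `ff_todos` and the entry kinds (ACTION, DISCOVERY, TODO) are hypothetical and chosen only for illustration.

```shell
# Hypothetical log file; override FF_LOG to put it somewhere shared.
FF_LOG=${FF_LOG:-failure-friday.log}

# Append one entry: UTC timestamp, a kind (ACTION, DISCOVERY, TODO), free text.
ff_log() {
    kind=$1
    shift
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$FF_LOG"
}

# Print only the text of TODO entries, ready to copy into an issue tracker.
ff_todos() {
    grep -E '^[^ ]+ TODO ' "$FF_LOG" | cut -d' ' -f3-
}
```

Usage might look like `ff_log ACTION "stopped cassandra on host1"` during the session and `ff_todos` at the end. Every entry carries a timestamp, so it can be lined up against the dashboards afterwards.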
Graphs are awesome
Finishing up
Sound the all clear
Re-enable cron jobs & CM
Move TODOs to issue tracker
Attack Strategies
“UNICORN ATTACK!” BY SAM HOWZIT
service cassandra stop
shutdown -r now
iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
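The slide only shows how to start the network isolation. A hedged sketch of pairing it with a revert follows: each rule inserted with `-I` can be removed by repeating the same rule specification with `-D`. The function name `isolation_rules` and the print-only approach are assumptions made for illustration. Ports 9160 and 7000 are Cassandra's default Thrift client port and inter-node port.

```shell
# Ports from the slide: 9160 (Cassandra Thrift clients), 7000 (inter-node).
PORTS="9160 7000"

# Print the iptables commands for one action instead of running them, so
# they can be reviewed first. "block" inserts the DROP rules at position 1;
# "unblock" deletes the same rules by specification.
isolation_rules() {
    case $1 in
        block)   op="-I"; pos=" 1" ;;
        unblock) op="-D"; pos="" ;;
        *) echo "usage: isolation_rules block|unblock" >&2; return 2 ;;
    esac
    for port in $PORTS; do
        echo "iptables $op INPUT$pos -p tcp --dport $port -j DROP"
        echo "iptables $op OUTPUT$pos -p tcp --sport $port -j DROP"
    done
}
```

Piping the output to a root shell (for example `isolation_rules block | sudo sh`, and later `isolation_rules unblock | sudo sh`) would apply and then remove the rules.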
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
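In the netem command, `delay 500ms 100ms` adds 500 ms of latency with up to 100 ms of jitter in either direction, and `loss 5%` drops 5% of packets. The sketch below pairs it with the matching inspect and remove commands. `netem_cmd` and the `IFACE` variable are hypothetical names, and the commands are printed rather than executed so they can be checked first.

```shell
# Assumed interface name; eth0 matches the slide, override IFACE otherwise.
IFACE=${IFACE:-eth0}

# Print the tc command for one action on the chosen interface.
netem_cmd() {
    case $1 in
        on)   echo "tc qdisc add dev $IFACE root netem delay 500ms 100ms loss 5%" ;;
        off)  echo "tc qdisc del dev $IFACE root netem" ;;
        show) echo "tc qdisc show dev $IFACE" ;;
        *)    echo "usage: netem_cmd on|off|show" >&2; return 2 ;;
    esac
}
```

A session could run `netem_cmd on | sudo sh`, confirm with `netem_cmd show | sudo sh`, and finish with `netem_cmd off | sudo sh` to return to a working state.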
“RESULTS READER BOARD” BY ROSA SAY
Issues fixed
Aggressive restarts by monit
Large files on ext3 volumes
Failing to restart due to bad /etc/fstab file
High latency from a network-isolated cache
Low capacity with a lost DC
Missing alerts/metrics
Cultural impact
Knowledge sharing
Highlights untestable systems
Keeps failure handling on everyone’s mind
Future plans
“ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY
Break more things
Start testing whole DC outages
Break multiple services at once
Distribute failure testing to teams
Automate
Summary
Failures will happen
Proactively test failure handling now
Choose something easy: app server, cache
Automate later