Failure Friday: Start Injecting Failure Today!

Date posted: 17-Jul-2015
Transcript:
  • 9/15/14

    @dougbarth

    DEVOPSDAYS TORONTO 2014

    Failure Friday!

  • Dev

    Ops

  • DevOps Engineer

  • DO NOT FEAR FAILURE BY TOMASZ STASIUK

  • How is babby PagerDuty formed?

  • Designed for reliability

    Downstream providers fail
      • 3 phone providers
      • 3 email providers
      • 6 SMS providers

    PagerDuty providers fail
      • 2 cloud providers
      • 3 data centers

  • Hung up on details

    • Bugs in exceptional code paths
    • Systems not recovering as quickly as expected
    • What is normal when things are abnormal?

  • Simian Army

    • Chaos Monkey
    • Latency Monkey
    • Chaos Gorilla
    • Chaos Kong

    WP7WALLPAPER_EVIL_MONKEY_09 BY SKYLER817

  • Keep it simple

    KISS BAND MEMBER CUPCAKES BY CLEVER CUPCAKES

  • Process

    HOW TO DRAW AN OWL BY CHESTER

  • Get buy-in

    ANGRY BOSS BY KAUSHAL KARKHANIS

  • Schedule

    • 1-hour recurring meeting
    • Developers & Operations
    • List of attacks and identify victim
    • Finish as much as possible

  • Before starting

    • Disable cron jobs & CM system
    • Announce the start
    • Open up relevant dashboards
    • Leave alarms enabled

  • Attacks

    • Test a single host, then a whole data center (DC)
    • Run each attack for 5 minutes
    • Return to a working state between attacks
    • Stop if things break
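The attack rules on this slide could be wrapped in a small helper. The following is a minimal sketch, not tooling shown in the talk: the function name `timed_attack` and its three arguments are hypothetical, and it assumes each attack can be expressed as one inject command and one revert command.

```shell
# Hypothetical helper (not from the talk): hold one failure in place for a
# fixed window, then always put the host back into a working state.
# Usage: timed_attack SECONDS INJECT_CMD REVERT_CMD
timed_attack() (
    duration=$1
    inject=$2
    revert=$3

    # If the operator aborts (Ctrl-C) because something broke, revert first.
    trap 'sh -c "$revert"; exit 130' INT TERM

    echo "$(date -u +%H:%M:%S) inject: $inject"
    if ! sh -c "$inject"; then
        # The injection itself failed: revert anyway and report the error.
        echo "inject failed, reverting" >&2
        sh -c "$revert"
        exit 1
    fi

    # Keep the failure in place for the attack window (300 = 5 minutes).
    sleep "$duration"

    echo "$(date -u +%H:%M:%S) revert: $revert"
    sh -c "$revert"
)
```

Under these assumptions, a five-minute service stop might be run as `timed_attack 300 "service cassandra stop" "service cassandra start"`. The function body runs in a subshell so the trap does not leak into the calling shell.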

  • Keep a log

    • Keep track of actions taken
    • Times are super important
    • Also track discoveries and TODOs
    • Share dashboards/metrics
    • Chat rooms make this easy
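The slide suggests a chat room for the log. Where a local file is preferred, a timestamped log could look roughly like the sketch below. The file name `failure-friday.log`, the function names, and the ACTION/DISCOVERY/TODO tags are all assumptions made for illustration.

```shell
# Hypothetical log helper: the file name and entry tags are assumptions.
FF_LOG=${FF_LOG:-failure-friday.log}

# Append one UTC-timestamped entry. TAG is e.g. ACTION, DISCOVERY or TODO.
ff_log() {
    tag=$1
    shift
    printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$tag" "$*" >> "$FF_LOG"
}

# List the TODO entries so they can be moved to the issue tracker afterwards.
ff_todos() {
    grep -F '[TODO]' "$FF_LOG"
}
```

For example, `ff_log ACTION "stopped cassandra on host1"` records the action with its time, and `ff_todos` prints the follow-up items at the end of the session.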

  • Graphs are awesome

  • Finishing up

    • Sound the all clear
    • Enable crons & CM
    • Move TODOs to issue tracker

  • Attack Strategies

    UNICORN ATTACK! BY SAM HOWZIT

  • service cassandra stop

  • shutdown -r now

  • iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
    iptables -I INPUT 1 -p tcp --dport 7000 -j DROP

    iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
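The slide only shows how the DROP rules are inserted. Returning to a working state means deleting the same rules again, which `iptables -D` does when given the identical rule specification. The sketch below prints both sets of commands rather than running them. The function name `isolation_rules` is hypothetical, and the ports are the ones on the slide (9160 and 7000 are Cassandra's default Thrift client and inter-node ports).

```shell
# Ports taken from the slide.
PORTS="9160 7000"

# Print the iptables commands for ACTION: "block" inserts the DROP rules,
# "unblock" deletes the same rules. Nothing is executed here.
isolation_rules() {
    case $1 in
        block)   in_op="-I INPUT 1"; out_op="-I OUTPUT 1" ;;
        unblock) in_op="-D INPUT";   out_op="-D OUTPUT" ;;
        *) echo "usage: isolation_rules block|unblock" >&2; return 2 ;;
    esac
    for port in $PORTS; do
        echo "iptables $in_op -p tcp --dport $port -j DROP"
        echo "iptables $out_op -p tcp --sport $port -j DROP"
    done
}
```

One way to apply the output, assuming root access on the victim host, is `isolation_rules block | sudo sh`, followed by `isolation_rules unblock | sudo sh` once the attack window is over.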

  • tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
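In the netem command on this slide, `delay 500ms 100ms` adds 500 ms of latency with up to 100 ms of jitter, and `loss 5%` drops 5% of packets. The slide does not show the cleanup step. The sketch below, with the hypothetical function name `netem_cmd`, prints the add, inspect, and delete commands for an interface, assuming the impairment was installed as the root qdisc as on the slide.

```shell
# Print the tc command for ACTION on an interface (default eth0, as on the
# slide). Nothing is executed here.
netem_cmd() {
    action=$1
    dev=${2:-eth0}
    case $action in
        add)  echo "tc qdisc add dev $dev root netem delay 500ms 100ms loss 5%" ;;
        show) echo "tc qdisc show dev $dev" ;;
        del)  echo "tc qdisc del dev $dev root netem" ;;
        *) echo "usage: netem_cmd add|show|del [DEV]" >&2; return 2 ;;
    esac
}
```

With root access, `netem_cmd add | sudo sh` would start the slow-network attack and `netem_cmd del | sudo sh` would end it. `netem_cmd show | sh` is a quick check that no netem qdisc is left behind.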

  • RESULTS READER BOARD BY ROSA SAY

  • Issues fixed

    • Aggressive restarts by monit
    • Large files on ext3 volumes
    • Failing to restart due to bad /etc/fstab file
    • High latency from network-isolated cache
    • Low capacity with a lost DC
    • Missing alerts/metrics

  • Cultural impact

    • Knowledge sharing
    • Highlights untestable systems
    • Keeps failure handling on everyone's mind

  • Future plans

    ROBOT SWORDSMAN FIGHT. BY PATRICK GAGE KELLEY

  • Break more things

    • Start testing whole DC outages
    • Break multiple services at once
    • Distribute failure testing to teams
    • Automate

  • Summary

    • Failures will happen
    • Proactively test failure handling now
    • Choose something easy: app server, cache
    • Automate later

  • pagerduty.com/jobs

    Thank you.
