Reliability Patterns for Distributed Applications

Andrew Hamilton

Reliability Patterns for Web Applications

$ whoami

Site Reliability Engineer

Development and Operations but NOT a DevOps Engineer

Developer productivity

Zefr, Prevoty, Twitter, Eucalyptus, CSUN PTG

What is reliability?

Your application working when your users need it

A user’s #1 unstated feature request

Your application telling you when things aren’t working, and being able to fix things quickly

Reliability does not completely remove failure

Failure will happen no matter what you do

Perfection is not an obtainable goal

Deal with failure gracefully and reduce the impact of failures

Reduce the chance of failure by building repeatable and reliable automated processes

Where should you begin?

Where should you begin? Build your app

Build packages for your code (zips/tarballs, RPMs/Debs, containers)

Automate builds with a CI environment (Jenkins, TravisCI)

Where should you begin? Test your app

Automate testing of your app

Unit tests should be easy to run and quick (< 10 minutes)

Functional tests can take longer and can become less reliable

Manual testing can also be done, but keep it to a minimum

Where should you begin? App deployment

Automate the entire process from VM/container setup to app deployment

Make it multi environment (dev, stage, prod)

Make it one command

Needs to be repeatable and reliable

Where should you begin? Configuration

App configurations should be easy to change

Don’t hardcode values that should be configurable

12 factor apps

Config files (YAML, JSON, key:value)
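The configuration advice above could be sketched in Python roughly as follows: defaults live in code, an optional JSON file overrides them, and environment variables override the file. The names `APP_CONF`, `load_config`, and the keys in `DEFAULTS` are hypothetical and only serve the illustration.

```python
import json
import os

# Hypothetical defaults; a real app would list its own settings here.
DEFAULTS = {"db_host": "localhost", "db_port": 5432, "log_level": "info"}


def load_config(path=None, environ=None):
    """Merge defaults, an optional JSON file, and environment overrides."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    # APP_CONF is an assumed variable name pointing at a JSON config file.
    path = path or environ.get("APP_CONF")
    if path and os.path.exists(path):
        with open(path) as fh:
            config.update(json.load(fh))

    # Environment variables such as APP_DB_HOST win over the file.
    for key, default in DEFAULTS.items():
        value = environ.get("APP_" + key.upper())
        if value is not None:
            # Keep the type of the default (e.g. int for db_port).
            config[key] = type(default)(value)
    return config
```

With this shape, the same package can move between dev, stage, and prod, and only the file or the environment changes.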

Where should you begin? DevOps

Communication is key for reliability

Make sure that people in development and operations know what’s happening with your app

But really this isn’t enough...

Where should you begin? DevOps

Communication is key for reliability

Make sure that people in development, operations, product management, testing, security, design, marketing, and management know what’s happening with your app

Make sure that other teams know when something is happening that may affect their app

What’s next?

What’s next? Logging

Find a logging format and standardize

Try to find an easy to understand, structured logging format

Make sure your logger is leveled (Debug, Info, Error, Panic)

Expect to use log messages at 3am

What’s next? Logging

func myFunc() {
    rtn, err := doSomething(val1, val2)
    if err != nil {
        log.Print(err) // Don't do this! The message has no context.
        return
    }
    // ... use rtn ...
}

What’s next? Logging

func myFunc() {
    rtn, err := doSomething(val1, val2)
    if err != nil {
        // Say what failed and where, then include the underlying error.
        log.Printf("doSomething call failed in myFunc: %s", err)
        return
    }
    // ... use rtn ...
}

What’s next? Logging

time=2012:11:24T17:32:23.3435 type=error func=myFunc host=host1 line=4 msg="doSomething call failed in myFunc: Error marshaling JSON"

{
    "time": "2012:11:24T17:32:23.3435",
    "host": "host1",
    "type": "error",
    "func": "myFunc",
    "line": 4,
    "msg": "doSomething call failed in myFunc: Error marshaling JSON"
}
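One way to produce JSON lines like the example above from Python's standard `logging` module is a custom formatter. This is a minimal sketch, not a recommendation of a specific library; `JSONFormatter` and `make_logger` are hypothetical names, and the field names simply mirror the slide.

```python
import json
import logging
import socket


class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "host": socket.gethostname(),
            "type": record.levelname.lower(),
            "func": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


def make_logger(name="app", stream=None):
    """Build a leveled logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JSONFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

Calling `make_logger().error("doSomething call failed in myFunc: %s", err)` then emits one structured line that a log aggregator can parse without regular expressions.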

What’s next? Aggregate Logging

One place to view all of your app’s logs

With structured logging you can pull out metrics

ELK stack - Elasticsearch, Logstash, Kibana

Splunk

What’s next? Monitoring

https://twitter.com/sadserver/status/689588269047132160

What’s next? Monitoring

Needs to be relatively real time (sub 15s)

Start with standard metrics on all requests (counts, latencies)

Add more metrics where you need them

Create a dashboard with important info

StatsD/Graphite/Grafana, Prometheus, DataDog, Netuitive

Nagios is not sufficient for application monitoring

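What’s next? Monitoring

from time import time
from flask import g

# `app` is the Flask app; `statsd` is assumed to be a tag-aware StatsD
# client (for example DataDog's) that is already imported and configured.

@app.before_request
def before_request():
    g.request_time = time()  # remember when the request started

@app.after_request
def after_request(response):
    total_time = (time() - g.request_time) * 1000  # latency in ms
    statsd.timing("app.latency", total_time, ["name:app"], 1)
    statsd.increment("app.request", 1, ["name:app", "status_code:{0}".format(response.status_code)], 1)
    return response  # Flask requires after_request handlers to return it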

What’s next? Alerting

Uses the monitoring system’s data to make sure the app is healthy

Sends out emails to the on-call dev or ops when issues occur

Requires knowledge of an app to create

PagerDuty, BigPanda, VictorOps

Area that still needs some work

What’s next? Remove state

State is something like session information

Move to an external store all servers can access

Memory-based stores are the norm (memcached, Redis)

Allows you to horizontally scale your app behind a LB

What’s next? Have more than 1 of everything

You need more than one instance of your service

It shouldn’t just be a primary/backup either

Remove your single points of failure as quickly as possible

What’s next? Retries and backoff

Things can fail from time to time

Resending a request can be helpful

Be careful not to DDoS another app because it went down and came back

Exponential backoff is good

What’s next? Retries and backoff

import time

def my_func(val1, val2):
    data = None
    err = None
    for n in range(10):
        data, err = get_data(val1, val2)
        if err is None:
            break
        time.sleep((2**n) / 1000)  # sleep for 2^n milliseconds

    if err is not None:
        return None, err

    return do_something(data)

I’m bored! What’s cool?

I’m bored! What’s cool? Canary deploys

“Canary in the coal mine”

Deploy new code to a single instance

Watch that instance with your monitoring stack

Add more new instances, remove old instances gradually

Helps ensure that a release is good before it takes all traffic

Can be automated
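The canary steps above could be automated along these lines. This is a sketch under stated assumptions: `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks into your own deployment and monitoring tooling, and a real rollout would wait and watch metrics for a while before each health check.

```python
def canary_rollout(instances, deploy, is_healthy, rollback):
    """Deploy to one instance first, then to growing batches.

    Returns True if every instance ends up on the new release, False if the
    rollout was stopped and rolled back.
    """
    updated = []
    batch_size = 1  # the canary is a single instance
    remaining = list(instances)
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for instance in batch:
            deploy(instance)
            updated.append(instance)
        # Check monitoring for everything on the new release so far.
        if not all(is_healthy(i) for i in updated):
            for instance in updated:
                rollback(instance)
            return False
        batch_size *= 2  # widen the rollout gradually
    return True
```

Doubling the batch size is only one possible schedule; the point is that the first step touches a single instance and each later step is gated on health.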

I’m bored! What’s cool? Microservices

The Unix philosophy brought to apps

Each service does only one thing

Requires a good build and deployment system

Requires monitoring, logging, alerting

Monolith → microservices

I’m bored! What’s cool? Feature flags

Allows for features to be turned on and off inside the code base

Start off with a configuration file

Make sure to read the configuration into memory

Can be left in after testing or removed

Can be dynamic eventually

I’m bored! What’s cool? Feature flags

def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    ...  # run some code

I’m bored! What’s cool? Feature flags

def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    ...  # new code added here...YOLO

I’m bored! What’s cool? Feature flags

import os

# read_config is assumed to load the flags file into a dict, once, at startup
ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    if ff["do_something_ver"] == 2:
        rtn = do_something_2()
    else:
        rtn = do_something()
    print(rtn)

def do_something():
    ...  # run some code

def do_something_2():
    ...  # new way to do something

I’m bored! What’s cool? Dark deploys

Test new features and functionality with real users

They won’t know that anything has changed

Runs the old and new code and checks output

Great with easy concurrency

Feature flags can be useful

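I’m bored! What’s cool? Dark deploys

import logging
import os

log = logging.getLogger(__name__)
ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    rtn = do_something()

    if ff["run_do_something_2"]:
        # Run the new code alongside the old and only log any difference
        rtn2 = do_something_2()
        if rtn != rtn2:
            log.error("do_something and do_something_2 do not match! {0} != {1}".format(rtn, rtn2))

    print(rtn)  # users still get the old result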

I’m bored! What’s cool? Loose coupling

Graceful degradation

Services continue to run when dependency services fail

Output might not be complete but will be as complete as possible

Third party apps with issues won’t take down your app

Important for both backend and frontend

Common with data stores
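As a rough illustration of graceful degradation, the sketch below builds a response from one required dependency and one optional one. The function and callback names (`build_profile_page`, `fetch_profile`, `fetch_recommendations`) are hypothetical; the pattern is simply to catch the failure of the optional call, log it, and return what is available.

```python
import logging

log = logging.getLogger("app")


def build_profile_page(user_id, fetch_profile, fetch_recommendations):
    """Return page data, leaving out optional parts whose service failed."""
    # The profile is required: let a failure here propagate to the caller.
    page = {"profile": fetch_profile(user_id), "degraded": False}

    # Recommendations are optional: degrade instead of failing the request.
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception as exc:
        log.error("recommendations unavailable for user %s: %s", user_id, exc)
        page["recommendations"] = []
        page["degraded"] = True
    return page
```

The `degraded` flag lets a frontend decide whether to hide a section or show a placeholder, so the output is as complete as possible without failing outright.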

I’m bored! What’s cool? Circuit breakers

Keep track of issues with external services and short circuit calls to them

Design pattern that’s becoming more popular

Netflix Hystrix -- Java
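For readers outside the Java world, the idea can be sketched in a few lines of Python. This is not how Hystrix is implemented; it is a minimal illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) that counts consecutive failures, fails fast while the circuit is open, and allows a trial call after a cool-down.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Short circuit: fail fast without touching the service.
                raise CircuitOpenError("circuit open, call skipped")
            # Otherwise fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically keep one breaker per external service, wrap each call as `breaker.call(fetch, url)`, and treat `CircuitOpenError` as a cue to serve a degraded response.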

I’m bored! What’s cool? Chaos engineering

Inject faults into your production traffic to test your app

Tests how your apps truly cope with issues before they happen

Helps make sure that devs and ops understand the app

Only runs during business hours
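At its smallest scale, fault injection can be a decorator that makes a fraction of calls fail, and only during business hours. The sketch below is an illustration with hypothetical names (`chaos`, `InjectedFault`, `in_business_hours`); the 09:00 to 17:00, Monday to Friday window is an assumed definition, and production chaos tooling usually works at the infrastructure level rather than inside the app.

```python
import datetime
import functools
import random


class InjectedFault(Exception):
    """Marks a failure that was injected on purpose."""


def in_business_hours(now):
    """Monday to Friday, 09:00 to 16:59 local time (an assumed definition)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def chaos(failure_rate, rng=random.random, clock=datetime.datetime.now):
    """Decorator that makes a fraction of calls fail during business hours."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if in_business_hours(clock()) and rng() < failure_rate:
                raise InjectedFault("injected fault in " + func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Decorating a dependency call with `@chaos(0.01)` would fail roughly one call in a hundred while people are at their desks to see how the app copes.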

Reliability doesn’t magically happen!

It must be worked on

It must be prioritized properly and not just assumed to happen organically
