Reliability Patterns for Distributed Applications

Date post: 16-Feb-2017
Upload: andrew-hamilton
Page 1: Reliability Patterns for Distributed Applications

Reliability Patterns for Distributed Applications

Andrew Hamilton

Page 2: Reliability Patterns for Distributed Applications

Reliability Patterns for Web Applications

Andrew Hamilton

Page 3: Reliability Patterns for Distributed Applications

$ whoami

Page 4: Reliability Patterns for Distributed Applications

$ whoami

Site Reliability Engineer

Development and Operations but NOT a DevOps Engineer

Developer productivity

Zefr, Prevoty, Twitter, Eucalyptus, CSUN PTG

Page 5: Reliability Patterns for Distributed Applications

What is reliability?

Page 6: Reliability Patterns for Distributed Applications

What is reliability?

Your application working when your users need it

A user’s #1 unstated feature request

Your application telling you when things aren’t working, and being able to fix things quickly

Page 7: Reliability Patterns for Distributed Applications

Reliability does not completely remove failure

Page 8: Reliability Patterns for Distributed Applications

Reliability does not completely remove failure

Failure will happen no matter what you do

Perfection is not an attainable goal

Deal with failure gracefully and reduce the impact of failures

Reduce the chance of failure by building repeatable and reliable automated processes

Page 9: Reliability Patterns for Distributed Applications

Where should you begin?

Page 10: Reliability Patterns for Distributed Applications

Where should you begin? Build your app

Build packages for your code (zips/tarballs, RPMs/Debs, containers)

Automate builds with a CI environment (Jenkins, TravisCI)

Page 11: Reliability Patterns for Distributed Applications

Where should you begin? Test your app

Automate testing of your app

Unit tests should be easy to run and quick (< 10m)

Functional tests can take longer and can become less reliable

Manual testing can also be done, but keep it to a minimum

Page 12: Reliability Patterns for Distributed Applications

Where should you begin? App deployment

Automate the entire process from VM/container setup to app deployment

Make it multi-environment (dev, stage, prod)

Make it one command

Needs to be repeatable and reliable

Page 13: Reliability Patterns for Distributed Applications

Where should you begin? Configuration

App configurations should be easy to change

Don’t hardcode values that should be configurable

12-factor apps

Config files (YAML, JSON, key:value)
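
A minimal sketch of the configuration idea, assuming a JSON config file whose path comes from an environment variable. The names `APP_CONF`, `load_config`, and the default keys are hypothetical and not from the talk.

```python
import json
import os

# Hypothetical defaults; a real app would list its own settings here.
DEFAULTS = {"listen_port": 8080, "db_host": "localhost"}


def load_config(path=None):
    """Merge the defaults with values from a JSON file, if the file exists."""
    # APP_CONF is an assumed environment variable name, not a standard one.
    path = path or os.getenv("APP_CONF", "app.json")
    config = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as fh:
            config.update(json.load(fh))
    return config
```

Because nothing is hardcoded past the defaults, the same build can run in dev, stage, and prod with a different file per environment.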

Page 14: Reliability Patterns for Distributed Applications

Where should you begin? DevOps

Communication is key for reliability

Make sure that people in development and operations know what’s happening with your app

Page 15: Reliability Patterns for Distributed Applications

But really this isn’t enough...

Page 16: Reliability Patterns for Distributed Applications

Where should you begin? DevOps

Communication is key for reliability

Make sure that people in development, operations, product management, testing, security, design, marketing, and management know what’s happening with your app

Make sure that other teams know when something is happening that may affect their app

Page 17: Reliability Patterns for Distributed Applications

What’s next?

Page 18: Reliability Patterns for Distributed Applications

What’s next? Logging

Find a logging format and standardize

Try to find an easy-to-understand, structured logging format

Make sure your logger is leveled (Debug, Info, Error, Panic)

Expect to use log messages at 3am

Page 19: Reliability Patterns for Distributed Applications

What’s next? Logging

func myFunc() {
	rtn, err := doSomething(val1, val2)
	if err != nil {
		log.Print(err) // Don’t do this!
	}
}

Page 20: Reliability Patterns for Distributed Applications

What’s next? Logging

func myFunc() {
	rtn, err := doSomething(val1, val2)
	if err != nil {
		log.Printf("doSomething call failed in myFunc: %s", err)
	}
}

Page 21: Reliability Patterns for Distributed Applications

What’s next? Logging

time=2012:11:24T17:32:23.3435 type=error func=myFunc host=host1 line=4 msg="doSomething call failed in myFunc: Error marshaling JSON"

{
    "time": "2012:11:24T17:32:23.3435",
    "host": "host1",
    "type": "error",
    "func": "myFunc",
    "line": 4,
    "msg": "doSomething call failed in myFunc: Error marshaling JSON"
}
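
One way to produce output like the JSON example above is a custom formatter for Python's standard `logging` module. This is a sketch, not the speaker's setup: `JsonFormatter` and `make_logger` are illustrative names, and the field names simply mirror the example.

```python
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "host": socket.gethostname(),
            "type": record.levelname.lower(),
            "func": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


def make_logger(name="app"):
    """Return a leveled logger that writes structured lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `make_logger().error("doSomething call failed in myFunc: %s", err)` then emits one machine-parseable line that still reads well at 3am.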

Page 22: Reliability Patterns for Distributed Applications

What’s next? Aggregate Logging

One place to view all of your app’s logs

With structured logging you can pull out metrics

ELK stack - Elasticsearch, Logstash, Kibana

Splunk

Page 23: Reliability Patterns for Distributed Applications

What’s next? Monitoring

https://twitter.com/sadserver/status/689588269047132160

Page 24: Reliability Patterns for Distributed Applications

What’s next? Monitoring

Needs to be relatively real time (sub 15s)

Start with standard metrics on all requests (counts, latencies)

Add more metrics where you need them

Create a dashboard with important info

StatsD/Graphite/Grafana, Prometheus, DataDog, Netuitive

Nagios is not sufficient for application monitoring

Page 25: Reliability Patterns for Distributed Applications

What’s next? Monitoring

Page 26: Reliability Patterns for Distributed Applications

What’s next? Monitoring

@app.before_request
def before_request():
    g.request_time = time()

@app.after_request
def after_request(response):
    total_time = (time() - g.request_time) * 1000  # latency in milliseconds
    statsd.timing("app.latency", total_time, ["name:app"], 1)
    statsd.increment("app.request", 1, ["name:app", "status_code:{0}".format(response.status_code)], 1)
    return response  # Flask requires after_request handlers to return the response

Page 27: Reliability Patterns for Distributed Applications

What’s next? Alerting

Uses the monitoring system’s data to make sure the app is healthy

Sends out emails to the on-call dev or ops when issues occur

Creating good alerts requires knowledge of the app

PagerDuty, BigPanda, VictorOps

Area that still needs some work
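
As a rough illustration of an alert rule built on request metrics, the sketch below flags a window of responses whose server-error share is too high. The function name and both thresholds are assumptions chosen for the example; real values depend on knowing the app.

```python
def should_alert(status_codes, max_error_rate=0.05, min_requests=20):
    """Return True when the share of 5xx responses exceeds the threshold."""
    if len(status_codes) < min_requests:
        return False  # too little traffic to draw a conclusion
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes) > max_error_rate
```

The minimum-request guard is there so that a single failed request on a quiet night does not page anyone.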

Page 28: Reliability Patterns for Distributed Applications

What’s next? Remove state

State is something like session information

Move it to an external store all servers can access

Memory-based stores are the norm (memcached, Redis)

Allows you to horizontally scale your app behind a LB
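
The sketch below shows the shape of the idea: app instances hold no session data themselves and go through a shared store. `SessionStore` and `AppServer` are hypothetical names, and the in-memory dict is only a stand-in for a network client to something like memcached or Redis.

```python
class SessionStore:
    """Stand-in for an external store such as memcached or Redis.

    Assumption: a real deployment would replace this dict with a network
    client; the in-memory dict only keeps the example runnable.
    """

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def set(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    """An app instance that keeps no session state of its own."""

    def __init__(self, store):
        self.store = store

    def handle(self, session_id):
        session = self.store.get(session_id)
        session["visits"] = session.get("visits", 0) + 1
        self.store.set(session_id, session)
        return session["visits"]
```

Two `AppServer` objects sharing one store see the same session, which is what lets a load balancer send any request to any instance.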

Page 29: Reliability Patterns for Distributed Applications

What’s next? Have more than 1 of everything

You need more than one instance of your service

It shouldn’t just be a primary/backup either

Remove your single points of failure as quickly as possible

Page 30: Reliability Patterns for Distributed Applications

What’s next? Retries and backoff

Things can fail from time to time

Resending a request can be helpful

Be careful not to DDoS another app because it went down and came back

Exponential backoff is good

Page 31: Reliability Patterns for Distributed Applications

What’s next? Retries and backoff

import time

def my_func(val1, val2):
    data = None
    err = None
    for n in range(10):
        data, err = get_data(val1, val2)
        if err is None:
            break
        time.sleep((2**n) / 1000.0)  # sleep for 2^n milliseconds

    if err is not None:
        return None, err

    return do_something(data)
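
A common refinement of the loop above is to cap the delay and add random jitter, so that many clients do not all retry at the same instant when a service comes back. The sketch below is one way to do that; it uses exceptions rather than the returned-error style of the slide, and `retry_with_backoff` and its parameters are illustrative names.

```python
import random
import time


def retry_with_backoff(func, attempts=5, base=0.001, cap=1.0, sleep=time.sleep):
    """Call func until it succeeds, sleeping with capped, jittered backoff."""
    last_error = None
    for n in range(attempts):
        try:
            return func()
        except Exception as exc:  # a real caller would catch narrower errors
            last_error = exc
        if n < attempts - 1:
            # "Full jitter": pick a random delay up to the exponential bound
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, min(cap, base * 2 ** n)))
    raise last_error
```

The `sleep` parameter exists only so tests can pass a fake and run without waiting.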

Page 32: Reliability Patterns for Distributed Applications

I’m bored! What’s cool?

Page 33: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Canary deploys

“Canary in the coal mine”

Deploy new code to a single instance

Watch that instance with your monitoring stack

Add more new instances, remove old instances gradually

Helps assure that a release is good before taking all traffic

Can be automated
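
An automated canary needs two decisions: whether the canary looks healthy, and how many instances to move next. The sketch below shows one possible version of each; the function names, the tolerance, and the doubling schedule are assumptions for illustration, not a prescribed policy.

```python
def canary_is_healthy(canary_errors, canary_requests,
                      baseline_errors, baseline_requests,
                      tolerance=0.01, min_requests=100):
    """Return True if the canary's error rate is within tolerance of baseline."""
    if canary_requests < min_requests:
        return False  # not enough traffic yet to judge
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    return canary_rate <= baseline_rate + tolerance


def next_canary_count(current, total, healthy):
    """Double the number of new instances while healthy; roll back otherwise."""
    if not healthy:
        return 0
    return min(total, max(1, current * 2))
```

The inputs would come from the monitoring stack, for example the per-instance request and error counters.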

Page 34: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Microservices

The Unix philosophy brought to apps

Each service does only one thing

Requires a good build and deployment system

Requires monitoring, logging, alerting

Monolith → microservices

Page 35: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Feature flags

Allows for features to be turned on and off inside the code base

Start off with a configuration file

Make sure to read the configuration into memory

Can be left in after testing or removed

Can be dynamic eventually

Page 36: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Feature flags

def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    pass  # run some code

Page 37: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Feature flags

def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    pass  # new code added here...YOLO

Page 38: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Feature flags

import os

ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    if ff["do_something_ver"] == 2:
        rtn = do_something_2()
    else:
        rtn = do_something()
    print(rtn)

def do_something():
    pass  # run some code

def do_something_2():
    pass  # new way to do something
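
To make flags dynamic later on, one option is to keep them in memory and re-read the file only when it changes on disk. The sketch below assumes a JSON flags file; the `FeatureFlags` class is a hypothetical helper, not the `read_config` function from the slide.

```python
import json
import os


class FeatureFlags:
    """Keep flags in memory and re-read the file only when it changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._flags = {}

    def get(self, name, default=False):
        self._reload_if_changed()
        return self._flags.get(name, default)

    def _reload_if_changed(self):
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except OSError:
            return  # keep the last known flags if the file is missing
        if mtime != self._mtime:
            with open(self.path) as fh:
                self._flags = json.load(fh)
            self._mtime = mtime
```

With this in place, `ff.get("do_something_ver", 1) == 2` picks the new code path as soon as the file is updated, without a restart.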

Page 39: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Dark deploys

Test new features and functionality with real users

They won’t know that anything has changed

Runs the old and new code and checks output

Works well where concurrency is easy

Feature flags can be useful

Page 40: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Dark deploys

import os

ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    rtn = do_something()

    if ff["run_do_something_2"]:
        rtn2 = do_something_2()
        if rtn != rtn2:
            log.error("do_something and do_something_2 do not match! {0} != {1}".format(rtn, rtn2))

    print(rtn)
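
The version above runs the new code inline, so a crash or slowdown in it would reach the user. A sketch of a more defensive variant, using a thread pool from Python's standard library, is below. `dark_call` is an illustrative name, and the one-second timeout is an arbitrary assumption.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("dark_deploy")
_pool = ThreadPoolExecutor(max_workers=4)


def dark_call(old_func, new_func, *args):
    """Serve the old result; run the new code alongside and compare."""
    shadow = _pool.submit(new_func, *args)
    result = old_func(*args)
    try:
        shadow_result = shadow.result(timeout=1.0)
        if shadow_result != result:
            log.error("results do not match: %r != %r", result, shadow_result)
    except Exception:
        # A crash or timeout in the new code must never reach the user.
        log.exception("new code path failed")
    return result
```

This still waits for the comparison before returning, which can add latency; a production version might hand the comparison to a background task instead.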

Page 41: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Loose coupling

Graceful degradation

Services continue to run when dependency services fail

Output might not be complete but will be as complete as possible

Third party apps with issues won’t take down your app

Important for both backend and frontend

Common with data stores
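
Graceful degradation can be as simple as treating some dependencies as optional. In the sketch below, the page is still built when the recommendations call fails; the function and field names are hypothetical.

```python
def build_profile_page(user_id, fetch_user, fetch_recommendations):
    """Return as complete a page as possible when a dependency fails."""
    page = {"user": fetch_user(user_id), "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception:
        # The optional dependency is down: serve the page without it.
        page["degraded"] = True
    return page
```

The `degraded` marker lets the frontend hide the empty section and gives monitoring something to count.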

Page 42: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Circuit breakers

Keep track of issues with external services and short circuit calls to them

Design pattern that’s becoming more popular

Netflix Hystrix -- Java
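
A bare-bones circuit breaker can be sketched in a few lines. This is an illustration of the pattern only and does not reproduce Hystrix's API; the class name, the failure threshold, and the reset period are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is short-circuited instead of attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the struggling service receives no traffic from this caller, which gives it room to recover and keeps the caller from blocking on timeouts.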

Page 43: Reliability Patterns for Distributed Applications

I’m bored! What’s cool? Chaos engineering

Inject faults into your production traffic to test your app

Tests how your apps truly cope with issues before they happen

Helps make sure that devs and ops understand the app

Only runs during business hours
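
At the smallest scale, fault injection can be a wrapper that makes a small fraction of calls fail, and only while people are at work to respond. The sketch below is a toy version under stated assumptions: the names, the 1% rate, and the Monday to Friday, 09:00 to 17:00 definition of business hours are all illustrative.

```python
import random
from datetime import datetime


class InjectedFault(Exception):
    """Marks a failure that was injected on purpose."""


def in_business_hours(now):
    """Monday to Friday, 09:00-17:00 local time (an assumed definition)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def with_chaos(func, failure_rate=0.01, rng=random.random, clock=datetime.now):
    """Wrap func so a small fraction of calls fail during business hours."""
    def wrapper(*args, **kwargs):
        if in_business_hours(clock()) and rng() < failure_rate:
            raise InjectedFault("chaos: injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Raising a dedicated exception type makes injected failures easy to tell apart from real ones in logs and dashboards.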

Page 44: Reliability Patterns for Distributed Applications

Reliability doesn’t magically happen!

Page 45: Reliability Patterns for Distributed Applications

Reliability doesn’t magically happen

It must be worked on

It must be prioritized properly and not just assumed to happen organically

Page 46: Reliability Patterns for Distributed Applications

Further reading

Page 47: Reliability Patterns for Distributed Applications

Further reading

Continuous Delivery: Reliable Software Releases through Build, Test and Deployment Automation (Humble and Farley)

http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912

Page 48: Reliability Patterns for Distributed Applications

Further reading

The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Vol 2 (Limoncelli, Chalup, Hogan)

http://www.amazon.com/Practice-Cloud-System-Administration-Distributed/dp/032194318X

Page 49: Reliability Patterns for Distributed Applications

Further reading

http://martinfowler.com/

http://www.devopsweekly.com/ (weekly newsletter of articles)

https://blog.cloudflare.com/

https://blog.twitter.com/engineering

http://highscalability.com/

