Post on 16-Feb-2017
Reliability Patterns for Distributed Applications
Andrew Hamilton
Reliability Patterns for Web Applications
Andrew Hamilton
$ whoami
Site Reliability Engineer
Development and Operations but NOT a DevOps Engineer
Developer productivity
Zefr, Prevoty, Twitter, Eucalyptus, CSUN PTG
What is reliability?
Your application working when your users need it
A user’s #1 unstated feature request
Your application telling you when things aren’t
working and being able to fix things quickly
Reliability does not completely remove failure
Failure will happen no matter what you do
Perfection is not an obtainable goal
Deal with failure gracefully and reduce the impact of failures
Reduce the chance of failure by building repeatable and
reliable automated processes
Where should you begin?
Where should you begin? Build your app
Build packages for your code (zips/tarballs, RPMs/Debs,
containers)
Automate builds with a CI environment (Jenkins, TravisCI)
Where should you begin? Test your app
Automate testing of your app
Unit tests should be easy to run and quick (< 10m)
Functional tests can take longer and can become less reliable
Manual testing still has a place, but keep it to a minimum
Where should you begin? App deployment
Automate the entire process from VM/Container setup to app
deployment
Make it multi environment (dev, stage, prod)
Make it one command
Needs to be repeatable and reliable
Where should you begin? Configuration
App configurations should be easy to change
Don’t hardcode values that should be configurable
12 factor apps
Config files (YAML, JSON, key:value)
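A minimal sketch of the configuration advice above, assuming a JSON config file whose path comes from an environment variable, with individual values overridable from the environment. The names `load_config`, `APP_CONF`, `APP_PORT`, `APP_LOG_LEVEL`, and the default values are hypothetical and only illustrate the idea:

```python
import json
import os

# Defaults live in one place instead of being scattered through the code.
DEFAULTS = {"port": 8080, "log_level": "info"}


def load_config(path=None, environ=None):
    """Merge defaults, an optional JSON file, and environment overrides."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    path = path or environ.get("APP_CONF")  # hypothetical variable name
    if path and os.path.exists(path):
        with open(path) as fh:
            config.update(json.load(fh))

    # Environment variables win, in the spirit of twelve-factor apps.
    if "APP_PORT" in environ:
        config["port"] = int(environ["APP_PORT"])
    if "APP_LOG_LEVEL" in environ:
        config["log_level"] = environ["APP_LOG_LEVEL"]
    return config
```

Passing `environ` explicitly keeps the function easy to test; in the app itself, `load_config()` with no arguments reads the real environment.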
Where should you begin? DevOps
Communication is key for reliability
Make sure that people in development and operations know
what’s happening with your app
But really this isn’t enough...
Where should you begin? DevOps
Communication is key for reliability
Make sure that people in development, operations, product
management, testing, security, design, marketing, management
know what’s happening with your app
Make sure that other teams know when something is happening
that may affect their app
What’s next?
What’s next? Logging
Find a logging format and standardize on it
Try to find an easy to understand, structured logging format
Make sure your logger is leveled (Debug, Info, Error, Panic)
Expect to use log messages at 3am
What’s next? Logging
func myFunc() {
    rtn, err := doSomething(val1, val2)
    if err != nil {
        log.Print(err) // Don’t do this! The bare error carries no context.
    }
    // ... use rtn ...
}
What’s next? Logging
func myFunc() {
    rtn, err := doSomething(val1, val2)
    if err != nil {
        // Say what failed and where, then include the underlying error.
        log.Printf("doSomething call failed in myFunc: %s", err)
    }
    // ... use rtn ...
}
What’s next? Logging
time=2012-11-24T17:32:23.3435 type=error func=myFunc host=host1 line=4 msg="doSomething call failed in myFunc: Error marshaling JSON"

{
  "time": "2012-11-24T17:32:23.3435",
  "host": "host1",
  "type": "error",
  "func": "myFunc",
  "line": 4,
  "msg": "doSomething call failed in myFunc: Error marshaling JSON"
}
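One possible way to produce JSON log lines like the one above, sketched with Python's standard `logging` module. The `JSONFormatter` class and `make_logger` helper are hypothetical names, and the field names simply mirror the slide's example:

```python
import json
import logging
import socket
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "host": socket.gethostname(),
            "type": record.levelname.lower(),
            "func": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


def make_logger(name="app", stream=None):
    """Build a leveled logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JSONFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # debug messages are dropped at this level
    logger.propagate = False
    return logger
```

Calling `make_logger().error("doSomething call failed in myFunc: %s", err)` then emits a single structured line that a log aggregator can parse without custom patterns.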
What’s next? Aggregate Logging
One place to view all of your app’s logs
With structured logging you can pull metrics out of your logs
ELK stack - Elasticsearch, Logstash, Kibana
Splunk
What’s next? Monitoring
https://twitter.com/sadserver/status/689588269047132160
What’s next? Monitoring
Needs to be relatively real time (sub 15s)
Start with standard metrics on all requests (counts, latencies)
Add more metrics where you need them
Create a dashboard with important info
StatsD/Graphite/Grafana, Prometheus, DataDog, Netuitive
Nagios is not sufficient for application monitoring
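What’s next? Monitoring
from time import time

from flask import Flask, g
from datadog import statsd  # assumption: a DogStatsD-style client that accepts tags

app = Flask(__name__)

@app.before_request
def before_request():
    g.request_time = time()

@app.after_request
def after_request(response):
    total_time = (time() - g.request_time) * 1000  # latency in milliseconds
    statsd.timing("app.latency", total_time, ["name:app"], 1)
    statsd.increment("app.request", 1, ["name:app", "status_code:{0}".format(response.status_code)], 1)
    return response  # Flask requires after_request handlers to return the response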
What’s next? Alerting
Uses the monitoring system’s data to make sure the app is
healthy
Sends out emails to the on-call dev or ops when issues occur
Creating good alerts requires knowledge of the app
PagerDuty, BigPanda, VictorOps
An area that still needs some work
What’s next? Remove state
State is something like session information
Move it to an external store that all servers can access
In-memory stores are the norm (memcached, Redis)
Allows you to horizontally scale your app behind a LB
What’s next? Have more than 1 of everything
You need more than one instance of your service
It shouldn’t just be a primary/backup either
Remove your single points of failure as quickly as possible
What’s next? Retries and backoff
Things can fail from time to time
Resending a request can be helpful
Be careful not to DDoS another app because it went down and
came back
Exponential backoff is good
What’s next? Retries and backoff
import time

def my_func(val1, val2):
    data = None
    err = None
    for n in range(10):
        data, err = get_data(val1, val2)
        if err is None:
            break
        time.sleep((2 ** n) / 1000)  # sleep for 2^n milliseconds

    if err is not None:
        return None, err

    return do_something(data)
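A variation on the retry loop above, sketched as a reusable helper. Unlike the slide, it uses exceptions instead of error return values, and it adds a cap and random jitter, two refinements the talk does not mention but which are commonly paired with exponential backoff to avoid synchronized retries. The name `retry_with_backoff` and its parameters are hypothetical:

```python
import random
import time


def retry_with_backoff(func, attempts=5, base=0.001, cap=1.0, sleep=time.sleep):
    """Call `func` until it succeeds or `attempts` is exhausted.

    Waits a random time in [0, min(cap, base * 2**n)] seconds between tries
    ("full jitter"), so a recovering service is not hit by synchronized retries.
    """
    for n in range(attempts):
        try:
            return func()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, min(cap, base * 2 ** n)))
```

The `sleep` parameter exists only so that tests can record delays instead of actually waiting.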
I’m bored! What’s cool?
I’m bored! What’s cool? Canary deploys
“Canary in the coal mine”
Deploy new code to a single instance
Watch that instance with your monitoring stack
Add more new instances, remove old instances gradually
Helps ensure that a release is good before it takes all traffic
Can be automated
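As a rough illustration of how the "watch that instance" step might be automated, the sketch below compares a canary's error rate against the existing fleet's. The function name, the 1% tolerance, and the minimum request count are all assumptions chosen for the example, not values from the talk:

```python
def canary_is_healthy(baseline_errors, baseline_requests,
                      canary_errors, canary_requests,
                      tolerance=0.01, min_requests=100):
    """Return True if the canary's error rate is within `tolerance` of the baseline.

    Returns False when the canary has not yet served `min_requests`, so a
    quiet canary is never promoted by accident.
    """
    if canary_requests < min_requests:
        return False
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate + tolerance
```

A deploy script could poll the monitoring system, call this check, and either add more new instances or roll the canary back.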
I’m bored! What’s cool? Microservices
The Unix philosophy brought to apps
Each service does only one thing
Requires a good build and deployment system
Requires monitoring, logging, alerting
Monolith → microservices
I’m bored! What’s cool? Feature flags
Allows features to be turned on and off inside the code base
Start off with a configuration file
Make sure to read the configuration into memory
Can be left in after testing or removed
Can be dynamic eventually
I’m bored! What’s cool? Feature flags
def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    # run some code
    pass
I’m bored! What’s cool? Feature flags
def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    # new code added here...YOLO
    pass
I’m bored! What’s cool? Feature flags
import os

# Read the flags once at startup so they are held in memory.
ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    if ff["do_something_ver"] == 2:
        rtn = do_something_2()
    else:
        rtn = do_something()
    print(rtn)

def do_something():
    # run some code
    pass

def do_something_2():
    # new way to do something
    pass
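The slide does not show `read_config`. A minimal sketch of what such a helper might look like, assuming the flags file is a flat JSON object; the `defaults` parameter and the fallback behavior are additions for illustration:

```python
import json


def read_config(path, defaults=None):
    """Load feature flags from a JSON file into a plain dict.

    Falls back to `defaults` when the file is missing or malformed, so a bad
    flags file cannot take the app down.
    """
    flags = dict(defaults or {})
    try:
        with open(path) as fh:
            loaded = json.load(fh)
    except (OSError, ValueError):  # missing file or invalid JSON
        return flags
    if isinstance(loaded, dict):
        flags.update(loaded)
    return flags
```

With defaults such as `{"do_something_ver": 1}`, the old code path stays active until the flags file explicitly selects the new one.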
I’m bored! What’s cool? Dark deploys
Test new features and functionality with real users
They won’t know that anything has changed
Runs both the old and new code and compares the output
Works best where concurrency is easy, so the new code can run alongside the old
Feature flags can be useful
I’m bored! What’s cool? Dark deploys
import logging
import os

log = logging.getLogger(__name__)
ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    rtn = do_something()

    if ff["run_do_something_2"]:
        # Run the new code too, but only log differences; users still get rtn.
        rtn2 = do_something_2()
        if rtn != rtn2:
            log.error("do_something and do_something_2 do not match! {0} != {1}".format(rtn, rtn2))

    print(rtn)
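One way the comparison could be moved off the request path, sketched with a thread pool from the standard library. The helper names `dark_call` and `_compare` are hypothetical, and returning the future is only a convenience for tests and debugging:

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger(__name__)
_pool = ThreadPoolExecutor(max_workers=4)


def _compare(old_result, new_func, args):
    """Run the new code path and log any mismatch; never raise to the caller."""
    try:
        new_result = new_func(*args)
    except Exception:
        log.exception("dark path %s raised", new_func.__name__)
        return False
    if new_result != old_result:
        log.error("dark path mismatch: %r != %r", old_result, new_result)
        return False
    return True


def dark_call(old_func, new_func, *args, enabled=True):
    """Return old_func's result; compare new_func against it in the background."""
    result = old_func(*args)
    future = _pool.submit(_compare, result, new_func, args) if enabled else None
    return result, future
```

Because the new path runs after the old result is already available, a slow or crashing `new_func` does not delay or break the response the user sees.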
I’m bored! What’s cool? Loose coupling
Graceful degradation
Services continue to run when dependency services fail
Output might not be complete but will be as complete as possible
Third party apps with issues won’t take down your app
Important for both backend and frontend
Common with data stores
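A small sketch of graceful degradation, assuming a page that needs user data but can live without recommendations. The function and field names are invented for the example:

```python
import logging

log = logging.getLogger(__name__)


def build_profile_page(user_id, fetch_user, fetch_recommendations):
    """Assemble a page, treating recommendations as an optional dependency.

    `fetch_user` is required: if it fails, the error propagates.
    `fetch_recommendations` is optional: if it fails, the page is still
    returned with an empty list and a flag marking it as degraded.
    """
    page = {"user": fetch_user(user_id), "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception:
        log.warning("recommendations unavailable for user %s", user_id, exc_info=True)
        page["degraded"] = True
    return page
```

The output is as complete as possible: a failing recommendation service costs the user one section of the page rather than the whole page.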
I’m bored! What’s cool? Circuit breakers
Keep track of issues with external services and short circuit calls
to them
Design pattern that’s becoming more popular
Netflix Hystrix -- Java
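To make the pattern concrete, here is a deliberately small circuit breaker sketch. It is not how Hystrix is implemented; the class name, thresholds, and the single-trial "half-open" behavior are assumptions chosen to keep the example short:

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each external request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a service that is known to be down.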
I’m bored! What’s cool? Chaos engineering
Inject faults into your production traffic to test your app
Tests how your apps truly cope with issues before they happen
Helps make sure that devs and ops understand the app
Only run it during business hours
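At the smallest scale, fault injection can be sketched as a decorator that makes a call fail some fraction of the time. Real chaos tooling works at the infrastructure level; the names `inject_faults` and `InjectedFault` and the `enabled` hook are hypothetical and only illustrate the idea:

```python
import functools
import random


class InjectedFault(Exception):
    """Raised in place of a real call when a fault is injected."""


def inject_faults(rate, enabled=lambda: True, rng=random.random):
    """Decorator that makes a call fail with probability `rate` while enabled."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng() < rate:
                raise InjectedFault("injected fault in " + func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

The `enabled` callable is where a business-hours check or a feature flag could be plugged in, so faults are only injected while people are around to respond.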
Reliability doesn’t magically happen!
It must be worked on
It must be prioritized properly and not just assumed
to happen organically
Further reading
Continuous Delivery: Reliable Software Releases through Build, Test and Deployment
Automation (Humble and Farley)
http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912
The Practice of Cloud System Administration: Designing and Operating Large
Distributed Systems, Vol 2 (Limoncelli, Chalup, Hogan)
http://www.amazon.com/Practice-Cloud-System-Administration-Distributed/dp/032194318X
http://martinfowler.com/
http://www.devopsweekly.com/ (weekly newsletter of articles)
https://blog.cloudflare.com/
https://blog.twitter.com/engineering
http://highscalability.com/