Posted on 28-Nov-2014
Operations-Driven Web Services: A Case Study of Service Evolution at Rent the Runway
Camille Fournier, Head of Engineering @skamille
Carlo Barbara, Senior Systems Engineer @CarloBarbara
In The Beginning, There Was Drupal
Product Details
Filtering
View
Users
Product Creation
Order Management
Reservations
Login
There were also all of these folks…
View
Product Creation
Order Management
Reservations
Filtering
Product Details
Users
Login
Can’t Just Burn the World Down
View
Product Creation
Order Management
Reservations
Filtering
Product Details
Users
Login
Hollow It Out!
View
Product Creation
Order Management
Filtering
Product Details
Users
Login
Hollow It Out!
View
Product Creation
Order Management
Filtering
Users
Login
Hollow It Out!
View
Product Creation
Order Management
Users
Login
Hollow It Out!
Complexity
[Chart: Number of Services in Production, Dec-11 through Jun-13; y-axis from 0 to 14]
Operations first…
Availability and performance of our services is critical to running our business
The software we develop has to make delivering on our SLAs possible
How (besides sane design):
Healthchecks + Nagios
Measurements
Historical Data with Graphs
Metrics
Gauges – instantaneous value
Counters – value that can be incremented and decremented
Meters – rate over time (mean, plus 1-, 5-, & 15-minute moving averages)
Histograms – distribution of data (mean, median, max, std. dev., 75th, 90th, 95th, 98th, 99th, & 99.9th percentiles)
Timers – Meter of requests & Histogram of duration (frequency & latency)
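To make two of these metric types concrete, here is a minimal sketch using only the Java standard library. The class names `SketchCounter` and `SketchHistogram` are hypothetical and are not the Metrics library API. It also assumes every sample is kept in memory and that percentiles are whole numbers, so it cannot express the 99.9th; the real library samples the data instead of storing all of it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical teaching classes; these are NOT the Metrics library API.

/** Counter: a value that can go up and down. */
class SketchCounter {
    private final AtomicLong value = new AtomicLong();

    void inc() { value.incrementAndGet(); }
    void dec() { value.decrementAndGet(); }
    long count() { return value.get(); }
}

/** Histogram: keeps every sample so distribution statistics can be computed. */
class SketchHistogram {
    private final List<Long> samples = new ArrayList<>();

    synchronized void update(long sample) { samples.add(sample); }

    synchronized double mean() {
        if (samples.isEmpty()) return 0.0;
        long sum = 0;
        for (long s : samples) sum += s;
        return (double) sum / samples.size();
    }

    /** Nearest-rank percentile for a whole-number percent in 1..100. */
    synchronized long percentile(int percent) {
        if (samples.isEmpty()) throw new IllegalStateException("no samples recorded");
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        // Integer ceiling of percent * n / 100 gives the 1-based rank.
        int rank = (percent * sorted.size() + 99) / 100;
        return sorted.get(rank - 1);
    }
}
```

Feeding response times into `update` and reading `percentile(95)` gives the kind of number a latency graph is built from.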
Metrics - Healthchecks
Verify that your service is running correctly
Metrics - Reporting
HTTP
JMX
Graphite
Dropwizard: What is it?
Quality open source Java webservice components glued together in a modular way
Eliminates the need to pick a platform stack; it’s all there
It’s opinionated: if you don’t like a Dropwizard core component, too bad; don’t use Dropwizard
Developers focus on business logic, not framework
It’s easy, maintainable, and it works!
A Few Words from Coda…
“I had no one I had to toss a WAR to. I had no one to stand up a Tomcat server and fiddle with it until their eyes bled. I had no one who didn't trust me to spin up my own threads or connection pools. So I wrote something which worked as simply and in as straight-forward a manner as possible because my own ass was on the line if it didn't work.”
Dropwizard: The Ingredients
Jersey for REST
Jackson for JSON
Jetty for a webserver
Metrics for measuring
YAML for configuring
Dropwizard for weaving everything together
Dropwizard – Healthchecks
Register hooks that check the health of your app
An HTTP endpoint that iterates over all the hooks
“The meaning of healthy” is decided by you (e.g. database connections, client connections, deadlock count)
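The hook-and-iterate idea can be sketched in plain Java as follows. The names `SketchHealthCheck` and `SketchHealthCheckRegistry` are hypothetical and are not Dropwizard's own classes. The convention that a hook returns null when healthy is an assumption made for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the hook-registry idea; not Dropwizard's actual classes.

/** One hook: returns null when healthy, or a message describing the problem. */
interface SketchHealthCheck {
    String check() throws Exception;
}

class SketchHealthCheckRegistry {
    private final Map<String, SketchHealthCheck> checks = new LinkedHashMap<>();

    void register(String name, SketchHealthCheck check) {
        checks.put(name, check);
    }

    /** Runs every hook; a thrown exception counts as unhealthy. */
    Map<String, String> runAll() {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, SketchHealthCheck> e : checks.entrySet()) {
            String outcome;
            try {
                String problem = e.getValue().check();
                outcome = (problem == null) ? "OK" : "FAIL: " + problem;
            } catch (Exception ex) {
                outcome = "FAIL: " + ex.getMessage();
            }
            results.put(e.getKey(), outcome);
        }
        return results;
    }

    /** An HTTP endpoint could report success only when this returns true. */
    boolean allHealthy() {
        for (String outcome : runAll().values()) {
            if (!outcome.equals("OK")) return false;
        }
        return true;
    }
}
```

A monitoring system such as Nagios then only needs to poll one endpoint per service and alert when it stops reporting success.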
Dropwizard + Metrics
Dropwizard has lots of platform instrumentation baked in using Metrics, and it happens for free (e.g. Jetty, JVM, log counts)
Ability to add Timers to your endpoints with @Timed
Ability to add arbitrary metrics as you see fit
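What a timer around an endpoint boils down to can be sketched with the standard library alone. `SketchTimer` is a hypothetical stand-in and not the Metrics Timer class. It tracks only a call count and a mean duration, where the real Timer also keeps rates and a latency distribution.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for what a timing annotation does around an endpoint;
// not the Metrics Timer class.
class SketchTimer {
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** Runs the work, recording one call and its duration even if it throws. */
    <T> T time(Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            calls.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    long count() { return calls.get(); }

    /** Mean duration in nanoseconds, or 0 before the first call. */
    double meanNanos() {
        long n = calls.get();
        return n == 0 ? 0.0 : (double) totalNanos.get() / n;
    }
}
```

Recording in a `finally` block means failed requests are timed too, which matters when slow failures are the thing being diagnosed.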
Other Frameworks
Play: 1.X is abandonware in favor of Play 2.X, which was still in beta; magic
Glassfish: OSGi hell; “standards”
Spring: everything and the kitchen sink; also, I hate XML
What do I get out of it? Dev agenda
Storytelling: causation & correlation
Integral piece of the operational excellence puzzle
State of the world – Dashboards
Developers focus on features; operations is mostly a free lunch
Code review & demo
Disclaimer: You need Graphite to really harness the value
Storytelling
The grid is slow. Why? Is it load? Is it dependent service latency? How does that compare to yesterday?
The JVM runs out of memory. What’s the problem? What does the GC jigsaw look like? When did it change? Is it correlated with increased load?
How is that new ‘performance’ tweak doing? If you never measured, then you didn’t tune. True story! What does my 5XX graph look like?
Operational Excellence: The ingredients
Application Instrumentation (Dropwizard)
Time Series Data & Graphing (Graphite, D3)
Centralized logging & log parsing (Rsyslog, Logstash, Nagios)
Automated alerting & escalation (PagerDuty)
Dropwizard & Graphite will get you very far, but if you want total control & visibility you need the rest. This is the stack that RTR is moving towards, rather than relying on basic Java logging SMTP appenders
OMG, we are on GMA, are we OK?
10+ services
Each service runs in a cluster behind a load balancer
‘OK’ is somewhat service specific
Basically you need a lot of info at your fingertips. Pictures are worth a thousand words. Get yourself some dashboards!
Graphite Dashboard
Tasseo dashboard (D3)
Red, Yellow, & Green Lights
Realtime
Endless cool things: Graphite + D3
If we see yellow or red, start diagnosing
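The red, yellow, and green lights amount to comparing the latest reading against two thresholds. A minimal sketch, assuming higher values are worse: the names `Light` and `StatusLightSketch` are hypothetical and the thresholds are chosen by you per metric.

```java
// Hypothetical names and thresholds, for illustration only.
enum Light { GREEN, YELLOW, RED }

class StatusLightSketch {
    /** Classifies a reading where higher is worse (e.g. latency or error count). */
    static Light classify(double value, double warning, double critical) {
        if (value >= critical) return Light.RED;
        if (value >= warning) return Light.YELLOW;
        return Light.GREEN;
    }
}
```

Because ‘OK’ is service specific, each service would carry its own warning and critical values.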
Free Lunch? Not really
DB connection pool monitoring
HTTP client connection pool monitoring
JVM Heap & GC info
HTTP server response counts
HTTP server connection info
Endpoint duration & throughput stats
Where do I sign up?
You install Graphite: a one-time hit plus some TLC. Medium difficulty
You annotate your endpoints and maybe add finer telemetry. Easy
You configure your service to feed into Graphite, ideally consistently across services via a ‘Bundle’. Easy
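Under the hood, feeding Graphite means sending lines of the form `path value timestamp` to its plaintext listener, which listens on port 2003 by default. The sketch below shows that format with the standard library only. `GraphiteLineSketch` and the metric path used in the usage note are hypothetical; in practice the Metrics reporter does this for you.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; in a real service the Metrics Graphite reporter does this.
class GraphiteLineSketch {
    /** Joins name parts with dots so every service names metrics the same way. */
    static String path(String... parts) {
        return String.join(".", parts);
    }

    /** Formats one sample in Graphite's plaintext form: "<path> <value> <epoch seconds>\n". */
    static String format(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    /** Sends one already formatted line to a Graphite plaintext listener. */
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(line);
            out.flush();
        }
    }
}
```

For example, `format(path("rtr", "products", "requests", "p95"), 42.5, 1417132800L)` yields the line `rtr.products.requests.p95 42.5 1417132800`. Building every path through one shared helper is the kind of consistency a shared ‘Bundle’ is meant to enforce.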
Demo
Show a simple Dropwizard codebase
Do some curls
Show the admin endpoints
References
dropwizard.codahale.com
metrics.codahale.com
graphite.wikidot.com
Presenters
@CarloBarbara (www.cabkata.com)
@skamille (whilefalse.blogspot.com)
Rent The Runway is hiring! (renttherunway.com/careers)