Why did we think large scale distributed systems would be easy?
Gordon Rowell
PuppetConf San Francisco 2013
gordonr@google.com
Background
Site Reliability Engineering runs many services. The same rules always apply:
● Make the service scale
● Make the deployment consistent
● Understand all layers of the system
● Monitor everything
● Plan for failure
● Break things, under controlled conditions
Scaling is fun
We don't deploy "a server"
• Servers break, power fails
• Clients/DNS need to be reconfigured
We don't deploy "a cluster"
• Networks break, servers break, power fails
• Clients/DNS need to be reconfigured
We deploy redundant clusters
• Attempt to send clients to the nearest serving cluster
• Anycast allows for unified client configuration
But client DoS is not
Poorly written code...
● on small numbers of clients...
● is annoying
Poorly written code...
● on a huge number of clients...
● can cause serious infrastructure pain
Write good code and stage your releases
● Work with the service owners
● Stage rollouts, allow soak time
● Have a rollback plan for clients and test it
● Have DoS limits for services, and test them (sketch below)
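A minimal token-bucket sketch of the kind of per-client DoS limit the last bullet suggests. The class name, rates, and limits here are illustrative assumptions, not anything from the talk:

import time

class TokenBucket:
    """Per-client token bucket: refuse requests once a client exceeds
    its sustained rate plus a small burst allowance."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop the request or return 429

# Illustrative limit: 5 requests/second with a burst of 10 per client.
bucket = TokenBucket(rate_per_sec=5, burst=10)
print(bucket.allow())  # True until the bucket drains

The point of testing the limit is the same as testing the rollback plan: a DoS limit you have never tripped on purpose will surprise you the first time a bad client release trips it for you.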
Load balancing is fun
Do you have enough capacity?
• How many backends do you need?
• What happens if half of your backends lose power?
• What about when half are already out for repairs? (worked example below)
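A back-of-the-envelope sketch of that capacity question, with all numbers made up for illustration: if half the fleet can be out for repairs and half of the remainder can lose power, only a quarter of provisioned backends may be serving at peak.

import math

# Hypothetical figures, for illustration only.
peak_qps = 120_000        # worldwide peak load
qps_per_backend = 1_000   # measured capacity of one backend
repair_fraction = 0.5     # half the fleet out for repairs
power_fraction = 0.5      # half of the rest loses power

serving_needed = peak_qps / qps_per_backend   # 120 backends to serve peak
provisioned = serving_needed / ((1 - repair_fraction) * (1 - power_fraction))
print(math.ceil(provisioned))                 # 480 backends to provision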
How do you send clients to the right cluster?
• Client configuration
• DNS round-robin (simple global load balancing)
• DNS views (give best answer for client IP)
• Anycast (portable IP, routed to "nearest" cluster)
• Consider: DNS views plus Anycast (sketch below)
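A toy sketch of the "DNS views plus Anycast" combination. The networks and addresses below are documentation ranges invented for illustration: answer with the nearest cluster's address when the client's source network is known, and fall back to the anycast address so routing decides otherwise.

import ipaddress

# Invented mapping from client networks to their nearest cluster's VIP;
# a DNS view gives a different answer depending on the client's source IP.
VIEWS = {
    ipaddress.ip_network("10.0.0.0/8"):    "192.0.2.10",   # cluster A VIP
    ipaddress.ip_network("172.16.0.0/12"): "192.0.2.20",   # cluster B VIP
}
ANYCAST_VIP = "198.51.100.1"  # one address, routed to the "nearest" cluster

def resolve(client_ip: str) -> str:
    addr = ipaddress.ip_address(client_ip)
    for net, vip in VIEWS.items():
        if addr in net:
            return vip        # best answer for this client's network
    return ANYCAST_VIP        # unknown network: let routing decide

print(resolve("10.1.2.3"))    # -> 192.0.2.10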
But global outages are not
Monitor everything
● Health check failures bring down your service
● ...by design
Test everything
● You should expect (and test) data center outages
● A global outage can ruin your day
● Cascading failures are unpleasant
Learn from outages
● Write postmortems
● Focus on the facts!
● What went wrong and what can be better?
● A postmortem is not about blame
Thundering herds are not
For Puppet
• "Lots" of Mac desktops and laptops
• "Lots" of Ubuntu desktops, laptops and servers
• "Some" others
What if they all want to do a puppet run?
• What about every hour?
• What about every five minutes?
Randomize your cron jobs! (and test it)

How can you shed load on the server?
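One way to do that randomization is a deterministic per-host splay: each machine picks a stable offset within the run interval, so the fleet spreads out instead of all firing at once. A minimal sketch (the function and interval are illustrative; Puppet agent's own splay and splaylimit settings do the equivalent):

import hashlib, socket, time

def splay_seconds(interval: int = 3600) -> int:
    # Hash the hostname so every host gets a stable offset in [0, interval),
    # spreading runs evenly instead of all firing on the hour.
    digest = hashlib.sha256(socket.gethostname().encode()).hexdigest()
    return int(digest, 16) % interval

# Sleep before the real work so a fleet-wide cron tick doesn't become a
# thundering herd against the server.
time.sleep(splay_seconds(3600))

A uniform random sleep works too, but hashing the hostname keeps each host's slot stable between runs, which makes the load on the server predictable.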
Anycast is fun
Anycast is "coarse-grain" load balancing
• Routes traffic to the “nearest”, “serving” cluster
Networks break
• Physical issues
• Routing issues
• Configuration issues
• Load balancer bugs
Anycast monitoring is hard
Anycast directed to one site is not fun
All clients could be sent to the same cluster
• Be ready for that
• Can a single cluster handle worldwide traffic?
• What do you do if it can't?
Have a mitigation strategy to shed load
● Include load calculations early in health checks (sketch below)
● Consider DNS views to redirect some traffic
● Drop traffic if you have to
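A minimal sketch of a load-aware health check, assuming an HTTP health endpoint and a load-average threshold, both invented here: the backend reports itself unhealthy once it is overloaded, so the load balancer sheds traffic away from it.

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_LOAD_PER_CPU = 1.5  # illustrative threshold; tune per service

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report unhealthy when the 1-minute load average per CPU exceeds
        # the threshold, so the load balancer stops sending us traffic.
        # (os.getloadavg is Unix-only.)
        load1, _, _ = os.getloadavg()
        overloaded = load1 / os.cpu_count() > MAX_LOAD_PER_CPU
        self.send_response(503 if overloaded else 200)
        self.end_headers()
        self.wfile.write(b"overloaded\n" if overloaded else b"ok\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()

Note the tension with the earlier slide: health check failures bring down your service by design, so test that a fleet-wide "overloaded" signal degrades gracefully instead of marking every backend down at once.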
Diversity is good...for people
Be ruthless against platform diversity
If you can’t automate it, don’t do it

● “Could we bring up another 50 today, please?”
● “That backend was just a little different and...oops”
Anycast helps you be consistent
● Traffic could go anywhere

Every OS upgrade is a time to refactor and clean
Questions?
Gordon Rowell
gordonr@google.com