Date post: 19-Jun-2020
Lies, Damned lies, and timeouts Coping with failures in datacenters Yao Yue
Transcript
Page 1: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Lies, Damned lies, and timeouts

Coping with failures

in datacenters

Yao Yue

Page 2:
Page 3:

Who am I?

Cache @ Twitter

Now working on performance in general

@thinkingfish

Page 4:

I’ve been telling my coworkers the same things for years…

Why do I want to give this talk?

Page 5:

What do I know?

▸ My views are heavily influenced by two things:

▸ In-memory, datacenter-scale caching

▸ Twitter’s environment

Page 6:

Cache in datacenters

▸ Distributed in-memory KV store

▸ Serving many, many requests

▸ with very tight latency expectation

Page 7:

Twitter’s runtime

▸ Architecture: monolithic -> SOA/Microservices

▸ Owns datacenters

▸ Services mostly on JVMs

▸ Jobs deployed with containers

▸ Scale: up to many thousands of nodes per service

Page 8:

Living in an imperfect world

Page 9:

Consensus over an unreliable link is UNSOLVABLE

Two unhappy generals

?

Two Generals' Problem / Paradox

Page 10:

timeouts & retries

The engineering approach to cope with failures

Page 11:

timeouts & retries

The engineering approach to cope with failures

Page 12:

Redundant requests

The engineering approach to cope with failures

Page 13:

Further mitigation

The engineering approach to cope with failures

YOU KNOW WHAT? FORGET ABOUT IT…

Page 14:

Coping with failure

Timeouts, Retries, and preventions

Page 15:

Coping with failure

Request

Retry (N)

Failure Prevention

Page 16:

Timeout and retry

▸ Timeout: an approximation of failure

▸ False positives are possible

▸ Retry: an effort to reduce failure

▸ Has a cost

▸ Has an effect on the system

Page 17:

Timeout and retry get close to the heart of the difficulty of distributed systems, and yet we treat them casually because they’re often presented as configuration.

Page 18:

What works intuitively may lead to catastrophes.

Page 19:

Timeout

Page 20:

Timeout could be misleading

TIMEOUT INFORMS ME ABOUT COMMUNICATION TO REMOTE SERVICE

SERVICE A SERVICE B

???

Page 21:

One service

SERVICE

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

Page 22:

SERVICE A

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

SERVICE B

LIBRARY+VM

KERNEL

HARDWARE

One service Another service

Page 23:

SERVICE A

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

SERVICE B

DO NOT

CARE

One service Another service

Page 24:

One service

▸ SERVICE: resource contention, head-of-line blocking, locking…

▸ LIBRARY+VM: garbage collection, function calls with nondeterministic timing…

▸ KERNEL: system calls with nondeterministic timing, background tasks, unfair scheduling…

▸ HARDWARE: blocking IO, congestion, cycle stealing…

▸ NETWORK: packet drop, queue backup…

Page 25:

Multitenant runtime

SERVICE A

LIBRARY+VM

KERNEL

HARDWARE

OTHER SERVICE

OTHER SERVICE

Page 26:

Multi-tenancy with Noisy Neighbors

OTHER SERVICE

OTHER SERVICE

KERNEL

HARDWARE

SERVICE A

LIBRARY+VM

Page 27:

Timeout cascade

…SERVICE A

SERVICE B

SERVICE C

Accountability Gap

Accountability Gap

Page 28:

Inconvenient truths about timeout

▸ Timeouts often do not indicate remote service health

▸ The optimal timeout is a moving target

▸ Timeouts are often less predictable in a shared environment

▸ Timeouts leave gaps in the overall timeline

Page 29:

Chained dependencies

SERVICE A SERVICE C

SERVICE B

Q: A’s requests time out, but not B’s, which on-call engineer(s) should be paged?

Page 30:

Chained dependencies

SERVICE A SERVICE C

SERVICE B

Q: A’s requests time out, so do B’s, which on-call engineer(s) should be paged?

Page 31:

Oftentimes modern application architecture is a mesh / Mess

Page 32:

correlation != causality

Page 33:

timeouts != causality

Page 34:

Retry

Page 35:

Retries carry risks

Request

Retry (N)

RETRIES IMPROVE THE SUCCESS RATE OF SERVICE???

Page 36:

Retry is extra work

2x requests

Page 37:

Retry is extra work

2x requests 2x responses

Page 38:

Retry != Replay

▸ Potential behavioral change

▸ State change

▸ Hidden retries in lower stack

▸ e.g. TCP

Page 39:

Other retry key decisions

▸ How many times?

▸ How to set timeout for each retry?

▸ How much delay between retries?

▸ Where to send it?
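
The four questions above map onto four knobs. A minimal sketch, assuming a hypothetical `send(host, timeout)` transport function that raises `TimeoutError` when a try takes too long; the class name, defaults, and the round-robin host choice are illustrative assumptions, not something the talk prescribes:

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RetryPolicy:
    """Illustrative knobs; the default values are placeholders."""
    max_tries: int = 2                # how many times (first try included)
    per_try_timeout_s: float = 0.15   # timeout for each try
    backoff_min_s: float = 0.005      # delay between tries is drawn
    backoff_max_s: float = 0.2        # uniformly from [min, max]


def call_with_retries(
    send: Callable[[str, float], str],
    hosts: Sequence[str],
    policy: RetryPolicy,
    sleep: Callable[[float], None],
    rng: random.Random,
) -> str:
    """Call send(host, timeout) until it succeeds or tries run out.

    `send` is a hypothetical transport function. Each try goes to the
    next host in `hosts` (round robin), one possible answer to
    "where to send it".
    """
    if policy.max_tries < 1 or not hosts:
        raise ValueError("need at least one try and one host")
    last_error: Optional[TimeoutError] = None
    for attempt in range(policy.max_tries):
        host = hosts[attempt % len(hosts)]
        try:
            return send(host, policy.per_try_timeout_s)
        except TimeoutError as exc:
            last_error = exc
            # Back off only if another try will follow.
            if attempt + 1 < policy.max_tries:
                sleep(rng.uniform(policy.backoff_min_s, policy.backoff_max_s))
    assert last_error is not None
    raise last_error
```

With a stateful backend and a fixed route, `hosts` would hold a single entry, so "where to send it" has only one answer and the remaining knobs carry all the weight.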

Page 40:

Inconvenient truths about retry

▸ Retries can create a positive feedback loop

▸ Retries can change system state and behavior

▸ Sophisticated retry configuration has many knobs

Page 41:

Chained dependencies

SERVICE A SERVICE C

SERVICE B

If C is slow…

Page 42:

Chained dependencies

SERVICE A SERVICE C

SERVICE B

If B is slow…

Page 43:

Retry at scale: “How to DDoS a cache?”

▸ Timeout ⇒ connection teardown

▸ Large number of clients

▸ Stateful backend, fixed route

▸ Full connectivity mesh

▸ Variable connections per route

▸ Container-based deploy

Page 44:

Transient hotspot ⇒ initial timeouts (closing connections) ⇒ connection storm (fixed route) ⇒ massive timeouts (full mesh) ⇒ bigger storm (multiple, variable connections) ⇒ NIC saturation / backend offline ⇒ frontend failing, site down

Retry at scale: “How to DDoS a cache?”

Page 45:

Service mesh can make retries significantly worse

▸ Top-level retries affect whole system

▸ Bigger multipliers toward the bottom

▸ Hard to predict bottleneck
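
The "bigger multipliers toward the bottom" point is plain multiplication. A small worked sketch, assuming a simple linear call chain in which every layer exhausts its tries because the bottom is timing out (the function name is hypothetical):

```python
from typing import List, Sequence


def worst_case_load_per_layer(tries_per_layer: Sequence[int]) -> List[int]:
    """Worst-case requests arriving at each layer per top-level request.

    tries_per_layer[i] is the maximum number of tries (first try plus
    retries) that the caller of layer i makes. When every try times
    out, each layer multiplies the load it received from above.
    """
    loads = []
    multiplier = 1
    for tries in tries_per_layer:
        if tries < 1:
            raise ValueError("each layer makes at least one try")
        multiplier *= tries
        loads.append(multiplier)
    return loads
```

For a chain three hops deep with up to 3 tries per hop, `worst_case_load_per_layer([3, 3, 3])` gives `[3, 9, 27]`: the service at the bottom can see 27 requests for a single user request, at exactly the moment it is least able to absorb them.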

Page 46:

Wrong implicit assumption, single input, single feedback loop

Problems with naive failure coping mechanisms

Page 47:

How to improve?

Page 48:

Address the positive feedback loop

[Diagram: Timeout and Retry in a feedback loop, both edges marked “+”]

Page 49:

Slow down, retry less

[Diagram: Timeout and Retry in a feedback loop, both edges marked “+”]

Page 50:

Introduce another feedback loop

[Diagram: Timeout and Retry in a feedback loop, both edges marked “+”, with an added edge marked “-”]

Page 51:

Good timeouts

▸ Minimize false positives caused by local stack

▸ Attempt to root cause with other signals
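
One way to act on both bullets is to discount time the caller itself lost before blaming the remote side. A minimal sketch, assuming the local pause during the request (a GC pause, a scheduling stall) can be measured by some other signal; the function name and labels are hypothetical:

```python
def classify_timeout(elapsed_s: float, timeout_s: float, local_pause_s: float) -> str:
    """Label a request outcome using a locally observed pause.

    local_pause_s is an assumed input: time the local process is known
    to have been stalled while the request was in flight.
    """
    if elapsed_s < timeout_s:
        return "ok"
    # The request would have met its timeout if the caller had not stalled.
    if elapsed_s - local_pause_s < timeout_s:
        return "suspect-local"
    return "suspect-remote-or-network"
```

The labels are hints for root-causing, not verdicts: a timeout tagged "suspect-local" is a poor reason to mark the remote host unhealthy or to retry against it.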

Page 52:

Good retries

▸ Customize for purpose of request

▸ Be as conservative as possible

▸ Circuit breakers for timeout-induced retries

▸ State-based rules

▸ Back-pressure
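
A circuit breaker is a state-based rule that stops sending after repeated failures and resumes later. A simplified sketch with an injectable clock; the class name and default thresholds are illustrative assumptions, and the half-open probing that production breakers often use is left out:

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; close again after a cool-off."""

    def __init__(
        self,
        failure_threshold: int = 5,       # placeholder value
        revive_after_s: float = 30.0,     # placeholder value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._revive_after_s = revive_after_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._revive_after_s:
            # Cool-off elapsed: close the breaker and count afresh.
            self._opened_at = None
            self._consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._consecutive_failures = 0

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._opened_at is None and self._consecutive_failures >= self._threshold:
            self._opened_at = self._clock()
```

Callers check `allow_request()` before each try or retry and fail fast when it returns `False`, which removes the retry traffic that would otherwise feed the loop.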

Page 53:

Manage impact across services

▸ Enforce top-down budget

▸ Apply and act on back pressure

Using microservices? Your task just got much harder…

▸ Tracing helps
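
One reading of a top-down budget is a time budget fixed at the top of the call tree and handed down, so that no downstream call waits longer than the caller still cares. A minimal sketch under that assumption; the `Deadline` class is hypothetical, and passing it between services (for example in a request header) is not shown:

```python
import time
from typing import Callable


class Deadline:
    """Overall time budget for a request, shared by all calls beneath it."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining_s(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, preferred_timeout_s: float) -> float:
        """Clip a per-call timeout to whatever budget is left.

        Raises TimeoutError instead of starting a call that cannot
        finish within the overall budget.
        """
        remaining = self.remaining_s()
        if remaining <= 0.0:
            raise TimeoutError("overall budget exhausted")
        return min(preferred_timeout_s, remaining)
```

A retry loop that asks `timeout_for_call()` before every try stops on its own once the top-level budget is spent, whatever its configured retry count says.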

Page 54:

drill, baby, drill

▸ Test common failure scenarios

▸ Test failures at different parts

▸ Test partial failures

▸ Test relaxed timeout/retry/escalation config

Page 55:

Example: a typical cache config

(Guide w/ examples, 2K+ words excl. code)

▸ Overall timeout: 500ms*

▸ Request timeout: 150ms*

▸ Connect timeout: 200ms

▸ Pipelining: request timeout does not reset connection

▸ Read retry: 2 tries*, no write-back, no backoff

▸ Write retry: 3 tries*, random backoff (5-200ms)

▸ Overall retry budget: 20% of requests, minimum 10 retries per second (10-sec credit window)

▸ Blackhole: 5 consecutive failures*, revive after 30 seconds

▸ Centralized topology manager, changes dampened >1min

CACHE SLO: P999 <5MS
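
The overall retry budget line can be read as a sliding-window rule: within the last 10 seconds, retries may not exceed 20% of requests plus a floor of 10 per second. A sketch of that reading; the class and method names are hypothetical, and this is not presented as the implementation behind the slide:

```python
import time
from collections import deque
from typing import Callable, Deque


class RetryBudget:
    """Sliding-window retry budget (illustrative sketch)."""

    def __init__(
        self,
        percent_can_retry: float = 0.2,
        min_retries_per_sec: float = 10.0,
        window_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ratio = percent_can_retry
        self._min_per_sec = min_retries_per_sec
        self._window_s = window_s
        self._clock = clock
        self._requests: Deque[float] = deque()  # timestamps of first tries
        self._retries: Deque[float] = deque()   # timestamps of granted retries

    def _expire(self, now: float) -> None:
        cutoff = now - self._window_s
        for events in (self._requests, self._retries):
            while events and events[0] <= cutoff:
                events.popleft()

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        now = self._clock()
        self._expire(now)
        self._requests.append(now)

    def try_acquire_retry(self) -> bool:
        """Return True and spend budget if a retry is allowed right now."""
        now = self._clock()
        self._expire(now)
        allowed = self._ratio * len(self._requests) + self._min_per_sec * self._window_s
        if len(self._retries) + 1 <= allowed:
            self._retries.append(now)
            return True
        return False
```

Storing one timestamp per event keeps the sketch short; at cache request rates a real implementation would more likely keep bucketed counters. The floor matters at low traffic, where 20% of very few requests would otherwise allow no retries at all.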

Page 56:

Reduce variance

Making timeout/retry easier

Page 57:

Example: Making cache more predictable

▸ Staying off shared hosts

▸ Rate limiting new connections with iptables

▸ Setting CPU affinity for sirq, application

▸ Removing shared locks

▸ All blocking IO on background threads

Page 58:

There’s no magic: important to know when to give up

?

Page 59:

Understand common anomalies, Break the loop, Form a bigger picture,

and you’ll be happy most of the time.

Page 60:

Questions?

