Cassandra: One (is the loneliest number)

Date post: 11-Apr-2017
Upload: datastax-academy
Transcript
Page 1: Cassandra: One (is the loneliest number)

2015-12-09

One (is the loneliest number)
[email protected] & [email protected]

Page 2: Cassandra: One (is the loneliest number)

2015-12-09 MAKING PAGERDUTY MORE RELIABLE USING PXC

Page 3: Cassandra: One (is the loneliest number)

2015-12-09 ONE (IS THE LONELIEST NUMBER)

Page 4: Cassandra: One (is the loneliest number)

Failure

Page 5: Cassandra: One (is the loneliest number)

Background

• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s

Page 6: Cassandra: One (is the loneliest number)

Cassandra Replication

[Diagram: Client and replicas R1, R2, R3]

Page 7: Cassandra: One (is the loneliest number)

Cassandra Replication - Failure

[Diagram: Client and replicas R1, R2, R3; R3 is marked with an X]

Page 8: Cassandra: One (is the loneliest number)

Foreshadowing

• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes

Page 9: Cassandra: One (is the loneliest number)

Coordinator Read Latency (in ms, by host)

[Chart annotations: 6 seconds; ~25 ms]

Page 10: Cassandra: One (is the loneliest number)

Coordinator Read Latency (in ms, by host)

Page 11: Cassandra: One (is the loneliest number)

Coordinator Read Latency (in ms, by host)

Page 12: Cassandra: One (is the loneliest number)

Coordinator Read Latency (in ms, by host)

Page 13: Cassandra: One (is the loneliest number)

Coordinator Read Latency (in ms, by host)

Page 14: Cassandra: One (is the loneliest number)

The next day…

Page 15: Cassandra: One (is the loneliest number)

The Plan

• Trigger repair… with lots of people watching
• Use our load shedding strategies for any problems:
  • Proactively disable non-critical services
  • Disable thrift

Page 16: Cassandra: One (is the loneliest number)

Surprise!

• Cron triggers a repair of a different keyspace
• Plus a compaction for a large CF

Page 17: Cassandra: One (is the loneliest number)

Outgoing Notification Backlog Size

[Chart labels: Normal, Bad, Horrible]

Page 18: Cassandra: One (is the loneliest number)

Outgoing Notification Backlog Size

[Chart labels: Normal, Bad, Horrible, :(]

Page 19: Cassandra: One (is the loneliest number)

Cassandra Pending Tasks: ReadStage (by host)

[Chart annotation: Over 9000]

Page 20: Cassandra: One (is the loneliest number)

Cassandra CPU (by host)

[Chart annotation: 100%]

Page 21: Cassandra: One (is the loneliest number)

Factory Reset

Success… kind of

Page 22: Cassandra: One (is the loneliest number)

What went wrong?

Page 23: Cassandra: One (is the loneliest number)

or: What can we learn from Aimee Mann?

One is the loneliest number that you'll ever do
Two can be as bad as one
It's the loneliest number since the number one

No, is the saddest experience you'll ever know
Yes, it's the saddest experience you'll ever know

Page 24: Cassandra: One (is the loneliest number)

No, is the saddest experience you’ll ever know

• Cassandra sheds load when overloaded
• Shedding drops “stale” requests
• Clients see timeouts and have trouble making progress
• Shedding only relieves load if clients abandon the failed requests
• But if clients retry those requests…

Page 25: Cassandra: One (is the loneliest number)

So I heard you like retries…

[Diagram: Interactive Request (from user), Load Balancer, App Hosts, Event Processing, Notification Management, Cassandra Clusters]

• Cass Client retries (S)
• Service client retries (T)
• Load balancer retries (H)

Retries are multiplicative

Total # of retries: O(S*H*T)
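
The multiplicative effect on this slide can be made concrete with a small calculation. This is only a sketch under stated assumptions: S, T and H are read here as the maximum number of attempts (first try plus retries) at each layer, and every attempt at one layer is assumed to use the full attempt budget of the layer below. The function name and the sample values are hypothetical.

```python
def worst_case_cassandra_attempts(s: int, t: int, h: int) -> int:
    """Worst-case Cassandra requests caused by one interactive request.

    s: attempts per query made by the Cassandra client (assumed meaning of S)
    t: attempts per call made by the service client (assumed meaning of T)
    h: attempts per request made by the load balancer (assumed meaning of H)
    """
    if min(s, t, h) < 1:
        raise ValueError("each layer makes at least one attempt")
    # Each layer multiplies the work of the layer below it.
    return s * t * h


print(worst_case_cassandra_attempts(3, 3, 3))  # 27
print(worst_case_cassandra_attempts(2, 2, 2))  # 8
```

Dropping each layer from three attempts to two cuts the worst case from 27 to 8, which is why small per-layer limits matter so much.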

Page 26: Cassandra: One (is the loneliest number)

Yes, it’s the saddest experience you’ll ever know

• Dropped requests were retried
• …causing load amplification
• …causing more dropped requests
• …causing even more retries
• …causing misery.
• i.e. too much load leads to much too much load
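
The feedback loop can be illustrated with a toy simulation. This is a deliberately simplified model, not a measurement of the real system: the cluster has a fixed per-tick capacity, anything above capacity is dropped, and each dropped request comes back on the next tick as `retries_per_drop` new requests. All names and numbers are hypothetical.

```python
def simulate(arrivals, capacity, retries_per_drop):
    """Return the load offered to the cluster at each tick."""
    offered_history = []
    retries = 0
    for new_requests in arrivals:
        offered = new_requests + retries
        dropped = max(0, offered - capacity)
        # Every dropped request is retried on the next tick.
        retries = dropped * retries_per_drop
        offered_history.append(offered)
    return offered_history


# Steady load of 80 per tick with a single burst of 150 at tick 2.
arrivals = [80, 80, 150, 80, 80, 80, 80, 80]

abandon = simulate(arrivals, capacity=100, retries_per_drop=0)
amplify = simulate(arrivals, capacity=100, retries_per_drop=3)

print(abandon)  # [80, 80, 150, 80, 80, 80, 80, 80]
print(amplify)  # [80, 80, 150, 230, 470, 1190, 3350, 9830]
```

When clients abandon dropped requests, the load returns to normal right after the burst. With three retries per drop, the same one-tick burst keeps the cluster overloaded indefinitely.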

Page 27: Cassandra: One (is the loneliest number)

How does overload get started?

• Unpredictable workloads
• Could be from request volume
• In our case, from batch-style processes
• Repairs, compaction, application-level tasks (e.g. archiving)

Page 28: Cassandra: One (is the loneliest number)

PagerDuty system architecture

[Diagram: Monitoring Events; Inbound Event Buffer; Data Access; Notification Management; Message Delivery; SMS, Phone Calls; Interactive Requests (from users); Load Balancer; App Host; a single Cassandra Cluster]

Page 29: Cassandra: One (is the loneliest number)

…and more bursts are more worst

[Chart: Workload A + Workload B = Workload A + B]
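
A small numeric example shows why coinciding bursts hurt. The two workload series below are invented for illustration; the only point is that the combined peak depends on whether the bursts line up in time.

```python
# Hypothetical per-tick load for two workloads sharing one cluster.
workload_a = [10, 10, 60, 10, 10, 10]  # burst at tick 2
workload_b = [5, 5, 5, 5, 50, 5]       # burst at tick 4


def combined_peak(a, b, shift=0):
    """Peak of a + b, with b rotated right by `shift` ticks."""
    shifted = b[-shift:] + b[:-shift] if shift else list(b)
    return max(x + y for x, y in zip(a, shifted))


print(combined_peak(workload_a, workload_b))           # 65, bursts interleaved
print(combined_peak(workload_a, workload_b, shift=4))  # 110, bursts aligned
```

The total work is identical in both cases, but the cluster must be sized for 110 rather than 65 if nothing keeps the bursts apart.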

Page 30: Cassandra: One (is the loneliest number)

One (cluster) is the loneliest number that you’ll ever do

• How many ops are A vs. B?
• Must reverse engineer the contributions
• Build (constantly evolving) models
• Hard to reason about system behaviour
• …and gets substantially harder when your entire production stack is overloaded

Page 31: Cassandra: One (is the loneliest number)

How we fixed it

Page 32: Cassandra: One (is the loneliest number)

Stop poking the bear

• Only retry when necessary - is failure an option?
• Less risky to retry user-initiated requests
• Don’t retry retries (much)
• Specifically:
  • Only try a single fallback C* host at the driver level, not N-1
  • Only try a single fallback service host, not M-1
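
The "single fallback host" rule can be sketched generically. This is not the Cassandra driver's retry API or PagerDuty's implementation; hosts are modelled as plain callables, and every name here is hypothetical.

```python
def call_with_single_fallback(hosts, request):
    """Try the first host, then at most one fallback host.

    `hosts` is an ordered list of callables; each takes the request and
    either returns a response or raises an exception.
    """
    if not hosts:
        raise ValueError("no hosts available")
    last_error = None
    for host in hosts[:2]:  # primary plus one fallback, never all N hosts
        try:
            return host(request)
        except Exception as error:
            last_error = error
    raise last_error


calls = []


def make_host(name, healthy):
    """Build a fake host that records each call it receives."""
    def host(request):
        calls.append(name)
        if not healthy:
            raise TimeoutError(f"{name} timed out")
        return f"{name} handled {request}"
    return host


hosts = [make_host("a", False), make_host("b", False), make_host("c", True)]
try:
    call_with_single_fallback(hosts, "read")
except TimeoutError:
    pass

print(calls)  # ['a', 'b']
```

Host "c" is never contacted even though it is healthy. The request fails sooner, but an overloaded cluster sees at most two attempts instead of one per host.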

Page 33: Cassandra: One (is the loneliest number)

Prepare for the worst case

• To avoid overload, must provision for the worst case
• So either scale for the (bursty) stars aligning…
• …or prevent stars from aligning in the first place

Page 34: Cassandra: One (is the loneliest number)

Preventing star-bursts, part 1: coordinate

• Explicit scheduling to interleave bursts
• Repairs, compactions, batch jobs - Cassandra & services
• Automation can help…
• …but still error prone

Page 35: Cassandra: One (is the loneliest number)

Preventing star-bursts, part 2: smooth, not chunky

• Jobs can be done more frequently
• But with smaller batch size
• In the limit, aims for continuous & constant intensity workload
• Some Cassandra options too:
  • Compaction, transfer, and other throttle limits
  • Levelled compaction vs. size-tiered compaction
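
As a rough illustration of "smooth, not chunky", the sketch below spreads a batch job evenly over a time window. The function, its parameters, and the numbers are hypothetical; it only plans the batches and leaves actual execution and pacing to the caller.

```python
def plan_batches(total_items, window_seconds, batch_size):
    """Spread `total_items` over `window_seconds` in evenly spaced batches.

    Returns a list of (start_offset_seconds, item_count) pairs.
    """
    if total_items < 0 or window_seconds <= 0 or batch_size <= 0:
        raise ValueError("invalid batch plan parameters")
    batch_count = -(-total_items // batch_size)  # ceiling division
    if batch_count == 0:
        return []
    interval = window_seconds / batch_count
    plan = []
    remaining = total_items
    for index in range(batch_count):
        size = min(batch_size, remaining)
        plan.append((index * interval, size))
        remaining -= size
    return plan


# One daily job of 24,000 items versus the same work in hourly slices.
chunky = plan_batches(24_000, 86_400, batch_size=24_000)
smooth = plan_batches(24_000, 86_400, batch_size=1_000)

print(chunky)                  # [(0.0, 24000)]
print(len(smooth), smooth[1])  # 24 (3600.0, 1000)
```

Both plans do the same total work, but the largest single burst drops from 24,000 items to 1,000.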

Page 36: Cassandra: One (is the loneliest number)

Preventing star-bursts, part 3: isolation

• Air gap between each workload
• Distinct Cassandra cluster for each service/workload
• Cons:
  • More infrastructure
  • More configuration management
• Pros:
  • Easy to monitor, reason about, diagnose, and scale
  • Reduces the blast radius when failures happen (and they will)

Page 37: Cassandra: One (is the loneliest number)

PagerDuty system architecture: today

[Diagram: Inbound Event Buffer, Notification Management, Message Delivery; three Cassandra Clusters]

Page 38: Cassandra: One (is the loneliest number)

Lessons learned

Page 39: Cassandra: One (is the loneliest number)

What have we learned?

• Retries: the devil’s in the details
• Variable workloads: bad, especially if unpredictable
• Workload peaks: additive, and bad in multiples
• Isolation: the gift that keeps on giving

Page 40: Cassandra: One (is the loneliest number)

One is the loneliest number that you'll ever do
Two can be as bad as one
It's the loneliest number since the number one

No, is the saddest experience you'll ever know
Yes, it's the saddest experience you'll ever know

