Post on 05-Dec-2014
10/20/14
Watching Your Cassandra Cluster Melt
What is PagerDuty?
Cassandra at PagerDuty
• Used to provide durable, consistent read/writes in a critical pipeline of
service applications
• Scala, Cassandra, Zookeeper.
• Receives ~25 requests a sec
• Each request becomes a handful of operations that are processed asynchronously
• Never lose an event. Never lose a message.
• This has HUGE implications around our design and architecture.
Cassandra at PagerDuty
• Cassandra 1.2
• Thrift API
• Using Hector/Cassie/Astyanax
• Assigned tokens
• Putting off migrating to vnodes
• It is not big data
• Clusters ~10s of GB
• Data in the pipe is considered ephemeral
Cassandra at PagerDuty
[Diagram: three data centers (DC-A, DC-B, DC-C) connected by one ~5 ms link and two ~20 ms links]
• Five (or ten) nodes in three regions
• Quorum CL
• RF = 5
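The arithmetic behind these numbers can be sketched out: with RF = 5 and QUORUM on both reads and writes, each operation needs floor(5/2) + 1 = 3 replica acks, any two quorums overlap (3 + 3 > 5), and losing a whole DC still leaves a quorum. A minimal sketch; the 2/2/1 per-DC replica split is an illustrative assumption, not from the talk:

```python
# Quorum arithmetic for RF = 5 across three DCs.
# The 2/2/1 replica split per DC is a hypothetical illustration.

def quorum(rf: int) -> int:
    """QUORUM consistency level requires floor(RF/2) + 1 replica acks."""
    return rf // 2 + 1

RF = 5
needed = quorum(RF)                  # 3 acks per read and per write
assert needed == 3
assert quorum(RF) + quorum(RF) > RF  # read and write quorums always overlap,
                                     # which is what makes operations consistent

# Losing any single DC still leaves enough replicas for quorum:
replicas_per_dc = {"DC-A": 2, "DC-B": 2, "DC-C": 1}
for dc, lost in replicas_per_dc.items():
    assert RF - lost >= needed, f"losing {dc} would break quorum"
```

The overlap condition (R + W > RF) is the same property the later slide relies on for "events aren't lost, messages aren't repeated."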
Cassandra at PagerDuty
• Operations cross the WAN and take an inter-DC latency hit.
• Since we use it as our pipeline without much of a user-facing front,
we’re not latency sensitive, but throughput sensitive.
• We get consistent read/write operations.
• Events aren’t lost. Messages aren’t repeated.
• We get availability in the face of a loss of entire DC-region.
What Happened?
• Everything fell apart and our critical pipeline began refusing new events and
halted progress on existing ones.
• Caused degraded performance and a three-hour outage in PagerDuty
• Unprecedented flush of in-flight data
• Gory details on the impact found on the PD blog: https://blog.pagerduty.com/
2014/06/outage-post-mortem-june-3rd-4th-2014/
What Happened…
• It was just a semi-regular day…
• …no particular changes in traffic
• …no particular changes in volume
• We had an incident the day before
• Repairs and compactions had been taking longer and longer. They
were starting to overlap on machines.
• We used ‘nodetool disablethrift’ to mitigate load on nodes that
couldn’t handle being coordinators.
• We even disabled nodes and found odd improvements with a
smaller 3/5 cluster (any 3/5).
• The next day, we started a repair that had been foregone…
What happened…
[Graph: 1-minute system load across the cluster]
What we did…
• Tried a few things to mitigate the damage
• Stopped less critical tenants.
• Disabled thrift interfaces
• Disabled nodes
• No discernible effect.
• Left with no choice, we blew away all data and restarted Cassandra fresh
• This only took 10 minutes after committing to do this.
# on each node, with the Cassandra process stopped:
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*
• Then everything was fine and dandy, like sour candy.
So, what happened…?
WHAT WENT HORRIBLY WRONG?
• Multi-tenancy in the Cassandra cluster.
• The operational ease isn’t worth the lost transparency.
• Underprovisioning
• AWS m1.larges
• 2 cores
• 8 GB RAM ← definitely not enough.
• Poor monitoring and high-water marks
• A twisted desire to get everything out of our little cluster
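The "poor monitoring and high-water marks" point is concrete to fix: per-thread-pool pending-task counts are exactly the kind of high-water mark that would have flagged trouble. A hypothetical sketch; the sample output and threshold are illustrative, with the column layout following `nodetool tpstats` (pool name, active, pending, completed, blocked, all-time blocked):

```python
# Sketch: flag thread pools whose Pending count exceeds a high-water mark,
# given raw `nodetool tpstats` output. Sample data is made up for illustration.

SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         2         0       41500213         0                 0
MutationStage                     4       312       98231411         0                 0
FlushWriter                       1        12          88211         3                41
"""

def pending_alerts(tpstats: str, threshold: int = 10) -> list[str]:
    """Return pool names whose Pending count exceeds the threshold."""
    alerts = []
    for line in tpstats.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 6:
            pool, pending = parts[0], int(parts[2])
            if pending > threshold:
                alerts.append(pool)
    return alerts

print(pending_alerts(SAMPLE))  # -> ['MutationStage', 'FlushWriter']
```

Wiring a check like this into an alerting pipeline turns "we noticed repairs overlapping" into a page before the cluster tips over.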
Why we didn’t see it coming…
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER.
• Everything was fine 99% of the time.
• Read/write latencies close to the inter-DC latencies.
• Despite load being relatively high sometimes.
• Cassandra seems to have two modes: fine and catastrophe
• We thought, “we don’t have much data, it should be able to handle this.”
• Thought we must have misconfigured something. We didn’t need to scale up…
What we should have seen…
CONSTANT MEMORY PRESSURE
[Graph: heap usage over time — one pattern annotated “This is bad”, the other “This is good”]
What we should have seen…
• Consistent memtable flushing
• “Flushing CFS(…) to relieve memory pressure”
• Slower repair/compaction times
• Likely related to the memory pressure
• Widening disparity between median and p95 read/write latencies
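The last warning sign is easy to turn into a number: track the ratio of p95 latency to the median and alert when it widens. A minimal sketch, with made-up sample latencies and a simple nearest-rank percentile (a real monitoring system would use its own percentile math):

```python
# Sketch: detect a widening gap between median and p95 latencies,
# one of the warning signs listed above. Sample values are illustrative.
import statistics

def p95_over_median(samples_ms: list[float]) -> float:
    """Ratio of the 95th-percentile latency to the median (nearest-rank p95)."""
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 / statistics.median(ordered)

healthy  = [20, 21, 22, 23, 24, 26, 27, 28, 29, 30]    # p95 close to median
degraded = [20, 21, 22, 23, 24, 26, 30, 45, 300, 400]  # long tail emerging

assert p95_over_median(healthy) < 2
assert p95_over_median(degraded) > 5
```

A median that stays flat while p95 climbs is exactly the "fine 99% of the time" trap from the previous slide.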
What we changed…
THE AFTERMATH WAS ROUGH…
• Immediately replaced all nodes with m2.2xlarges
• 4 cores
• 32 GB RAM
• No more multi-tenancy.
• Required nasty service migrations
• Began watching a lot of pending task metrics.
• Blocked flush writers
• Dropped messages
Lessons Learned
• Cassandra’s performance degradation is steep once it starts.
• Stay ahead of the scaling curve.
• Jump on any warning signs
• Practice scaling. Be able to do it on quick notice.
• Cassandra performance can deteriorate as the data set changes and
asynchronous, eventually consistent work accumulates.
• Just because your latencies were one way doesn’t mean they’re
supposed to be that way.
• Don’t build for multi-tenancy in your cluster.
PS. We’re hiring Cassandra people (enthusiast to expert) for our Realtime and Persistence teams.
Thank you.
http://www.pagerduty.com/company/work-with-us/
http://bit.ly/1ym8j9g