Austin Cassandra Meetup re: Atomic Counters

Post on 13-Jul-2015

Cassandra at 46 Labs: Idempotent Counters

July 17, 2014

Who is this guy?

I’m also the Founder, which in Latin means “everyone else gets paid before me.”

~ Literal Translation

What is 46 Labs?

We build realtime telecom analytics and security solutions for Carriers and Enterprises.

Founded in 2012.

Currently handle around half a billion call billing records per day.

Shout Outs

#Cassandra IRC Channel

“Unbelievable resource”

“Thumbs up for the Startup Program”

Nate McCall “Helped us in our time of need”

To all of you who aren’t in that ballpark…feel free to take the pitch and swing away.

Patent Warning

So…we have patented parts of this process related to the handling of telecom analytics and billing records.

Fair warning to the telecom folks in the room.

What is idempotence?

Simple Answer:

You can perform an operation several times and the result is the same as performing it once.

Example:

A “set” is idempotent. An “increment” or “decrement” isn’t. Not just with Cassandra, but with anything, by definition.
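A minimal sketch of the difference, using a plain in-memory map rather than Cassandra. The class `IdempotenceDemo` and its methods are made up for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory key/value store, used only to illustrate the definition.
class IdempotenceDemo {
    private final Map<String, Long> store = new HashMap<>();

    // Idempotent: applying it once or many times leaves the same value.
    void set(String key, long value) {
        store.put(key, value);
    }

    // Not idempotent: every extra application changes the result.
    void increment(String key, long delta) {
        store.merge(key, delta, Long::sum);
    }

    long get(String key) {
        return store.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        IdempotenceDemo db = new IdempotenceDemo();
        for (int i = 0; i < 3; i++) db.set("balance", 10);      // same write, repeated
        for (int i = 0; i < 3; i++) db.increment("calls", 10);  // same write, repeated
        System.out.println(db.get("balance")); // 10
        System.out.println(db.get("calls"));   // 30
    }
}
```

Repeating the set three times is indistinguishable from doing it once. Repeating the increment is not, which is why a retried increment is a problem.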

Why does it matter?

Because counters are NOT atomic in Cassandra.

But why?

Because it is really, really, really hard to do anything atomic and distributed, especially counters.

Since counters aren’t idempotent, by definition, and aren’t atomic in Cassandra, it means that if you repeated the same counter operation 100 times…you might get different results on each run.

So…

???

It means that you can’t use Cassandra counters for anything requiring precision…like billing balances, voting, statistical analysis or any time-series data that must be exact.

The higher the volume and the more nodes you have, the more inaccurate the counters become.

And…?

So I should use MySQL or Couchbase?

If you want atomic counters inside of a database as of today’s date, then maybe.

Hint: We have tried both (and a lot more). They are slow. Like…really slow for this type of operation, and they have hurdles way beyond just being slow.

So, All is Lost?

Is there a chance that a better alternative exists that will allow me to use Cassandra and have atomic and idempotent counters?

Yeap.

But it involves some helpers.

How we do it

RabbitMQ

Our call billing records come off our infrastructure and go into a RabbitMQ cluster.

Hint: you could use Kafka, Redis, 0MQ, etc.

The RabbitMQ queues are a nice and safe place for our messages to sit and wait to be processed.

With RabbitMQ ACKs, we can be sure the messages are fully processed before they are removed.

Workers

We wrote Java workers, whose sole job in life is to:

(You can use whatever language you prefer)

1. Consume messages from Rabbit.

2. Perform in-memory atomic counter operations (increment/decrement).

3. Persist the message to Cassandra.

4. Push a static counter value into Cassandra (i.e. a set instead of an increment) every X seconds.

5. ACK that the operation is complete back to Rabbit.
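The five worker steps can be sketched roughly as below. This is not the 46 Labs code: `RecordQueue`, `CounterStore`, `BillingRecord` and `CounterWorker` are hypothetical stand-ins, not the RabbitMQ client or the Cassandra driver API. Holding the ACKs back until after the counter values have been pushed is one reading of the step order, and it is an assumption here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical message shape: one call billing record affecting one counter.
final class BillingRecord {
    final long deliveryTag;
    final String counterKey;
    final long delta;

    BillingRecord(long deliveryTag, String counterKey, long delta) {
        this.deliveryTag = deliveryTag;
        this.counterKey = counterKey;
        this.delta = delta;
    }
}

// Hypothetical stand-in for the queue client.
interface RecordQueue {
    BillingRecord next();        // returns null when nothing is waiting
    void ack(long deliveryTag);  // tells the broker the record is fully processed
}

// Hypothetical stand-in for the Cassandra session.
interface CounterStore {
    void saveRecord(BillingRecord record);    // persist the raw record
    void setCounter(String key, long value);  // a SET, never an increment
}

final class CounterWorker {
    private final RecordQueue queue;
    private final CounterStore store;
    // Only ever touched by the single thread that runs this worker.
    private final Map<String, Long> counters = new HashMap<>();
    private final List<Long> unacked = new ArrayList<>();

    CounterWorker(RecordQueue queue, CounterStore store) {
        this.queue = queue;
        this.store = store;
    }

    // Steps 1-3: consume, update the in-memory counter, persist the record.
    boolean processOne() {
        BillingRecord record = queue.next();
        if (record == null) return false;
        counters.merge(record.counterKey, record.delta, Long::sum);
        store.saveRecord(record);
        unacked.add(record.deliveryTag);
        return true;
    }

    // Steps 4-5: run every X seconds. Push absolute values, then ACK.
    void flush() {
        counters.forEach((key, value) -> store.setCounter(key, value));
        for (long tag : unacked) queue.ack(tag);
        unacked.clear();
    }
}
```

Calling `flush()` twice in a row writes the same values twice, which is harmless because each write is a set.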

Why is this special?

1. You can stream analytics in realtime.

2. Being in-memory, it is ridiculously fast and lightweight.

3. It’s atomic because each counter constituent is in a single thread.

4. Cassandra can be used to atomically persist the counter.

5. The counter data matches the underlying data used to generate it exactly.
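To illustrate the single-thread point: if only one thread ever touches a counter, increments cannot interleave, so no locks or compare-and-swap loops are needed. The sketch below uses a standard single-threaded `ExecutorService` as the owning thread. The class name `SingleThreadCounter` is hypothetical, and this is only one way to confine a counter to a thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SingleThreadCounter {
    // Plain HashMap: safe only because the owner thread is the sole reader and writer.
    private final Map<String, Long> counters = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    // Any thread may call this; the update itself runs on the owner thread.
    void increment(String key, long delta) {
        owner.execute(() -> counters.merge(key, delta, Long::sum));
    }

    // Reads also run on the owner thread, after all previously queued updates.
    long get(String key) throws Exception {
        Callable<Long> read = () -> counters.getOrDefault(key, 0L);
        return owner.submit(read).get();
    }

    void close() {
        owner.shutdown();
    }
}
```

Producers on any number of threads can call `increment`; the executor's queue serializes the updates so none are lost.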

Wait…

What happens if the worker crashes…it’s all in memory!!!

Refer to step 4 in what our worker’s job is to do: “Push a static counter value into Cassandra (i.e. a set instead of an increment) every X seconds.”

Since we push a static counter value into Cassandra, we now have an idempotent way to recover gracefully in the event of a crash. The worker fires up, asks Cassandra what it should have in its memory, then starts its atomic operations again. This backup worker can come up (Zookeeper) on a different physical or virtual host if needed.
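A rough sketch of that recovery path, with a map standing in for Cassandra. `SnapshotStore`, `RecoverableCounter` and `RecoveryDemo` are hypothetical names, and Zookeeper-based failover is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the table holding the last pushed counter values.
final class SnapshotStore {
    private final Map<String, Long> rows = new HashMap<>();

    void set(String key, long value) { rows.put(key, value); }  // idempotent write
    long read(String key) { return rows.getOrDefault(key, 0L); }
}

final class RecoverableCounter {
    private final String key;
    private final SnapshotStore store;
    private long value;  // in-memory state, lost if the worker dies

    RecoverableCounter(String key, SnapshotStore store) {
        this.key = key;
        this.store = store;
        this.value = store.read(key);  // on start-up, ask the store what we should hold
    }

    void add(long delta) { value += delta; }
    void snapshot() { store.set(key, value); }  // push the absolute value
    long value() { return value; }
}

final class RecoveryDemo {
    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        RecoverableCounter first = new RecoverableCounter("acct-1", store);
        first.add(40);
        first.snapshot();
        first.snapshot();  // a repeated push is harmless because it is a set
        first.add(2);      // not yet pushed when the worker "crashes"

        // A replacement worker, possibly on another host, starts from the snapshot.
        RecoverableCounter replacement = new RecoverableCounter("acct-1", store);
        System.out.println(replacement.value()); // 40
    }
}
```

In this sketch the final `add(2)` is missing after the restart. The assumption is that its message was never ACKed, so the queue would redeliver it to the replacement worker. Making sure a redelivered message is not applied on top of a snapshot that already includes it is outside this sketch.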

You can’t grow!

Since you are limited to a single thread processing a single counter…once you run out of memory or saturate the CPU for that counter you can’t grow!!

Yeap. This is why we shard our data at the application layer and not the worker layer. We abstract scalability further out, knowing we have a finite amount of memory and processing power to play with at the worker level.

We can atomically handle 1M ops/sec from a single worker on a single moderately powered server. If you are taxing that single server you need to re-think your architecture.
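One way to read "shard at the application layer" is that the producer decides which queue, and therefore which worker, owns a given counter key. The sketch below hashes the key to a fixed number of shards. `ShardRouter` and the `billing.shard.N` queue naming are assumptions for illustration, not something the talk specifies.

```java
// Hypothetical router: every record for a given counter key goes to the same shard,
// so exactly one worker thread ever owns that counter.
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // floorMod keeps the result in [0, shardCount) even for negative hash codes.
    int shardFor(String counterKey) {
        return Math.floorMod(counterKey.hashCode(), shardCount);
    }

    // Illustrative queue naming only.
    String queueFor(String counterKey) {
        return "billing.shard." + shardFor(counterKey);
    }
}
```

Adding capacity then means adding shards and workers rather than making any one worker bigger. Note that changing `shardCount` remaps keys, so a real deployment would need a plan for moving counters between workers.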

Does it work?

Sure it does.

We currently process over 2 million counter operations per second using this method.

Questions?

If you think of any that you forgot to ask, you can email me at trevor@46labs.com.