Realtime Analytics on the Twitter Firehose with Cassandra


Realtime Analytics with Cassandra

or: How I Learned to Stop Worrying and

Love Counting

What is Realtime Analytics?

e.g. “show me the number of mentions of

‘Acunu’ per day, between May and November 2011, on Twitter”

Batch (Hadoop) approach would require processing ~30 billion tweets,

or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html

Introduction

Live & historical aggregates...

Realtime trends...

Drill downs and roll ups

Okay, so how are we going to do it?

For each tweet,

increment a bunch of counters,

such that answering a query

is as easy as reading some counters

Preparing the data

Step 1: Get a feed of the tweets

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

12:32:15 I like #trafficlights

12:33:43 Nobody expects...

12:33:49 I ate a #bee; woe is...

12:34:04 Man, @acunu rocks!

[1234, man] +1

[1234, acunu] +1

[1234, rock] +1
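Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration and not the painbird code: the names `tokenise`, `minute_bucket` and `count_tweet` are made up for this example, the counters live in an in-memory `Counter` rather than in Cassandra, and each token is counted at most once per tweet (an assumption). The slide's example also appears to stem words (rocks becomes rock); the sketch does not attempt that.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: a token is a run of lowercase letters, digits or underscores,
# so "#bee" and "@acunu" are reduced to "bee" and "acunu".
TOKEN_RE = re.compile(r"[a-z0-9_]+")

def tokenise(text):
    """Split a tweet into lowercase tokens, dropping punctuation."""
    return TOKEN_RE.findall(text.lower())

def minute_bucket(ts):
    """Map a timestamp to its minute bucket, e.g. 12:34:04 -> '1234'."""
    return ts.strftime("%H%M")

def count_tweet(counters, ts, text):
    """Add 1 to the [bucket, token] counter for each distinct token in the tweet."""
    bucket = minute_bucket(ts)
    for token in set(tokenise(text)):
        counters[(bucket, token)] += 1

counters = Counter()
count_tweet(counters, datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
count_tweet(counters, datetime(2011, 5, 1, 12, 33, 49), "I ate a #bee; woe is...")
```

After these two calls, `counters[("1234", "acunu")]` and `counters[("1233", "bee")]` each hold 1.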

Querying

Step 1: Do a range query

Step 2: Result table

Step 3: Plot pretty graph

start: [01/05/11, acunu]

end: [30/05/11, acunu]

Key #Mentions

[01/05/11 00:01, acunu] 3

[01/05/11 00:02, acunu] 5

... ...

[Graph: mentions of ‘acunu’ over time, May to Nov]

Except it’s not that easy...

• Cassandra best practice is to use RandomPartitioner, so range queries on rows are not possible

• Could manually work out each row in range, do lots of point gets

• This would suck: each query would be 100s of random IOs on disk

• Need to use wide rows (denormalisation): a range query becomes a column slice, and each query is ~1 IO

So instead of this...

Key #Mentions

[01/05/11 00:01, acunu] 3

[01/05/11 00:02, acunu] 5

... ...

We do this

Key 00:01 00:02 ...

[01/05/11, acunu] 3 5 ...

[02/05/11, acunu] 12 4 ...

... ... ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
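The wide-row layout can be modelled with plain Python dicts to show why a per-day query becomes one row read. This is only an in-memory sketch: `rows`, `increment`, `column_slice` and `mentions_per_day` are hypothetical names, and a real column slice would be served by Cassandra rather than by sorting a dict. Column names are zero-padded `HH:MM` strings (an assumption), so string order matches time order.

```python
from collections import defaultdict
from datetime import date, timedelta

# rows[(day, term)][minute] -> count
# The row key is the 'big' bucket (day, term); the column key is the 'small' bucket (minute).
rows = defaultdict(dict)

def increment(day, minute, term, by=1):
    """Bump the counter stored in column `minute` of row (day, term)."""
    row = rows[(day, term)]
    row[minute] = row.get(minute, 0) + by

def column_slice(day, term, start="00:00", finish="23:59"):
    """Return the (minute, count) pairs of one row with start <= minute <= finish."""
    row = rows.get((day, term), {})
    return sorted((m, c) for m, c in row.items() if start <= m <= finish)

def mentions_per_day(term, first_day, last_day):
    """Read one row per day, instead of one point get per minute."""
    totals = {}
    day = first_day
    while day <= last_day:
        totals[day] = sum(count for _, count in column_slice(day, term))
        day += timedelta(days=1)
    return totals

# The values from the table above.
increment(date(2011, 5, 1), "00:01", "acunu", 3)
increment(date(2011, 5, 1), "00:02", "acunu", 5)
increment(date(2011, 5, 2), "00:01", "acunu", 12)
increment(date(2011, 5, 2), "00:02", "acunu", 4)
per_day = mentions_per_day("acunu", date(2011, 5, 1), date(2011, 5, 2))
```

Here `per_day` holds 8 for 1 May and 16 for 2 May, each obtained from a single row.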

Demo

./painbird.py -u tom_wilkie

Now it’s your turn...

1. Get a twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/O9hkv

3. Cluster them up

4. Get the code - http://goo.gl

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)

Get some Cassandra VMs

http://goo.gl/O9hkv

Cluster them up

• SSH in, set password (on both!)

• Check you can connect to the UI

• Use UI (click add host)

Get the code

SSH into one of the VMs:

# curl https://acunu-oss.s3.amazonaws.com/painbird.tar.gz | tar zxf -

# curl -o pycassa.rpm https://acunu-oss.s3.amazonaws.com/pycassa.rpm

# rpm -i pycassa.rpm

# cd release

# ./painbird.py -u tom_wilkie

Implement the “core”

• In core.py

• def insert_tweet(cassandra, tweet):

• def do_query(cassandra, term, start, finish):
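One possible shape for the two functions is sketched below. Everything beyond the two signatures is an assumption: the slides do not say what `cassandra` or `tweet` look like, so the sketch assumes `cassandra` behaves like a pycassa `ColumnFamily` on a counter column family (using only `add` and `multiget`), and that `tweet` is a dict with a `created_at` datetime and a `text` string. The `_row_key` scheme is made up for this example and is not the key format painbird uses. `InMemoryCounters` is a stand-in so the sketch runs without a cluster. The query counts whole days and ignores the time of day in `start` and `finish`.

```python
import re
from datetime import datetime, timedelta

def _row_key(day, term):
    # Hypothetical key scheme: 'big' bucket (day) plus term, e.g. "20110501:acunu".
    return "%s:%s" % (day.strftime("%Y%m%d"), term)

def insert_tweet(cassandra, tweet):
    """Increment one counter per distinct token, in the tweet's minute column."""
    ts = tweet["created_at"]
    column = ts.strftime("%H:%M")  # 'small' bucket
    for term in set(re.findall(r"[a-z0-9_]+", tweet["text"].lower())):
        cassandra.add(_row_key(ts, term), column, 1)

def do_query(cassandra, term, start, finish):
    """Return [(day, mentions)] for every day from start to finish inclusive."""
    days = []
    day = start.date()
    while day <= finish.date():
        days.append(day)
        day += timedelta(days=1)
    # One wide row per day; 1440 columns covers every minute of a day.
    found = cassandra.multiget([_row_key(d, term) for d in days], column_count=1440)
    return [(d, sum(found.get(_row_key(d, term), {}).values())) for d in days]

class InMemoryCounters(object):
    """Stand-in offering the two calls used above; not part of pycassa."""
    def __init__(self):
        self.rows = {}
    def add(self, key, column, value=1):
        row = self.rows.setdefault(key, {})
        row[column] = row.get(column, 0) + value
    def multiget(self, keys, column_count=100):
        # Like pycassa's multiget, rows that do not exist are simply left out.
        return dict((k, self.rows[k]) for k in keys if k in self.rows)

store = InMemoryCounters()
insert_tweet(store, {"created_at": datetime(2011, 5, 1, 12, 34, 4),
                     "text": "Man, @acunu rocks!"})
insert_tweet(store, {"created_at": datetime(2011, 5, 2, 9, 0, 0),
                     "text": "acunu acunu"})
result = do_query(store, "acunu", datetime(2011, 5, 1), datetime(2011, 5, 3))
```

Against a real cluster, the stand-in would be replaced by a `pycassa.columnfamily.ColumnFamily` built from a `pycassa.pool.ConnectionPool`; the keyspace and column family names would be whatever the painbird setup creates.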

Check your data

-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)

Extensions

• Pretty graphs

• Automatically periodically update

• Search multiple terms

Painbird

• mentions of multiple terms

• sentiment analysis - http://www.nltk.org/
