Realtime Analytics on the Twitter Firehose with Cassandra

Post on 12-May-2015

description

Tutorial given by Tom Wilkie at Progressive NoSQL conference, 11/5/12

transcript

Realtime Analytics with Cassandra

or: How I Learned to Stop Worrying and Love Counting

1

What is Realtime Analytics?

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html

2

Introduction

3

Live & historical aggregates...

3

4

Realtime trends...

4

5

Drill downs and roll ups

5

Okay, so how are we going to do it?

For each tweet,

increment a bunch of counters,

such that answering a query

is as easy as reading some counters

6

Preparing the data

Step 1: Get a feed of the tweets

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
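The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the Painbird code: the function names `tokenise`, `minute_bucket` and `increments` are hypothetical, a plain `Counter` stands in for Cassandra counters, and no stemming is done (the slide shows "rocks" counted as "rock", which this sketch does not reproduce).

```python
import re
from collections import Counter
from datetime import datetime


def tokenise(text):
    """Lower-case the tweet and split it into tokens, dropping '#', '@' and punctuation."""
    return re.findall(r"[a-z0-9_']+", text.lower())


def minute_bucket(ts):
    """Name the one-minute time bucket a timestamp falls into, e.g. '1234'."""
    return ts.strftime("%H%M")


def increments(ts, text):
    """Return one ([bucket, token], +1) pair per token in the tweet."""
    bucket = minute_bucket(ts)
    return [((bucket, token), 1) for token in tokenise(text)]


# A Counter stands in for the counter store (assumption: any store with
# atomic increments would do).
counters = Counter()
for key, delta in increments(datetime(2012, 5, 11, 12, 34, 4), "Man, @acunu rocks!"):
    counters[key] += delta

print(sorted(counters.items()))
```

Because every increment is independent, tweets can be processed in any order and by any number of workers.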

7

Querying

Step 1: Do a range query

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Step 2: Result table

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

Step 3: Plot pretty graph

[Graph: vertical axis 0 to 90, horizontal axis May to Nov]

8

Except it’s not that easy...

• Cassandra best practice is to use RandomPartitioner, so it is not possible to do range queries on rows

• Could manually work out each row in range and do lots of point gets

• This would suck: each query would be 100s of random IOs on disk

• Need to use wide rows (denormalisation): a range query is then a column slice, so each query is ~1 IO

9

So instead of this...

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

We do this

Key                 00:01   00:02   ...
[01/05/11, acunu]   3       5       ...
[02/05/11, acunu]   12      4       ...
...                 ...     ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
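To make the layout concrete, here is a small sketch of the key mapping. It assumes day-sized rows and minute-sized columns as in the table; the names `row_key`, `column_key`, `increment` and `slice_row` are hypothetical, and a dict of dicts stands in for a counter column family.

```python
from datetime import datetime


def row_key(ts, term):
    """'Big' bucket: one row per (day, term)."""
    return (ts.strftime("%d/%m/%y"), term)


def column_key(ts):
    """'Small' bucket: one column per minute within the day."""
    return ts.strftime("%H:%M")


# Hypothetical in-memory stand-in: row key -> {column key -> count}.
rows = {}


def increment(ts, term, delta=1):
    """Add delta to the counter at (row_key, column_key)."""
    columns = rows.setdefault(row_key(ts, term), {})
    col = column_key(ts)
    columns[col] = columns.get(col, 0) + delta


def slice_row(key, start="00:00", finish="23:59"):
    """Return (column, count) pairs of one row with start <= column <= finish, in column order."""
    columns = rows.get(key, {})
    return [(c, columns[c]) for c in sorted(columns) if start <= c <= finish]


increment(datetime(2011, 5, 1, 0, 1), "acunu", 3)
increment(datetime(2011, 5, 1, 0, 2), "acunu", 5)
increment(datetime(2011, 5, 2, 0, 1), "acunu", 12)

print(slice_row(("01/05/11", "acunu")))
```

A whole day of per-minute counts for one term now lives in a single row, so reading it back is one slice rather than up to 1,440 point gets.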

10

Demo

./painbird.py -u tom_wilkie

11

Now it’s your turn...

12

1. Get a Twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/O9hkv

3. Cluster them up

4. Get the code - http://goo.gl

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)

13

Get some Cassandra VMs

http://goo.gl/O9hkv

14

Cluster them up

• SSH in, set password (on both!)

• Check you can connect to the UI

• Use UI (click add host)

15

Get the code

SSH into one of the VMs:

# curl https://acunu-oss.s3.amazonaws.com/painbird.tar.gz | tar zxf -

# curl -o pycassa.rpm https://acunu-oss.s3.amazonaws.com/pycassa.rpm

# rpm -i pycassa.rpm

# cd release

# ./painbird.py -u tom_wilkie

16

Implement the “core”

• In core.py

• def insert_tweet(cassandra, tweet):

• def do_query(cassandra, term, start, finish):
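One possible shape for these two functions is sketched below. Everything beyond the two signatures is an assumption: `cassandra` is taken to be an object with `add` and `get` methods in the style of a pycassa `ColumnFamily` on a counter column family, `tweet` is taken to be a dict with `created_at` (a `datetime`) and `text`, and the row key format is invented for illustration. `FakeCounterCF` is a hypothetical in-memory stand-in so the sketch runs without a cluster.

```python
import re
from datetime import datetime, timedelta


class FakeCounterCF:
    """In-memory stand-in with add/get in the style of pycassa.ColumnFamily."""

    def __init__(self):
        self.rows = {}

    def add(self, key, column, value=1):
        row = self.rows.setdefault(key, {})
        row[column] = row.get(column, 0) + value

    def get(self, key, column_start="", column_finish=""):
        # Unlike this stand-in, a real pycassa get raises NotFoundException
        # for a missing row, so real code would need to catch that.
        row = self.rows.get(key, {})
        return {c: row[c] for c in sorted(row)
                if c >= column_start and (not column_finish or c <= column_finish)}


def insert_tweet(cassandra, tweet):
    """Increment one counter per token: row = day + token, column = minute."""
    ts = tweet["created_at"]  # assumed to be a datetime
    for token in re.findall(r"[a-z0-9_']+", tweet["text"].lower()):
        row = "%s-%s" % (ts.strftime("%Y%m%d"), token)  # assumed key format
        cassandra.add(row, ts.strftime("%H:%M"))


def do_query(cassandra, term, start, finish):
    """Return (day, minute, count) tuples for term between start and finish."""
    results = []
    day = start.date()
    while day <= finish.date():
        lo = start.strftime("%H:%M") if day == start.date() else "00:00"
        hi = finish.strftime("%H:%M") if day == finish.date() else "23:59"
        row = "%s-%s" % (day.strftime("%Y%m%d"), term)
        columns = cassandra.get(row, column_start=lo, column_finish=hi)
        for minute, count in columns.items():
            results.append((day.isoformat(), minute, count))
        day += timedelta(days=1)
    return results


cf = FakeCounterCF()
insert_tweet(cf, {"created_at": datetime(2012, 5, 11, 12, 34, 4), "text": "Man, @acunu rocks!"})
insert_tweet(cf, {"created_at": datetime(2012, 5, 11, 12, 35, 0), "text": "acunu acunu"})
print(do_query(cf, "acunu", datetime(2012, 5, 11), datetime(2012, 5, 12)))
```

The query does one column slice per day in the range, which is the wide-row trade-off described earlier: a few sequential reads instead of one point get per minute.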

17

Check your data

-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)

18

Extensions

19

UI

• Pretty graphs

• Automatically periodically update

• Search multiple terms

Painbird

• mentions of multiple terms

• sentiment analysis - http://www.nltk.org/

20