Top five questions to ask when choosing a big data solution

Post on 26-Jan-2015

Five factors to consider when choosing a big data solution!
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra

©2012 DataStax

how do I model my application?

Popular options
• Key/value
• Tabular
• Document
• Graph?

Schema is your friend

{
  "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
  "name": "jbellis",
  "state": "TX",
  "birthdate": "1/1/1976",
  "email_addresses": ["jbellis@gmail", "jbellis@datastax.com"]
}

SQL can be your friend too

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE INDEX ON users(state);

SELECT * FROM users
WHERE state = 'Texas' AND birth_date > '1950-01-01';

Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;

[Slide build: the join-based schema is crossed out with an X.]

Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date,
  email_addresses set<text>
);

UPDATE users
SET email_addresses = email_addresses + {'jbellis@gmail.com', 'jbellis@datastax.com'}
WHERE id = e451dd42-ece3-11e1-a0a3-34159e154f4c;

Joins don't scale
• No joins
• No subqueries
• No aggregation functions* or GROUP BY
• ORDER BY?

SELECT * FROM tweets
WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 'driftx');

[Diagram: followers -> ? -> tweets]

Clustering in Cassandra

CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,
  tweet_author uuid,
  tweet_body text,
  PRIMARY KEY (user_id, tweet_id)
);

user_id  tweet_id    tweet_author  tweet_body
jbellis  3290f9da..  rbranson      lorem
jbellis  3895411a..  tjake         ipsum
...      ...         ...
driftx   3290f9da..  rbranson      lorem
driftx   71b46a84..  yzhang        dolor
...      ...         ...
yukim    3290f9da..  rbranson      lorem
yukim    e451dd42..  tjake         amet
...      ...         ...

SELECT * FROM timeline
WHERE user_id = 'driftx';
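
The timeline table replaces the follower/tweet join with one partition per reader: rows sharing a user_id are stored together and ordered by tweet_id, so the query above touches a single partition. A minimal Python sketch of that idea follows. It assumes tweets are copied into each follower's partition at write time, which the slides imply but do not spell out. The `Timeline` class, its method names, and the integer tweet ids (standing in for timeuuids) are hypothetical and not part of Cassandra.

```python
from bisect import insort
from collections import defaultdict


class Timeline:
    """Toy model of a table with PRIMARY KEY (user_id, tweet_id)."""

    def __init__(self):
        self.followers = defaultdict(set)     # author -> set of followers
        self.partitions = defaultdict(list)   # user_id -> rows sorted by tweet_id

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, tweet_id, author, body):
        # Fan out on write: one row per follower, kept sorted by tweet_id
        # (the clustering column), so no join is needed at read time.
        for user_id in self.followers.get(author, ()):
            insort(self.partitions[user_id], (tweet_id, author, body))

    def read(self, user_id):
        # Equivalent of: SELECT * FROM timeline WHERE user_id = ?
        return list(self.partitions.get(user_id, []))


timeline = Timeline()
timeline.follow("driftx", "rbranson")
timeline.follow("driftx", "yzhang")
timeline.post(2, "yzhang", "dolor")
timeline.post(1, "rbranson", "lorem")
rows = timeline.read("driftx")
```

The cost moves to the write path: a post by an author with many followers produces many rows, which is the trade the slide makes to keep reads to one partition.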

how does it perform?

Larger than memory datasets

Locking

Efficiency

UPDATE users
SET email_addresses = email_addresses + {...}
WHERE user_id = 'jbellis';

Durability

C* storage engine very briefly

[Diagram: the Memtable lives in memory; the Commit log and SSTables live on the hard drive.]

• write(k1, c1:v1): "k1 c1:v1" is appended to the commit log, and the memtable holds k1 c1:v1.
• write(k1, c2:v2): "k1 c2:v2" is appended to the commit log, and the memtable row becomes k1 c1:v1 c2:v2.
• write(k2, c1:v1 c2:v2): "k2 c1:v1 c2:v2" is appended to the commit log, and the memtable gains the row k2 c1:v1 c2:v2.
• write(k1, c1:v4 c3:v3): "k1 c1:v4 c3:v3" is appended to the commit log, and the memtable row becomes k1 c1:v4 c2:v2 c3:v3.
• flush: the memtable is written to an SSTable (k1 c1:v4 c2:v2 c3:v3; k2 c1:v1 c2:v2) together with an index, followed by commit log cleanup.
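
The write path in these slides can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation. The `ToyStorageEngine` class and its method names are hypothetical, the commit log and SSTables are plain in-memory lists rather than files, and the read path merges in flush order, whereas Cassandra reconciles columns by write timestamp.

```python
class ToyStorageEngine:
    """Toy model of the commit log -> memtable -> SSTable write path."""

    def __init__(self):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # key -> {column: value}, held in memory
        self.sstables = []     # immutable, sorted snapshots of flushed memtables

    def write(self, key, columns):
        # Sequential append for durability, then an in-memory update.
        self.commit_log.append((key, dict(columns)))
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out as one sorted, immutable SSTable,
        # then clear the commit log entries it made redundant.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log = []
        return sstable

    def read(self, key):
        # Simplification: merge SSTables oldest-first, then the memtable.
        merged = {}
        for sstable in self.sstables:
            for k, cols in sstable:
                if k == key:
                    merged.update(cols)
        merged.update(self.memtable.get(key, {}))
        return merged


# Replay the writes from the slides.
engine = ToyStorageEngine()
engine.write("k1", {"c1": "v1"})
engine.write("k1", {"c2": "v2"})
engine.write("k2", {"c1": "v1", "c2": "v2"})
engine.write("k1", {"c1": "v4", "c3": "v3"})
log_length_before_flush = len(engine.commit_log)
memtable_k1 = dict(engine.memtable["k1"])
flushed = engine.flush()
```

Every disk operation here is an append or a whole-file write; nothing is updated in place.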

No random writes

[Chart: reads/s and writes/s for Cassandra 0.6 and Cassandra 1.0; axis from 0 to 35000]

how does it handle failure?

Classic partitioning with SPOF

[Diagram: client -> router -> partition 1, partition 2, partition 3, partition 4]

Availability
• “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston, DataStax
• “The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.” -- Rick Branson, Instagram

Fully distributed, no SPOF

[Diagram: a client and a ring of nodes; node labels p1, p1, p1, p3, p6]
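
The ring diagram appears to show one partition (p1) stored on several nodes, with no router between the client and the data. A rough Python sketch of that layout follows, assuming a hash ring with one token per node and replicas placed on the next nodes clockwise. The `Ring` class, the `token` helper, the node names, and the use of MD5 are illustrative assumptions, not Cassandra's actual partitioner or replication strategy.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (hypothetical helper)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Toy token ring: every participant can compute replica placement itself."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        self.ring = sorted((token(node), node) for node in nodes)

    def replicas(self, key, down=()):
        # Walk clockwise from the key's token and take the next rf nodes,
        # then drop any node currently marked as down.
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, token(key)) % len(self.ring)
        count = min(self.rf, len(self.ring))
        owners = [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]
        return [node for node in owners if node not in down]


ring = Ring(["n1", "n2", "n3", "n4", "n5", "n6"])
owners = ring.replicas("jbellis")
survivors = ring.replicas("jbellis", down={owners[0]})
```

Because placement is a pure function of the key and the node list, losing any single node still leaves replicas that can serve the key, and there is no central router to fail.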

Multiple datacenters

how does it scale?

Scaling antipatterns
• Metadata servers
• Router bottlenecks
• Overloading existing nodes when adding capacity
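
One way the last antipattern can show up, offered as an illustration rather than something the slides state, is a placement scheme that reshuffles most keys whenever a node is added. The Python sketch below compares modulo placement with a hash ring that uses virtual nodes. All function names, node names, and the choice of 64 virtual nodes per node are assumptions made for the example.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def modulo_owner(key, nodes):
    # Naive placement: hash modulo the number of nodes.
    return nodes[token(key) % len(nodes)]


def build_ring(nodes, vnodes=64):
    # Each node owns several positions on the ring (virtual nodes).
    return sorted((token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))


def ring_owner(key, ring):
    tokens = [t for t, _ in ring]
    return ring[bisect_right(tokens, token(key)) % len(ring)][1]


def moved_fraction(keys, before, after):
    # Share of keys whose owner changes between two placements.
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user{i}" for i in range(2000)]
old_nodes = ["n1", "n2", "n3", "n4"]
new_nodes = old_nodes + ["n5"]

modulo_moved = moved_fraction(
    keys,
    lambda k: modulo_owner(k, old_nodes),
    lambda k: modulo_owner(k, new_nodes),
)

old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
ring_moved = moved_fraction(
    keys,
    lambda k: ring_owner(k, old_ring),
    lambda k: ring_owner(k, new_ring),
)
```

With modulo placement, roughly four keys in five change owner when going from four to five nodes, so every existing node streams data at once. On the ring, only the keys claimed by the new node move.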

how flexible is it?

Data model: Realtime

LiveStocks
stock  last
GOOG   $95.52
AAPL   $186.10
AMZN   $112.98

StockHist
stock  date        price
GOOG   2011-01-01  $8.23
GOOG   2011-01-02  $6.14
GOOG   2011-01-03  $7.78

Portfolios
user     stock  shares
jbellis  GOOG   80
jbellis  LNKD   20
yukim    AMZN   100

Data model: Analytics

HistLoss
portfolio   worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

Data model: Analytics

10dayreturns
stock  rdate       return
GOOG   2011-07-25  $8.23
GOOG   2011-07-24  $6.14
GOOG   2011-07-23  $7.78
AAPL   2011-07-25  $15.32
AAPL   2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.stock, b.date as rdate, b.price - a.price
FROM StockHist a JOIN StockHist b
  ON (a.stock = b.stock AND date_add(a.date, 10) = b.date);
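
The self-join pairs each StockHist row with the row for the same stock ten days later and subtracts the prices. A small Python sketch of the same logic, for readers less familiar with the join form: the `ten_day_returns` function and the sample prices are made up for illustration and do not come from the slides.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """stock_hist: iterable of (stock, date, price) rows.

    Returns (stock, rdate, return) rows, where rdate is the later date
    and return is the later price minus the earlier price.
    """
    price_on = {(stock, day): price for stock, day, price in stock_hist}
    results = []
    for stock, day, price in stock_hist:
        later = day + timedelta(days=days)
        # Inner join semantics: skip rows with no match ten days later.
        if (stock, later) in price_on:
            results.append((stock, later, round(price_on[(stock, later)] - price, 2)))
    return results


# Hypothetical sample data, not taken from the slides.
sample = [
    ("GOOG", date(2011, 7, 15), 100.00),
    ("GOOG", date(2011, 7, 25), 108.23),
    ("AAPL", date(2011, 7, 15), 50.00),
]
returns = ten_day_returns(sample)
```

Rows without a partner ten days later (the AAPL row here) drop out, as they would in the inner join.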

Data model: Analytics

portfolio_returns
portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b ON (a.stock = b.stock)
GROUP BY portfolio, rdate;

Data model: Analytics

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss
portfolio   worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93
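
The HistLoss query finds each portfolio's minimum return and joins back to recover the date it occurred on. A minimal Python sketch of the same computation, using the portfolio_returns rows shown on the earlier slide: the `hist_loss` function name is hypothetical, and unlike the join, this version keeps only the first date when several dates tie for the minimum.

```python
def hist_loss(rows):
    """rows: iterable of (portfolio, rdate, preturn).

    Returns {portfolio: (worst_date, loss)} using the lowest preturn
    seen for each portfolio.
    """
    worst = {}
    for portfolio, rdate, preturn in rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# Rows as listed in the portfolio_returns slide (a partial sample).
portfolio_returns = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]
losses = hist_loss(portfolio_returns)
```

For Portfolio1 this reproduces the HistLoss row from the slides (2011-07-23, -34.81). The other two portfolios differ because the slide lists only a few of their rows.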

Some Cassandra users