The Apache Cassandra storage engine

Post on 17-Dec-2016

©2012 DataStax

Sylvain Lebresne

NoSQL matters 2012

About me

• Sylvain Lebresne

• sylvain@datastax.com

• @pcmanus

1. What is Apache Cassandra

2. Data Model

3. The storage engine

about:project

• Distributed data store aimed at big data.

• Apache project since 2010.

• Version 1.1 released last month.

• Proven in production (Netflix, Twitter, Reddit, Cisco, ...). Largest known cluster has over 300 TB across more than 400 machines.

Apache Cassandra

A database:

• distributed / decentralized

• replicated & durable

• scalable / elastic

• fault-tolerant / no SPOF

• highly available

• data center aware (diagram: US and Europe)

Data Model

• Not SQL (no transactions or joins) but more than Key/Value.

• Inspired by Google BigTable.

• Based on column families.

Users

Ex: user profiles

“For each user, holds profile information”

50e8-e29b    birth_year: 1994    fname: Justin    lname: Bieber

2ab1-f1b7    birth_year: 1978    email: a@kutcher.com    fname: Ashton    lname: Kutcher
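As a rough mental model of the Users example above, a column family can be pictured as a map from row key to a set of named columns, where each row carries only the columns it actually has. The sketch below is a plain Python illustration, not Cassandra's API; the names `users` and `get_columns` are hypothetical.

```python
# Hypothetical in-memory picture of the "Users" column family:
# row key -> {column name: value}. Rows are sparse: only the second
# row has an "email" column.
users = {
    "50e8-e29b": {"birth_year": 1994, "fname": "Justin", "lname": "Bieber"},
    "2ab1-f1b7": {"birth_year": 1978, "email": "a@kutcher.com",
                  "fname": "Ashton", "lname": "Kutcher"},
}


def get_columns(column_family, row_key):
    """Return the (name, value) pairs of one row, sorted by column name."""
    return sorted(column_family.get(row_key, {}).items())


profile = get_columns(users, "50e8-e29b")
```

Here `profile` lists the three columns of the first row in column-name order; an unknown row key simply yields an empty list.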

Timeline

Ex: user’s Tweets

“For each user, tweets he has made”

50e8-e29b

0    @LiveLoveKary glad you had a good birthday #muchlove

1    @NickDeMoura happy bday my dude.

2    @MickyArison @miamiHEAT thanks for the game tonight

3    still a little tired. back in the studio today with Timbaland
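The Timeline example is a "wide row": one row key whose columns are ordered by name (here 0, 1, 2, 3), so a range of tweets can be read as a contiguous slice. A minimal sketch of that idea, assuming integer column names; the class `WideRow` and its methods are hypothetical and only illustrate the ordering, not how Cassandra stores the row.

```python
from bisect import bisect_left, insort


class WideRow:
    """One row whose columns are kept sorted by column name."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # A new column is placed at its sorted position; an existing
        # column is simply overwritten.
        if name not in self._values:
            insort(self._names, name)
        self._values[name] = value

    def slice(self, start, count):
        """Return up to `count` columns whose name is >= `start`, in order."""
        first = bisect_left(self._names, start)
        return [(n, self._values[n]) for n in self._names[first:first + count]]


timeline = WideRow()
# Insertion order does not matter: columns end up sorted by name.
timeline.insert(2, "@MickyArison @miamiHEAT thanks for the game tonight")
timeline.insert(0, "@LiveLoveKary glad you had a good birthday #muchlove")
timeline.insert(3, "still a little tired. back in the studio today with Timbaland")
timeline.insert(1, "@NickDeMoura happy bday my dude.")

two_tweets = timeline.slice(1, 2)
```

`two_tweets` holds columns 1 and 2, in that order.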

There’s more

• Secondary indexes

• Distributed counters

• Composite columns

Goal

• Writes are harder than reads to scale

• Spinning disks aren’t good with random I/O

• Goal: minimize random I/O

A write’s journey

(Diagram, repeated at each step: the Memtable in memory, the Commit log and SSTables on the hard drive.)

1. write(k1, c1:v1) arrives.

   Memtable:    (empty)
   Commit log:  (empty)

2. The write is recorded in both the commit log and the memtable.

   Memtable:    k1 c1:v1
   Commit log:  k1 c1:v1

3. ack

4. write(k2, c1:v1 c2:v2)

   Memtable:    k1 c1:v1
                k2 c1:v1 c2:v2
   Commit log:  k1 c1:v1
                k2 c1:v1 c2:v2

5. write(k1, c1:v4 c3:v3 c2:v2)

   Memtable:    k1 c1:v4 c2:v2 c3:v3
                k2 c1:v1 c2:v2
   Commit log:  k1 c1:v1
                k2 c1:v1 c2:v2
                k1 c1:v4 c3:v3 c2:v2

6. flush: the memtable is written to disk as an SSTable with its index; the commit log is cleaned up.

   SSTable:     k1 c1:v4 c2:v2 c3:v3
                k2 c1:v1 c2:v2
                index

7. more updates

   Memtable:    k1 c1:v5 c4:v4
                k2 c1:v2 c3:v3
   Commit log:  k1 c1:v5 c4:v4
                k2 c1:v2 c3:v3
   SSTable:     k1 c1:v4 c2:v2 c3:v3
                k2 c1:v1 c2:v2
                index

8. flush: a second SSTable is written next to the first one.

   SSTable 1:   k1 c1:v4 c2:v2 c3:v3
                k2 c1:v1 c2:v2
                index
   SSTable 2:   k1 c1:v5 c4:v4
                k2 c1:v2 c3:v3
                index

Write properties

• No reads or seeks

• Only sequential I/O

• Immutable SSTables: easy snapshots
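The write path described in "A write’s journey" can be sketched in a few lines: append to a commit log, update an in-memory table, and later flush that table to a new sorted file. This is an illustrative toy, assuming JSON files and an explicit `flush()` call; the class `Node`, its file names and its on-disk format are hypothetical and do not reflect Cassandra's actual formats, and a real commit log would also need an fsync policy for durability.

```python
import json
import os
import tempfile


class Node:
    """Toy write path: commit log append + memtable update + explicit flush."""

    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.memtable = {}    # row key -> {column name: value}
        self.sstables = []    # paths of flushed files, oldest first
        self.commit_log_path = os.path.join(data_dir, "commitlog")
        self.commit_log = open(self.commit_log_path, "a")

    def write(self, key, columns):
        # 1. Append the mutation to the commit log (sequential I/O only).
        self.commit_log.write(json.dumps([key, columns]) + "\n")
        self.commit_log.flush()
        # 2. Apply it to the memtable; no existing data is read from disk.
        self.memtable.setdefault(key, {}).update(columns)
        # At this point the write can be acknowledged.

    def flush(self):
        # Write the memtable as a new file, sorted by row key and column
        # name. The file is never modified afterwards.
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        rows = [[key, sorted(self.memtable[key].items())]
                for key in sorted(self.memtable)]
        with open(path, "w") as f:
            json.dump(rows, f)
        self.sstables.append(path)
        self.memtable = {}
        # Cleanup: the flushed mutations no longer need the commit log.
        self.commit_log.close()
        self.commit_log = open(self.commit_log_path, "w")
        return path

    def close(self):
        self.commit_log.close()


def demo():
    """Replay the three writes from the slides, flush, and return the rows."""
    with tempfile.TemporaryDirectory() as data_dir:
        node = Node(data_dir)
        node.write("k1", {"c1": "v1"})
        node.write("k2", {"c1": "v1", "c2": "v2"})
        node.write("k1", {"c1": "v4", "c3": "v3", "c2": "v2"})
        path = node.flush()
        with open(path) as f:
            rows = json.load(f)
        node.close()
        return rows
```

Calling `demo()` returns the flushed rows for k1 and k2, sorted by key and by column name. Because flushed files are never rewritten in place, taking a snapshot amounts to keeping a reference to the existing files.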

A read’s journey

read(k1): which SSTables hold columns for k1?

   SSTable 1:   k1 c1:v4 c2:v2 c3:v3
                k2 c1:v1 c2:v2
                index
   SSTable 2:   k1 c1:v5 c4:v4
                k2 c1:v2 c3:v3
                index

The fragments of row k1 are read from both SSTables and merged:

   k1 c1:v5 c2:v2 c3:v3 c4:v4
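The merge step of a read can be sketched as follows: collect the columns stored for the key in every source and, for each column name, keep the most recently written value. Cassandra resolves such conflicts with per-column write timestamps; the timestamps below (2 to 5) are made-up values chosen only so that the second SSTable is newer, and the function `read` is a hypothetical helper, not a Cassandra API.

```python
def read(key, sources):
    """Merge the columns stored for `key` across sources; newest timestamp wins.

    Each source maps row key -> {column name: (value, timestamp)}.
    """
    merged = {}
    for source in sources:
        for name, (value, timestamp) in source.get(key, {}).items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    # Return the columns in column-name order, without the timestamps.
    return {name: value for name, (value, _) in sorted(merged.items())}


# The two SSTables from the slides, with assumed timestamps.
sstable_1 = {
    "k1": {"c1": ("v4", 3), "c2": ("v2", 3), "c3": ("v3", 3)},
    "k2": {"c1": ("v1", 2), "c2": ("v2", 2)},
}
sstable_2 = {
    "k1": {"c1": ("v5", 4), "c4": ("v4", 4)},
    "k2": {"c1": ("v2", 5), "c3": ("v3", 5)},
}

row_k1 = read("k1", [sstable_1, sstable_2])
# row_k1 == {"c1": "v5", "c2": "v2", "c3": "v3", "c4": "v4"}
```

In a fuller sketch the current memtable would simply be one more entry in `sources`.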

Compaction

• Goal: keep the number of SSTables low

• Merge sort against multiple SSTables

• Sequential I/O

   SSTable 1:   k1 c1:v4 c2:v2 c3:v3
                k2 c1:v1 c2:v2
                index
   SSTable 2:   k1 c1:v5 c4:v4
                k2 c1:v2 c3:v3
                index

   Compacted:   k1 c1:v5 c2:v2 c3:v3 c4:v4
                k2 c1:v2 c2:v2 c3:v3
                index
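Since every SSTable is already sorted, compaction can stream through its inputs with a merge sort and never needs random access. A minimal sketch, assuming each SSTable is a list of `(key, column, timestamp, value)` cells sorted by key and column, with made-up timestamps; `compact` is a hypothetical function name.

```python
import heapq
from itertools import groupby


def compact(*sstables):
    """Merge sorted SSTables into one, keeping the newest cell per column.

    Each input is a list of (key, column, timestamp, value) tuples sorted by
    (key, column), with at most one cell per (key, column).
    """
    # heapq.merge reads its sorted inputs lazily and sequentially.
    merged = heapq.merge(*sstables)
    compacted = []
    # Cells for the same (key, column) are now adjacent: keep the newest.
    for _, cells in groupby(merged, key=lambda cell: cell[:2]):
        compacted.append(max(cells, key=lambda cell: cell[2]))
    return compacted


older = [
    ("k1", "c1", 3, "v4"), ("k1", "c2", 3, "v2"), ("k1", "c3", 3, "v3"),
    ("k2", "c1", 2, "v1"), ("k2", "c2", 2, "v2"),
]
newer = [
    ("k1", "c1", 4, "v5"), ("k1", "c4", 4, "v4"),
    ("k2", "c1", 5, "v2"), ("k2", "c3", 5, "v3"),
]

result = compact(older, newer)
```

The output contains one cell per (key, column) pair, matching the compacted SSTable on the slide: k1 with c1:v5 c2:v2 c3:v3 c4:v4 and k2 with c1:v2 c2:v2 c3:v3.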

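SSTables

(Diagram: the parts of an SSTable, split between memory and disk.)

   Memory:       BF (Bloom filter), Index Summary
   Disk, Index:  row keys (k1, k2, k3, ...) with their positions in the data (0, 312, ...)
   Disk, Data:   k1 | Col. BF | Col. Index | c1:v4 c2:v2 c3:v3 ... | k2 | Col. BF | ...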
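One way to picture the relationship between the in-memory index summary and the on-disk index is a sparse sample: the summary keeps only every N-th index entry, a binary search over it narrows the lookup to a small range of the full index, and that range is then scanned. The sketch below works under that assumption; apart from 0 and 312, the offsets are invented, and `build_summary` and `find_offset` are hypothetical helpers.

```python
from bisect import bisect_right


def build_summary(index, interval):
    """Sample every `interval`-th entry of the sorted index.

    Returns (key, position of that key in the index) pairs.
    """
    return [(index[i][0], i) for i in range(0, len(index), interval)]


def find_offset(key, summary, index, interval):
    """Return the data offset of `key`, or None if it is not indexed."""
    summary_keys = [k for k, _ in summary]
    # Last sampled key that is <= the requested key.
    slot = bisect_right(summary_keys, key) - 1
    if slot < 0:
        return None  # the key sorts before everything in this SSTable
    start = summary[slot][1]
    # Scan at most `interval` entries of the full index.
    for indexed_key, offset in index[start:start + interval]:
        if indexed_key == key:
            return offset
    return None


# Full index as it would sit on disk: sorted (row key, data offset) pairs.
index = [("k1", 0), ("k2", 312), ("k3", 520), ("k4", 701), ("k5", 958)]
summary = build_summary(index, interval=2)

offset_k4 = find_offset("k4", summary, index, interval=2)
```

With an interval of 2 the summary holds k1, k3 and k5, and looking up k4 scans only the k3 and k4 entries.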

Optimizations

• Row Cache

• Bloom filters: eliminate whole SSTables

• Key Cache

• Row & Column Indexes

• ...
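To illustrate how a Bloom filter lets a read skip a whole SSTable: the filter answers "definitely not here" or "maybe here" for a row key, so only SSTables that answer "maybe" need to be touched. The sketch below is a generic Bloom filter with assumed sizes and SHA-256 based hashing, not Cassandra's implementation; `BloomFilter` and `candidate_sstables` are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive `num_hashes` bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position // 8] |= 1 << (position % 8)

    def might_contain(self, key):
        # False means the key was never added; True means "maybe".
        return all(self.bits[position // 8] & (1 << (position % 8))
                   for position in self._positions(key))


def candidate_sstables(key, sstables):
    """Return the names of the SSTables whose filter does not rule out `key`."""
    return [name for name, bloom in sstables if bloom.might_contain(key)]


first = BloomFilter()
for row_key in ("k1", "k2"):
    first.add(row_key)
second = BloomFilter()
second.add("k3")

to_read = candidate_sstables("k1", [("sstable-1", first), ("sstable-2", second)])
```

`to_read` always includes `sstable-1`; `sstable-2` is skipped unless its filter happens to report a false positive for k1.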

Other features

• Compression

• Checksums

• Time to live

QUESTIONS?

• http://cassandra.apache.org/

• http://wiki.apache.org/cassandra/

• http://www.datastax.com/docs/1.0
