©2012 DataStax
The Apache Cassandra storage engine
Sylvain Lebresne
NoSQL matters 2012
About me
• Sylvain Lebresne
• @pcmanus
1. What is Apache Cassandra
2. Data Model
3. The storage engine
1. What is Apache Cassandra
about:project
• Distributed data store aimed at big data.
• Apache project since 2010.
• Version 1.1 released last month.
• Proven in production (Netflix, Twitter, Reddit, Cisco, ...). The largest known cluster holds over 300 TB on more than 400 machines.
Apache Cassandra
A database:
• distributed / decentralized
• replicated & durable
• scalable / elastic
• fault-tolerant / no SPOF
• highly available
• data center aware
(Figure: a cluster spanning two data centers, US and Europe.)
2. Data Model
• Not SQL (no transactions, no joins), but more than key/value.
• Inspired by Google BigTable.
• Based on column families.
Ex: user profiles
Column family: Users
“For each user, holds the profile information”
Row 50e8-e29b: birth_year = 1994, fname = Justin, lname = Bieber
Row 2ab1-f1b7: birth_year = 1978, email = [email protected], fname = Ashton, lname = Kutcher
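The Users example can be read as a map from row key to a set of named columns, where rows need not share the same columns. The following is only a sketch of that idea with plain Python dictionaries; `insert`, `get_row` and the placeholder e-mail address are hypothetical and are not Cassandra's API.

```python
# Hypothetical in-memory model of one column family:
# row key -> {column name: value}.
users = {}

def insert(column_family, row_key, columns):
    """Set or overwrite the given columns of one row."""
    column_family.setdefault(row_key, {}).update(columns)

def get_row(column_family, row_key):
    """Return the row's columns ordered by column name."""
    return sorted(column_family.get(row_key, {}).items())

insert(users, "50e8-e29b", {"fname": "Justin", "lname": "Bieber", "birth_year": 1994})
insert(users, "2ab1-f1b7", {"fname": "Ashton", "lname": "Kutcher", "birth_year": 1978})
# Only this row has an email column (placeholder value, not from the slides).
insert(users, "2ab1-f1b7", {"email": "user@example.com"})

for name, value in get_row(users, "2ab1-f1b7"):
    print(name, value)
```

Each row carries its own column names, which is what makes the model "more than key/value" without requiring a fixed schema.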
Ex: user’s Tweets
Column family: Timeline
“For each user, the tweets they have made”
Row 50e8-e29b:
  0: @LiveLoveKary glad you had a good birthday #muchlove
  1: @NickDeMoura happy bday my dude.
  2: @MickyArison @miamiHEAT thanks for the gam tonight
  3: still a little tired. back in the studio today with Timbaland
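In the Timeline example a single row keeps growing, one column per tweet, and the columns stay ordered by name. A minimal sketch of such a wide row, assuming integer column names as on the slide; the `WideRow` class and its `slice` method are hypothetical helpers, not Cassandra's API.

```python
import bisect

class WideRow:
    """One row whose columns are kept sorted by column name (hypothetical helper)."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return the columns with start <= name <= end, in column order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]

timeline = {"50e8-e29b": WideRow()}
row = timeline["50e8-e29b"]
# Insertion order does not matter; the row stays sorted by column name.
row.insert(1, "@NickDeMoura happy bday my dude.")
row.insert(0, "@LiveLoveKary glad you had a good birthday #muchlove")
row.insert(3, "still a little tired. back in the studio today with Timbaland")
row.insert(2, "@MickyArison @miamiHEAT thanks for the gam tonight")

latest_two = row.slice(2, 3)
```

Because the columns are sorted, reading a contiguous range of tweets is a single ordered slice of the row.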
There’s more
• Secondary indexes
• Distributed counters
• Composite columns
3. The storage engine
Goal
• Writes are harder than reads to scale
• Spinning disks aren’t good with random I/O
• Goal: minimize random I/O
A write’s journey
(Diagram, rebuilt as steps. Memory holds the memtable; the hard drive holds the commit log and, after a flush, the SSTables.)
1. write(k1, c1:v1) arrives.
2. The write is appended to the commit log and applied to the memtable.
   Commit log: k1 c1:v1
   Memtable: k1 c1:v1
3. ack is returned.
4. write(k2, c1:v1 c2:v2): appended to the commit log, added to the memtable.
   Commit log: k1 c1:v1 | k2 c1:v1 c2:v2
   Memtable: k1 c1:v1 | k2 c1:v1 c2:v2
5. write(k1, c1:v4 c3:v3 c2:v2): the commit log only grows; in the memtable c1 is overwritten and the columns stay sorted.
   Commit log: k1 c1:v1 | k2 c1:v1 c2:v2 | k1 c1:v4 c3:v3 c2:v2
   Memtable: k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
6. flush: the memtable is written to disk as an SSTable with its index, followed by cleanup.
   SSTable 1: k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
7. more updates fill a new memtable (and the commit log).
   Memtable: k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
8. flush: a second SSTable with its own index is written next to the first.
   SSTable 2: k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
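The write path above (commit log append, memtable update, ack, flush) can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra's implementation: the `Store` class, the JSON line format, the file names and the fsync on every write are all choices made for the illustration.

```python
import json
import os
import tempfile

class Store:
    """Toy write path: commit log append, memtable update, flush to an SSTable file."""

    def __init__(self, directory):
        self.directory = directory
        self.log_path = os.path.join(directory, "commitlog")
        self.log = open(self.log_path, "a")
        self.memtable = {}    # row key -> {column: value}
        self.sstables = []    # paths of flushed files, oldest first

    def write(self, key, columns):
        # 1. Sequential append to the commit log (synced on every write here,
        #    a simplification for the sketch).
        self.log.write(json.dumps([key, columns]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Apply to the memtable; newer values overwrite older ones.
        self.memtable.setdefault(key, {}).update(columns)
        return "ack"

    def flush(self):
        # Write rows sorted by key and columns sorted by name in one sequential pass.
        path = os.path.join(self.directory, "sstable-%d" % len(self.sstables))
        with open(path, "w") as out:
            for key in sorted(self.memtable):
                row = dict(sorted(self.memtable[key].items()))
                out.write(json.dumps([key, row]) + "\n")
        self.sstables.append(path)
        # Cleanup: flushed writes are no longer needed in the memtable or commit log.
        self.memtable = {}
        self.log.close()
        self.log = open(self.log_path, "w")
        return path

store = Store(tempfile.mkdtemp())
store.write("k1", {"c1": "v1"})
store.write("k2", {"c1": "v1", "c2": "v2"})
store.write("k1", {"c1": "v4", "c3": "v3", "c2": "v2"})
first = store.flush()
with open(first) as f:
    flushed = [json.loads(line) for line in f]
```

Note that neither `write` nor `flush` reads from disk or seeks: the commit log is append-only and the SSTable is written front to back.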
Write properties
• No reads or seeks
• Only sequential I/O
• Immutable SSTables: easy snapshots
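Why immutability makes snapshots easy: a file that never changes can be captured by adding a second name for it instead of copying its data. The sketch below assumes snapshots are taken with hard links on a file system that supports them; the `snapshot` function and the directory layout are hypothetical.

```python
import os
import tempfile

def snapshot(sstable_paths, snapshot_dir):
    """Hard-link each immutable file into snapshot_dir; no data is copied."""
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for path in sstable_paths:
        target = os.path.join(snapshot_dir, os.path.basename(path))
        os.link(path, target)   # second directory entry for the same file
        linked.append(target)
    return linked

data_dir = tempfile.mkdtemp()
sstable = os.path.join(data_dir, "sstable-0")
with open(sstable, "w") as f:
    f.write("k1 c1:v4 c2:v2 c3:v3\n")

links = snapshot([sstable], os.path.join(data_dir, "snapshots", "s1"))
```

If the original file is later removed (for example after a compaction), the snapshot entry still refers to the same data.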
A read’s journey
(Diagram, rebuilt as steps.)
1. read(k1): which SSTable holds k1?
   SSTable 1: k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
   SSTable 2: k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
2. Both do, so the k1 columns of each SSTable are located through its index and merged, with the most recent value of each column winning:
   k1 c1:v5 c2:v2 c3:v3 c4:v4
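The merge step of a read can be sketched as follows. The sketch assumes every column carries a write timestamp and that the highest timestamp wins; the timestamps shown (3 and 5) and the `merge_row` helper are made up for the illustration.

```python
def merge_row(fragments):
    """Merge fragments of one row; for each column the highest timestamp wins."""
    merged = {}
    for fragment in fragments:
        for name, (value, timestamp) in fragment.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    # Return plain values, ordered by column name.
    return {name: merged[name][0] for name in sorted(merged)}

# column name -> (value, timestamp); timestamps are illustrative.
sstable_1 = {"k1": {"c1": ("v4", 3), "c2": ("v2", 3), "c3": ("v3", 3)}}
sstable_2 = {"k1": {"c1": ("v5", 5), "c4": ("v4", 5)}}

fragments = [table["k1"] for table in (sstable_1, sstable_2) if "k1" in table]
result = merge_row(fragments)
```

Because the winner is decided per column by timestamp, the order in which the SSTables are visited does not change the result.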
Compaction
• Goal: keep the number of SSTables low
• Merge sort across multiple SSTables
• Sequential I/O
(Diagram: two SSTables are merged into one, each with its index.)
   SSTable 1: k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
   SSTable 2: k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
   Result: k1 c1:v5 c2:v2 c3:v3 c4:v4 | k2 c1:v2 c2:v2 c3:v3
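Since every SSTable is already sorted by row key, compaction can stream through all inputs once and merge them, much like the merge phase of a merge sort. A minimal sketch, assuming each column carries a timestamp and the newest one wins; the `compact` function and the timestamps are illustrative only.

```python
import heapq
from itertools import groupby

def compact(sstables):
    """Merge SSTables, each a list of (key, row) sorted by key, into one sorted list."""
    # heapq.merge reads every input sequentially and yields entries in key order.
    merged_stream = heapq.merge(*sstables, key=lambda entry: entry[0])
    result = []
    for key, group in groupby(merged_stream, key=lambda entry: entry[0]):
        row = {}
        for _, columns in group:
            for name, (value, timestamp) in columns.items():
                if name not in row or timestamp > row[name][1]:
                    row[name] = (value, timestamp)   # keep the newest column
        result.append((key, dict(sorted(row.items()))))
    return result

# column name -> (value, timestamp); timestamps are illustrative.
sstable_1 = [
    ("k1", {"c1": ("v4", 3), "c2": ("v2", 3), "c3": ("v3", 3)}),
    ("k2", {"c1": ("v1", 2), "c2": ("v2", 2)}),
]
sstable_2 = [
    ("k1", {"c1": ("v5", 5), "c4": ("v4", 5)}),
    ("k2", {"c1": ("v2", 4), "c3": ("v3", 4)}),
]

compacted = compact([sstable_1, sstable_2])
values = [(key, {n: v for n, (v, _) in row.items()}) for key, row in compacted]
```

The output is again sorted by key, so it can be written to a new SSTable in one sequential pass.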
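SSTables
(Diagram: layout of one SSTable.)
• Memory: BF (Bloom filter) and Index Summary (a sample of the keys, e.g. k1)
• Disk, Index: keys k1 k2 k3 ... with their positions in Data (0, 312, ...)
• Disk, Data: for each row, the key (k1), Col. BF, Col. Index, then the columns c1:v4 c2:v2 c3:v3 ...; then k2, Col. BF, ...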
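How the pieces of this layout might work together on a lookup: the in-memory summary narrows the search to a small segment of the on-disk index, and the index gives the offset of the row in the data file. The sketch below is a simplified model; the JSON row format, the sampling interval and the function names are assumptions, and the per-row column Bloom filter and column index are left out.

```python
import bisect
import io
import json

def build_sstable(rows, sample_every=2):
    """rows: list of (key, columns) sorted by key. Returns (data, index, summary)."""
    data = io.BytesIO()   # stands in for the data file
    index = []            # stands in for the index file: (key, offset into data)
    for key, columns in rows:
        index.append((key, data.tell()))
        data.write((json.dumps([key, columns]) + "\n").encode())
    # In memory: every Nth index entry as (key, position in the index).
    summary = [(index[i][0], i) for i in range(0, len(index), sample_every)]
    return data, index, summary

def lookup(key, data, index, summary):
    sampled_keys = [k for k, _ in summary]
    slot = bisect.bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None       # key sorts before the first row
    start = summary[slot][1]
    end = summary[slot + 1][1] if slot + 1 < len(summary) else len(index)
    for k, offset in index[start:end]:   # short scan of one index segment
        if k == key:
            data.seek(offset)            # one seek into the data file
            return json.loads(data.readline())[1]
    return None

rows = [
    ("k1", {"c1": "v4", "c2": "v2", "c3": "v3"}),
    ("k2", {"c1": "v1", "c2": "v2"}),
    ("k3", {"c1": "v9"}),
]
data, index, summary = build_sstable(rows)
found = lookup("k2", data, index, summary)
```

Only the summary has to stay in memory; its size is controlled by the sampling interval.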
Optimizations
• Row Cache
• Bloom filters: rule out whole SSTables
• Key Cache
• Row & column indexes
• ...
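A Bloom filter can answer "this SSTable definitely does not contain the key", which lets a read skip that SSTable without touching the disk. The following is a generic sketch of the data structure, not Cassandra's implementation; the bit-array size, the number of hashes and the use of SHA-256 as the hash function are assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# One filter per SSTable, filled with the row keys it holds.
filters = {"sstable-0": BloomFilter(), "sstable-1": BloomFilter()}
filters["sstable-0"].add("k1")
filters["sstable-0"].add("k2")
filters["sstable-1"].add("k2")

# Only SSTables whose filter answers "maybe" need to be read for k2.
candidates = [name for name, bf in filters.items() if bf.might_contain("k2")]
```

A "no" from the filter is always correct; a "maybe" can be wrong, in which case the read checks the SSTable and finds nothing.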
Other features
• Compression
• Checksums
• Time to live
QUESTIONS?
• http://cassandra.apache.org/
• http://wiki.apache.org/cassandra/
• http://www.datastax.com/docs/1.0