Cassandra for mission critical data

Apache Cassandra for mission critical data

Oleksandr Semenov

Agenda

1) CAP Theorem
2) NoSQL vs RDBMS: advantages and disadvantages
3) What is Cassandra? History.
4) Cassandra features
5) Cassandra datamodel
6) Ways to access data: Thrift, CQL, Kundera ORM

What is NoSQL

NoSQL does not mean "Not SQL".

NoSQL means "Not Only SQL" or "Not Relational Database".

CAP Theorem

You can choose only two of the three: Consistency, Availability, Partition tolerance

Choosing AP data storages

Cassandra is an AP storage

RDBMS

+ Strong mathematical basis
+ Referential integrity
+ ACID transactions
+ Standard SQL
+ Well-known approaches to data modeling
- Poor performance with large amounts of data
- Scaling issues

NoSQL

+ Great performance
+ Flexible data schema
+ Easy scaling
- Data redundancy
- Integrity must be ensured by the developer in most cases
- Different access interfaces for different stores
- Paradigm shift required
- BASE consistency model instead of ACID transactions

ACID consistency model

• Atomicity: transactions are all or nothing
• Consistency: data written is valid according to all rules
• Isolation: transactions do not affect each other
• Durability: data written will not be lost

BASE consistency model

BASE system example

What is Cassandra?

Cassandra is a:
• non-relational
• highly scalable
• decentralized
• eventually consistent
key-multivalue storage

History

Who uses Cassandra?

Cassandra Features

• Decentralized: each node has the same role and can process any request
• Replication: Cassandra supports multi-datacenter replication
• Scalable: read and write throughput both increase linearly as new machines are added
• Durable: data, once written, survives hardware failure
• Fault-tolerant: data is automatically replicated to multiple nodes
• Tunable consistency: you can choose the desired consistency level
• CQL: an SQL-like query language
• Very fast I/O: both reads and writes are very fast

Availability: partitioning with SPOF

Availability: Cassandra & no SPOF

• Each node can act as a router

• Data is replicated to several nodes according to the replication factor

Replication Factor

Replication Factor = 3
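
As a rough illustration of what a replication factor of 3 means, the sketch below hashes keys onto a ring and walks clockwise to pick three replica nodes. The node names, the MD5-based `token` function and the `replicas_for` helper are assumptions made for this example; they mimic the idea of simple ring placement and are not Cassandra's actual partitioner code.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Pick the first node clockwise from the key's token, then the next nodes on the ring."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is greater than the key's token, wrapping around the ring.
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical cluster
owners = replicas_for("user:42", nodes, replication_factor=3)
print(owners)  # three distinct nodes that store the row
```

With five nodes and a replication factor of 3, any two nodes can fail and at least one copy of every row is still reachable.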

Availability

Tunable consistency

Consistency can be set on a per-operation basis
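
The arithmetic behind per-operation consistency can be sketched as follows. A QUORUM is a majority of the replicas, and a read is guaranteed to overlap the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor. The function names are hypothetical, and only a subset of consistency levels is modeled.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation at the given level."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when the read set and the write set must share at least one replica."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(write_level, read_level, read_sees_latest_write(write_level, read_level, 3))
```

Lower levels such as ONE trade this overlap guarantee for lower latency and higher availability.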

Write path in Cassandra

• A write can be sent to any node; that node acts as the coordinator

• Data is written to the commit log (for durability) and then to the memtable

• The memtable is periodically flushed to disk as an SSTable, and a new memtable is created in memory

• Deletes are a special case of writes: they write tombstones
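
The write path above can be sketched as a toy, single-node model. The `Node` class, the size-based flush trigger and the `TOMBSTONE` sentinel are assumptions for illustration only; a real node flushes on its own schedule and stores SSTables as files on disk.

```python
TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class Node:
    """Toy model of one node's write path: commit log, memtable, SSTables."""

    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []   # append-only record of writes, kept for durability
        self.memtable = {}     # in-memory table holding the most recent writes
        self.sstables = []     # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is just a write of a tombstone

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))  # memtable becomes an SSTable
            self.memtable = {}                         # a fresh memtable replaces it
            self.commit_log.clear()                    # flushed writes no longer need the log

    def read(self, key):
        # Newest data first: memtable, then SSTables from newest to oldest.
        for table in [self.memtable] + self.sstables[::-1]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # second write reaches the limit and triggers a flush
node.delete("a")     # tombstone shadows the older value stored in the SSTable
print(node.read("a"), node.read("b"))
```

Note that the old value of `a` still sits in the first SSTable; the tombstone only hides it from reads.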

Read path in Cassandra

• Any server can be queried; it acts as the coordinator

• The coordinator contacts the nodes that hold the requested key

• If the consistency level is lower than ALL, read repair is performed in the background

Read at consistency level = ONE

Read repair

• Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas.
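
A minimal sketch of the read repair idea, assuming each replica is a plain dictionary that maps a key to a `(value, timestamp)` pair. The `digest` and `read_with_repair` helpers are hypothetical names; real nodes exchange digests over the network rather than comparing local dictionaries.

```python
import hashlib


def digest(cell) -> str:
    """Short fingerprint of a (value, timestamp) cell, cheaper to compare than the data."""
    return hashlib.md5(repr(cell).encode("utf-8")).hexdigest()


def read_with_repair(key, replicas):
    """Return the newest cell for `key` and push it to replicas that are out of date."""
    cells = [replica.get(key) for replica in replicas]
    if len({digest(cell) for cell in cells}) == 1:
        return cells[0]  # all replicas agree, nothing to repair
    # Digests differ, so at least one replica holds a cell: pick the highest timestamp.
    newest = max((cell for cell in cells if cell is not None), key=lambda cell: cell[1])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest


replicas = [
    {"k": ("old", 1)},   # stale copy
    {"k": ("new", 2)},   # most recent write
    {},                  # replica that missed the write entirely
]
result = read_with_repair("k", replicas)
print(result, replicas)
```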

Cassandra datamodel

RDBMS      Cassandra
Database   Keyspace
Table      ColumnFamily
Columns    Columns, SuperColumns

ColumnFamilies usage patterns

Static

Dynamic

Columns

A column is a tuple that contains 3 fields: name, value and timestamp

Special column types

• Expiring columns: columns with auto-removal
• Counter columns: columns with auto-increment
• SuperColumns: columns that contain other columns. Deprecated.
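
To make the column tuple and the expiring-column idea concrete, here is a small sketch. The `Column` class, its `ttl` field and the use of whole seconds for timestamps are assumptions chosen for readability, not Cassandra's storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int              # write time, in seconds for this sketch
    ttl: Optional[int] = None   # seconds to live; None means the column never expires

    def is_expired(self, now: int) -> bool:
        """An expiring column disappears once its time to live has elapsed."""
        return self.ttl is not None and now >= self.timestamp + self.ttl


session = Column(name="session", value="abc", timestamp=100, ttl=60)
email = Column(name="email", value="user@example.com", timestamp=100)

print(session.is_expired(now=159), session.is_expired(now=160), email.is_expired(now=10_000))
```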

SuperColumns

Indexes

• Primary index: an index built on the key of each row
• Secondary index: an index on column values that must be created manually. Suitable only for low-cardinality columns (for example, a Gender column that can hold only two values, M and F), which is a limitation.
• Indexing is performed in the background

Data modelling

• A query-driven approach is required
• How do I get data if I can query only by key?
• Denormalize it!
• Create multiple tables for the same data
• Use fast writes so that as few reads as possible are needed
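
The query-driven, denormalized approach can be sketched with two in-memory "tables" that hold the same user data, each keyed for one query. The table names, the `save_user` helper and the sample users are hypothetical.

```python
from collections import defaultdict

# One table per query: lookup by user id, and lookup by city.
users_by_id = {}
users_by_city = defaultdict(dict)


def save_user(user_id: int, name: str, city: str) -> None:
    """Write the same row to every table that serves a query (cheap writes, redundant data)."""
    row = {"id": user_id, "name": name, "city": city}
    users_by_id[user_id] = row
    users_by_city[city][user_id] = row


def names_in_city(city: str):
    """Answer 'who lives in this city?' with a single key lookup, no scan or join."""
    return sorted(row["name"] for row in users_by_city[city].values())


save_user(1, "Alice", "Kyiv")
save_user(2, "Bob", "Lviv")
save_user(3, "Carol", "Kyiv")

print(users_by_id[2]["name"], names_in_city("Kyiv"))
```

Each query is served by exactly one key lookup; the cost is that every write touches both tables and the application must keep them in step.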

What is Cassandra good for?

• Time series data (logs, sensor data)
• Write-intensive applications
• Applications with a predefined query model

Never use Cassandra

• If you want to replace a traditional RDBMS with it

• If you can't tell in which way your data will be queried

• If you have a lot of reads

• If strong consistency is required (financial, medical areas)

Cassandra is not a silver-bullet solution

Ways to access data

• Thrift: the first, native client. Deprecated.
• Hector, Pelops: libraries based on Thrift
• CQL: an SQL-like language, very limited
• Kundera: an ORM/ONM framework

Thrift

• Apache Thrift is a framework for cross-language services development
• Supported languages: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Smalltalk, OCaml and others
• Was developed by Facebook and released in 2007
• Deprecated as a Cassandra client interface

Hector

• Hector is a high-level Java client for Apache Cassandra, currently in use on a number of production systems.

• Includes a large number of features

Hector main features

• Security: connection using Kerberos
• Integration with the Speed4j monitoring library
• Hector Object Mapper: a simple ORM (not compliant with JPA)
• Connection pooling
• Failover behavior on the client side

CQL

CQL is an SQL-like language introduced in Cassandra 0.8. It offers the following functionality:
• Creating/dropping keyspaces, column families, columns and rows
• Inserting/retrieving columns
• Indexing

JOINs are not supported.

Kundera ORM

Kundera is a "Polyglot Object Mapper". Supports:

◦ Cassandra
◦ HBase
◦ MongoDB
◦ RDBMS
◦ and others

• JPA 2.1 compliant
• Supports cross-datastore persistence
• Supports many-to-many relationships
• Allows support for any NoSQL store to be added by implementing a Client Extension

Performance Comparison

Benchmarked on an Amazon Ubuntu large instance:
◦ 7.5 GB memory
◦ 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
◦ 64-bit platform


Concurrent load: 1 record per thread

Threads (1 record each)   Pelops (s)   Hector (s)   Kundera (s)
10                        0.148        0.100        0.117
100                       0.350        0.363        0.361
1000                      1.793        1.885        2.180
10000                     11.478       11.480       14.262
40000                     38.887       37.241       41.977
50000                     48.646       47.749       49.285
100000                    91.280       92.874       97.707
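
The timings above can be turned into a rough throughput figure. The sketch below assumes that each thread writes exactly one record and that the reported time is the total elapsed time for the run; under those assumptions, throughput is simply threads divided by seconds. The `records_per_second` helper is a hypothetical name.

```python
# Times in seconds, copied from the concurrent-load table (1 record per thread).
TIMES = {
    10:     {"Pelops": 0.148,  "Hector": 0.100,  "Kundera": 0.117},
    100:    {"Pelops": 0.350,  "Hector": 0.363,  "Kundera": 0.361},
    1000:   {"Pelops": 1.793,  "Hector": 1.885,  "Kundera": 2.180},
    10000:  {"Pelops": 11.478, "Hector": 11.480, "Kundera": 14.262},
    40000:  {"Pelops": 38.887, "Hector": 37.241, "Kundera": 41.977},
    50000:  {"Pelops": 48.646, "Hector": 47.749, "Kundera": 49.285},
    100000: {"Pelops": 91.280, "Hector": 92.874, "Kundera": 97.707},
}


def records_per_second(records: int, seconds: float) -> float:
    """Throughput, assuming `seconds` is the total elapsed time for all records."""
    return records / seconds


for threads, row in TIMES.items():
    rates = {client: round(records_per_second(threads, s), 1) for client, s in row.items()}
    print(threads, rates)
```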


[Figure: Concurrent load, 1 record for each thread. Time (s) versus number of threads for Pelops, Hector and Kundera.]

Performance Comparison

Concurrent + bulk load: 1000 records per thread

Threads (1000 records each)   Pelops (s)   Hector (s)   Kundera (s)
10                            5.929        5.286        7.722
100                           34.750       32.228       39.124
1000                          368.022      352.711      393.931


[Figure: Concurrent + bulk load, 1000 records per thread. Time (s) versus number of threads for Kundera, Hector and Pelops.]

Cassandra limitations

• The key (and column names) must be smaller than 64 KB.

• The maximum number of columns per row is 2 billion.

• A single column value may not be larger than 2 GB.

• All data read must fit in memory because Thrift lacks streaming support.

Summary

• Great I/O performance
• Several data access interfaces
• AP data store (CAP)
• Production ready and production proven
• Good for time series data
• Extremely available



Thank you!

Date post:	12-Apr-2017
Category:	Software
Upload:	oleksandr-semenov
