Date post: | 12-Apr-2017 |
Category: |
Software |
Upload: | oleksandr-semenov |
Apache Cassandra for mission critical data
Oleksandr Semenov
Agenda
1) CAP Theorem
2) NoSQL vs RDBMS: advantages and disadvantages
3) What is Cassandra? History.
4) Cassandra features
5) Cassandra datamodel
6) Ways to access data: Thrift, CQL, Kundera ORM
What is NoSQL
NoSQL does not mean "Not SQL".
It means "Not Only SQL" or "Not Relational Database".
CAP Theorem
You can choose only two: Consistency, Availability, Partition tolerance
Choosing AP data stores
Cassandra is an AP store
RDBMS
+ Strong mathematical basis
+ Referential integrity
+ ACID transactions
+ Standard SQL
+ Well-known approaches to data modeling
- Poor performance on large amounts of data
- Scaling issues
NoSQL
+ Great performance
+ Flexible data schema
+ Easy scaling
- Data redundancy
- Integrity must be ensured by the developer in most cases
- Different access interfaces for different stores
- Paradigm shift required
- BASE consistency model instead of ACID transactions
ACID consistency model
• Atomicity: transactions are all or nothing
• Consistency: data written is valid according to all rules
• Isolation: transactions do not affect each other
• Durability: data written will not be lost
BASE consistency model
BASE system example
What is Cassandra?
Cassandra is a:
• non-relational
• highly-scalable
• decentralized
• eventually consistent
key-multivalue storage
History
Who uses Cassandra?
Cassandra Features
• Decentralized: each node has the same role and can process any request
• Replication: Cassandra supports multi-datacenter replication
• Scalable: read and write throughput both increase linearly as new machines are added
• Durable: data, once written, survives hardware failure
Cassandra Features
• Fault-tolerant: data is automatically replicated to multiple nodes for fault-tolerance
• Tunable consistency: you can choose the desired consistency level
• CQL: SQL-like query language
• Very fast IO: both reads and writes are very fast
Availability: partitioning with SPOF
Availability: Cassandra & no SPOF
• Each node can act as router
• Data is replicated to several nodes according to replication factor
Replication Factor
Replication Factor = 3
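To make the replication factor concrete, here is a minimal Python sketch of how a key could be mapped to three replicas on a ring. It assumes a simplified placement (hash the key, then walk clockwise over the next distinct nodes), hypothetical node names and an MD5-based token; it is not Cassandra's actual partitioner or replication strategy.

```python
import bisect
import hashlib

def token(value):
    """Hash a string to a position on the ring (illustrative, MD5-based)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def replicas_for(key, nodes, replication_factor):
    """Return the nodes that would hold `key` under simplified ring placement."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    # A cluster smaller than the replication factor can only hold one copy per node.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names
owners = replicas_for("user:42", nodes, replication_factor=3)
print(owners)
```

With a replication factor of 3, any two of the three owners can be down and the key is still stored on a live node.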
Availability
Tunable consistency
Consistency can be set on a per-operation basis
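The trade-off behind tunable consistency can be sketched with simple arithmetic: a read is guaranteed to see the latest write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (RF). The Python sketch below covers only the levels ONE, QUORUM and ALL; the helper names are illustrative and not part of any driver API.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1

def replicas_required(level, replication_factor):
    """Map a consistency level name to the number of replicas that must respond."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]

def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor

rf = 3
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(write_level, read_level, read_sees_latest_write(write_level, read_level, rf))
```

Writing and reading at ONE is the fastest combination but only eventually consistent; QUORUM on both sides keeps the overlap while tolerating one failed replica out of three.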
Write path in Cassandra
• A write can be sent to any node; that node acts as the coordinator
• Data is written to the commit log (for durability) and then to the memtable
• The memtable is periodically flushed to disk as an SSTable, and a new memtable is created in memory
• Deletes are a special case of writes: tombstones
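The write path above can be illustrated with a toy, single-node Python model. It is a sketch under stated assumptions: the flush is triggered by a small size limit rather than periodically, everything lives in memory, and the class and method names are hypothetical.

```python
import time

class ToyNode:
    """Toy model of one node's write path: commit log -> memtable -> SSTable."""

    TOMBSTONE = object()  # marker stored in place of a deleted value

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory key -> (timestamp, value)
        self.sstables = []     # immutable flushed snapshots, oldest first
        self.memtable_limit = memtable_limit  # assumption: flush by size

    def write(self, key, value, timestamp=None):
        timestamp = time.time() if timestamp is None else timestamp
        self.commit_log.append((timestamp, key, value))  # durability first
        self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key, timestamp=None):
        # A delete is just a write of a tombstone marker.
        self.write(key, self.TOMBSTONE, timestamp)

    def flush(self):
        # Persist the memtable as an immutable SSTable and start a fresh one.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                _, value = table[key]
                return None if value is self.TOMBSTONE else value
        return None

node = ToyNode(memtable_limit=2)
node.write("a", 1, timestamp=1)
node.write("b", 2, timestamp=2)   # reaches the limit and triggers a flush
node.delete("a", timestamp=3)     # tombstone lands in the new memtable
print(node.read("a"), node.read("b"), len(node.sstables))
```

Note that the delete never touches the flushed SSTable: the old value of "a" is still on "disk" and is only hidden by the newer tombstone.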
Read path in Cassandra
• Any server can be queried; it acts as the coordinator
• The coordinator contacts the nodes that hold the requested key
• If the consistency level is lower than ALL, read repair is performed in the background
Read at consistency level = ONE
Read repair
• Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas.
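Read repair as described above can be sketched in a few lines of Python. This is a simplification: replicas are plain dictionaries, and full (timestamp, value) pairs are compared instead of digests; the function name and sample data are hypothetical.

```python
def read_with_repair(replicas, key):
    """Return the newest value for `key` and push it to stale replicas.

    Each replica is a dict mapping key -> (timestamp, value).
    """
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    # The version with the highest timestamp is the most recent one.
    newest = max(versions, key=lambda version: version[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the out-of-date or missing copy
    return newest[1]

replica_1 = {"user:42": (100, "alice@old.example")}
replica_2 = {"user:42": (200, "alice@new.example")}
replica_3 = {}  # missed the write entirely
value = read_with_repair([replica_1, replica_2, replica_3], "user:42")
print(value)
```

After the call all three replicas hold the version written at timestamp 200, so a later read at consistency level ONE returns the same value from any of them.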
Cassandra datamodel

RDBMS       Cassandra
Database    Keyspace
Table       ColumnFamily
Columns     Columns, SuperColumns
ColumnFamilies usage patterns
Static
Dynamic
Columns
A column is a tuple that contains 3 fields: name, value and timestamp
Special column types
• Expiring columns: columns with auto-removal
• Counter columns: columns with auto-increment
• SuperColumns: columns which contain other columns. Deprecated.
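A minimal Python sketch of the column tuple and of an expiring column follows. It assumes timestamps and time-to-live values expressed in whole seconds and uses hypothetical helper names; it only illustrates the idea of name, value, timestamp plus an optional expiry.

```python
from typing import NamedTuple, Optional

class Column(NamedTuple):
    name: str
    value: object
    timestamp: int              # write time, used to pick the newest version
    ttl: Optional[int] = None   # seconds to live; None means never expires

def is_expired(column, now):
    """An expiring column disappears once its time to live has elapsed."""
    return column.ttl is not None and now >= column.timestamp + column.ttl

def newest(a, b):
    """Resolve two versions of the same column: the later timestamp wins."""
    return a if a.timestamp >= b.timestamp else b

session = Column("session_token", "abc123", timestamp=1000, ttl=60)
email_old = Column("email", "old@example.com", timestamp=1000)
email_new = Column("email", "new@example.com", timestamp=1500)

print(is_expired(session, now=1030), is_expired(session, now=1060))
print(newest(email_old, email_new).value)
```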
SuperColumns
Indexes
• Primary index: an index built on the key of each row
• Secondary index: an index on column values; it must be created manually. Good only for low-cardinality columns. Example: a Gender column can have only two values, M and F. And that is a problem.
• Indexing is performed in the background
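Conceptually, a secondary index maps each column value back to the row keys that contain it. The Python sketch below shows that idea on hypothetical in-memory rows; it says nothing about how Cassandra stores or distributes its indexes.

```python
from collections import defaultdict

rows = {  # hypothetical rows: row key -> columns
    "u1": {"name": "Ann", "gender": "F"},
    "u2": {"name": "Bob", "gender": "M"},
    "u3": {"name": "Cat", "gender": "F"},
}

def build_secondary_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index

gender_index = build_secondary_index(rows, "gender")
print(sorted(gender_index["F"]))
```

With only two distinct values the index has just two entries, and each entry grows with the number of rows, which hints at why very few distinct values can be a problem as well.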
Data modelling
• A query-driven approach is required
• How do I get data if I can query only by key?
• Denormalize it!
• Create multiple tables for the same data
• Use fast writes so that reads are as few as possible
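The query-driven approach can be sketched as follows: every query gets its own table, and each write goes to all of them. The Python example uses in-memory dictionaries as stand-ins for tables; the table and function names are hypothetical.

```python
from collections import defaultdict

# Two hypothetical "tables", each keyed for exactly one query.
posts_by_author = defaultdict(list)   # query: all posts written by an author
posts_by_tag = defaultdict(list)      # query: all posts carrying a tag

def save_post(post_id, author, tags, title):
    """Write the same post into every table that a query needs (denormalized)."""
    record = {"post_id": post_id, "author": author, "title": title}
    posts_by_author[author].append(record)
    for tag in tags:
        posts_by_tag[tag].append(record)

save_post(1, "ann", ["cassandra", "nosql"], "Intro to Cassandra")
save_post(2, "bob", ["nosql"], "BASE vs ACID")

# Each read is a single lookup by key; no joins or scans are needed.
print([post["title"] for post in posts_by_tag["nosql"]])
```

One logical write turns into several physical writes, which is acceptable because writes are cheap, and every read stays a single key lookup.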
What is Cassandra good for?
• Time series data (logs, sensor data)
• Write-intensive applications
• Applications with a predefined query model
Never use Cassandra
• If you want to replace a traditional RDBMS with it
• If you can’t tell in which way your data will be queried
• If you have a lot of reads
• If strong consistency is required (financial, medical areas)
• Cassandra is not a silver-bullet solution
Ways to access data
• Thrift: first and native client. Deprecated.
• Hector, Pelops: libraries based on Thrift
• CQL: SQL-like language, very limited
• Kundera: ORM/ONM framework
Thrift
• Apache Thrift is a framework for cross-language services development
• Supported languages: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Smalltalk, OCaml and others
• Was developed by Facebook and released in 2007
• Deprecated
Hector
• Hector is a high-level Java client for Apache Cassandra, currently in use on a number of production systems.
• Includes an incredible number of features
Hector main features
• Security: connection using Kerberos
• Integration with the Speed4j monitoring library
• Hector Object Mapper: a simple ORM (not compliant with JPA)
• Connection pooling
• Failover behavior on the client side
CQL
CQL is a SQL-like language introduced in Cassandra 0.8.
It offers the following functionality:
• Creating/dropping keyspaces, column families, columns and rows
• Inserting/retrieving columns
• Indexing
JOINs are not supported.
Kundera ORM
Kundera is a “Polyglot Object Mapper”. Supports:
◦ Cassandra
◦ HBase
◦ MongoDB
◦ RDBMS
◦ and others
Kundera ORM
• JPA 2.1 compliant
• Supports cross-datastore persistence
• Supports many-to-many relationships
• Allows adding support for any NoSQL store by implementing a Client Extension
Performance Comparison
Benchmarked on an Amazon Ubuntu large instance:
◦ 7.5 GB memory
◦ 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
◦ 64-bit platform
Performance Comparison
Concurrent load – 1 record per thread

Number of threads   Pelops (s)   Hector (s)   Kundera (s)
10                  0.148        0.100        0.117
100                 0.350        0.363        0.361
1000                1.793        1.885        2.180
10000               11.478       11.480       14.262
40000               38.887       37.241       41.977
50000               48.646       47.749       49.285
100000              91.280       92.874       97.707
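The timings in the table above are easier to compare as throughput and relative overhead. The Python sketch below only does arithmetic on the figures from the table; the helper name is hypothetical, and since each thread writes one record, the number of records equals the number of threads.

```python
# (threads, pelops_s, hector_s, kundera_s) copied from the table above
results = [
    (10, 0.148, 0.100, 0.117),
    (100, 0.350, 0.363, 0.361),
    (1000, 1.793, 1.885, 2.180),
    (10000, 11.478, 11.480, 14.262),
    (40000, 38.887, 37.241, 41.977),
    (50000, 48.646, 47.749, 49.285),
    (100000, 91.280, 92.874, 97.707),
]

def overhead_vs_hector(kundera_s, hector_s):
    """Relative extra time Kundera took compared with Hector, in percent."""
    return (kundera_s - hector_s) / hector_s * 100

for threads, pelops_s, hector_s, kundera_s in results:
    throughput = threads / hector_s  # records per second through Hector
    overhead = overhead_vs_hector(kundera_s, hector_s)
    print(f"{threads:>6} threads: {throughput:8.0f} records/s via Hector, "
          f"Kundera overhead {overhead:5.1f}%")
```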
Performance Comparison
[Chart: Concurrent load, 1 record for each thread. Series: Pelops, Hector, Kundera. X axis: number of threads (10 to 100000). Y axis: time, s (0 to 120).]
Performance Comparison
Concurrent + Bulk load – 1000 records per thread

Number of threads   Pelops (s)   Hector (s)   Kundera (s)
10                  5.929        5.286        7.722
100                 34.750       32.228       39.124
1000                368.022      352.711      393.931
Performance Comparison
[Chart: Concurrent + Bulk load, 1000 records per thread. Series: Kundera, Hector, Pelops. X axis: number of threads (10 to 1000). Y axis: time, s (0 to 1200).]
Cassandra limitations
• Keys (and column names) must be smaller than 64K bytes.
• The maximum number of columns per row is 2 billion.
• A single column value may not be larger than 2GB.
• All data read must fit in memory, because Thrift lacks streaming support.
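The size limits above could be checked on the client before a write is sent. The Python sketch below is only an illustration: it assumes "64K" means 64 * 1024 bytes and "2GB" means 2 * 1024^3 bytes, and the function name is hypothetical rather than part of any client library.

```python
MAX_KEY_BYTES = 64 * 1024             # keys and column names: under 64K bytes (assumed 65536)
MAX_COLUMNS_PER_ROW = 2_000_000_000   # 2 billion columns per row
MAX_VALUE_BYTES = 2 * 1024 ** 3       # a single column value: at most 2GB (assumed binary GB)

def check_column(row_key, column_name, value, columns_in_row=0):
    """Return a list of limit violations (empty when the write is acceptable)."""
    problems = []
    if len(row_key) >= MAX_KEY_BYTES:
        problems.append("row key is 64K bytes or larger")
    if len(column_name) >= MAX_KEY_BYTES:
        problems.append("column name is 64K bytes or larger")
    if len(value) > MAX_VALUE_BYTES:
        problems.append("column value is larger than 2GB")
    if columns_in_row >= MAX_COLUMNS_PER_ROW:
        problems.append("row already holds 2 billion columns")
    return problems

print(check_column(b"user:42", b"email", b"a@example.com"))
print(check_column(b"k" * MAX_KEY_BYTES, b"email", b"a@example.com"))
```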
Summary
• Great I/O performance
• Several data access interfaces
• AP data store (CAP)
• Production ready and production proven
• Good for time series data
• Extremely available
References
• Datastax - http://www.datastax.com/docs/1.1/index
• Apache Cassandra - http://cassandra.apache.org/
• All Things Distributed - http://www.allthingsdistributed.com/
• Hector - http://hector-client.github.com/hector/build/html/index.html
• Kundera - https://github.com/impetus-opensource/Kundera
Thank you!