Cassandra Introduction & Key Features
Meetup Vienna Cassandra Users
13th of January 2014
Definition
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web [The Definitive Guide, Eben Hewitt, 2010]
13/01/2014 Cassandra Introduction & Key Features by Philipp Potisk
History
• Bigtable, 2006
• Dynamo, 2007
• Open source, 2008
Key Features
• Distributed and decentralized
• Elastic scalability
• High availability and fault tolerance
• Tunable consistency
• Column-oriented key-value store
• CQL – an SQL-like query interface
• High performance
Distributed and Decentralized
• Distributed: capable of running on multiple machines
• Decentralized: no single point of failure
• No master-slave issues, thanks to the peer-to-peer architecture ("gossip" protocol)
• A single Cassandra cluster may run across geographically dispersed data centers
• Read and write requests can be sent to any node
[Figure: one cluster of numbered nodes spanning Datacenter 1 and Datacenter 2]
Elastic Scalability
• Cassandra scales horizontally by adding more machines that hold all or some of the data
• Adding nodes increases performance throughput linearly
• Increasing or decreasing the node count happens seamlessly
Scales linearly to terabytes and petabytes of data
[Figure: two node rings compared; the smaller ring has performance throughput = N, the ring with twice as many nodes has performance throughput = N x 2]
Scaling Benchmark By Netflix*
Cassandra scales linearly far beyond our current capacity requirements, and very rapid deployment automation makes it easy to manage. In particular, benchmarking in the cloud is fast, cheap and scalable,
*http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Setup: 48, 96, 144 and 288 instances, with 10, 20, 30 and 60 clients respectively. Each client generated ~20,000 writes/s of 400 bytes each.
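As a rough sanity check on this setup, the write load the clients offer can be worked out in a few lines of Python. This is only arithmetic on the figures quoted above (instance counts, client counts, ~20,000 writes/s per client, 400 bytes per write); it does not reproduce the measured results of the benchmark, and the function name `offered_load` is an illustrative choice.

```python
# Figures quoted above: instances per cluster and the matching client counts.
INSTANCES = [48, 96, 144, 288]
CLIENTS = [10, 20, 30, 60]
WRITES_PER_CLIENT = 20_000   # ~20,000 writes/s generated by each client
BYTES_PER_WRITE = 400        # each write is about 400 bytes


def offered_load(instances, clients):
    """Return (total writes/s, writes/s per instance, MB/s) for one setup."""
    total = clients * WRITES_PER_CLIENT
    per_instance = total / instances
    megabytes = total * BYTES_PER_WRITE / 1_000_000
    return total, per_instance, megabytes


for n, c in zip(INSTANCES, CLIENTS):
    total, per_instance, mb = offered_load(n, c)
    print(f"{n:>3} instances: {total:>9,} writes/s offered, "
          f"{per_instance:,.0f} per instance, {mb:,.0f} MB/s")
```

The offered load per instance is the same in all four setups (about 4,167 writes/s), which is the precondition for reading the results as a linear-scaling test.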
High Availability and Fault Tolerance
• What is high availability?
  • Multiple networked computers operating in a cluster
  • A facility for recognizing node failures
  • Failing over requests to another part of the system
• Cassandra has high availability: no single point of failure due to the peer-to-peer architecture
Tunable Consistency
• Choose between strong and eventual consistency
• Adjustable separately for read and write operations
• Conflicts are resolved during reads, because the focus lies on write performance
The level of consistency depends on the use case
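One common way to resolve conflicts at read time is to attach a timestamp to every write and let the newest value win. The Python sketch below illustrates only that idea; the `Cell` type, the `resolve` function and the sample values are hypothetical and not part of any Cassandra API.

```python
from typing import Iterable, NamedTuple


class Cell(NamedTuple):
    """One replica's answer: a value plus the timestamp of its write (hypothetical type)."""
    value: str
    timestamp: int


def resolve(responses: Iterable[Cell]) -> Cell:
    """Pick the response with the newest timestamp (last write wins)."""
    return max(responses, key=lambda cell: cell.timestamp)


# Two replicas already saw the write at t=2, one still holds the value from t=1.
replica_answers = [
    Cell("old@example.com", 1),
    Cell("new@example.com", 2),
    Cell("new@example.com", 2),
]
print(resolve(replica_answers).value)
```

The write path stays cheap because no replica has to be consulted before writing; the comparison happens only when the data is read.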
[Figure: TUNABLE trade-off between "Available" and "Consistency"]
When do we have strong consistency?
• Simple formula: (nodes_written + nodes_read) > replication_factor
• Ensures that a read always reflects the most recent write
• If not: weak consistency (eventually consistent)
[Figure: example with NW: 2, NR: 2, RF: 3; the replicas of row "jsmith" hold versions t1 and t2]
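The formula can be checked by brute force. The Python sketch below applies the rule and, separately, enumerates every possible set of written and read replicas to see whether they always share at least one node. The function names are illustrative, and replicas are modelled as plain integers.

```python
from itertools import combinations


def is_strongly_consistent(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """Apply the rule (nodes_written + nodes_read) > replication_factor."""
    return nodes_written + nodes_read > replication_factor


def every_read_sees_write(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """Brute force: does every possible read set share a replica with every write set?"""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, nodes_written)
        for read in combinations(replicas, nodes_read)
    )


# NW = 2, NR = 2, RF = 3: any two replicas read overlap any two replicas written.
print(is_strongly_consistent(2, 2, 3), every_read_sees_write(2, 2, 3))
# NW = 1, NR = 1, RF = 3: a read may hit a replica that missed the write.
print(is_strongly_consistent(1, 1, 3), every_read_sees_write(1, 1, 3))
```

Whenever the read set and the write set must overlap, at least one replica contacted by the read holds the most recent write, which is why the inequality is sufficient.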
Column-oriented Key-Value Store
• Data is stored in sparse multidimensional hash tables
• A row can have multiple columns, and rows need not all have the same number of columns
• Each row has a unique key, which also determines partitioning
• No relations!
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
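The nested-map view can be sketched in Python. This is a toy model of the data structure only, not of Cassandra's storage engine; a plain dict stands in for the outer map, sorting on read stands in for the SortedMap, and the names `table`, `put`, `get_row` and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

# RowKey -> {ColumnKey: ColumnValue}; a plain dict stands in for the outer map.
table: Dict[str, Dict[str, str]] = {}


def put(row_key: str, column_key: str, value: str) -> None:
    """Set one column; rows are created on demand and may differ in their columns."""
    table.setdefault(row_key, {})[column_key] = value


def get_row(row_key: str) -> List[Tuple[str, str]]:
    """Return a row's columns ordered by column key, as a SortedMap would iterate."""
    return sorted(table.get(row_key, {}).items())


put("jsmith", "email", "jsmith@example.com")
put("jsmith", "city", "Vienna")
put("adoe", "email", "adoe@example.com")   # sparse: this row has no "city" column

print(get_row("jsmith"))
print(get_row("adoe"))
```

Because each row carries its own set of column keys, a missing column costs nothing: it is simply absent from that row's inner map.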
Row Key1 | ColumnKey1: ColumnValue1 | ColumnKey2: ColumnValue2 | ColumnKey3: ColumnValue3 | …
…
Columns within a row: stored sorted by column key/value
Rows: stored sorted by row key*
* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
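The effect of hashing row keys can be illustrated with a short Python sketch. It is a deliberate simplification: Cassandra itself assigns token ranges to nodes, whereas the sketch just takes the hash modulo the node count. SHA-256 is an arbitrary stand-in hash, and the node names and the `node_for` function are hypothetical.

```python
import hashlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical node names


def node_for(row_key: str) -> str:
    """Map a row key to a node by hashing it (simplified: hash modulo node count)."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")
    return NODES[token % len(NODES)]


# Sequential keys would pile up on one node if placed by sort order;
# after hashing they spread across all nodes.
keys = [f"user{i:05d}" for i in range(10_000)]
print(Counter(node_for(k) for k in keys))
```

Each node ends up with roughly a quarter of the keys even though the keys themselves are strictly sequential.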
CQL – An SQL-like query interface
• “CQL 3 is the default and primary interface into the Cassandra DBMS” *
• Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modelling
* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf
CREATE TABLE songs (
    id uuid PRIMARY KEY,
    title text,
    album text,
    artist text,
    data blob,
    tags set<text>
);

INSERT INTO songs (id, title, artist, album, tags)
VALUES (
    a3e64f8f...,  -- uuid literals are written without quotes in CQL
    'La Grange',
    'ZZ Top',
    'Tres Hombres',
    {'cool', 'hot'}
);

SELECT *
FROM songs
WHERE id = a3e64f8f...;
“SQL-like” but NOT relational SQL
High Performance
• Optimized from the ground up for high throughput
• All disk writes are sequential, append-only operations
• No reading before writing
• Cassandra's threading model is optimized for multiprocessor/multicore machines
Optimized for writing, but fast reads are possible as well
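The write-path idea (append only, no read before write) can be shown with a toy Python store. This is a sketch of the principle, not of Cassandra's implementation, which combines a commit log, in-memory tables and on-disk files; the `AppendOnlyStore` class is hypothetical.

```python
class AppendOnlyStore:
    """Toy key-value store whose writes only ever append to a log (hypothetical class)."""

    def __init__(self) -> None:
        self.log = []  # list of (key, value) records in arrival order

    def write(self, key: str, value: str) -> None:
        # No lookup of the old value: the new record is simply appended.
        self.log.append((key, value))

    def read(self, key: str):
        # The read does the extra work: scan from the newest record backwards.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None


store = AppendOnlyStore()
store.write("jsmith", "t1")
store.write("jsmith", "t2")   # an update is just another append
print(store.read("jsmith"), len(store.log))
```

Writes never touch existing records, so their cost does not depend on how much data is already stored; the price is paid on the read side, which is why real systems add indexes and compaction to keep reads fast.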
Benchmark from 2011 (Cassandra 0.7.4)*
Cassandra showed outstanding throughput in “INSERT-only” with 20,000 ops
• Insert: enter 50 million 1K-sized records
• Read: search key for a one-hour period + optional update
• Hardware: Nehalem 6 Core x 2 CPU, 16 GB memory
* NoSQL Benchmarking by CUBRID, http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
Benchmark from 2013 (Cassandra 1.1.6)*
* Benchmarking Top NoSQL Databases by End Point Corporation, http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf
Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB
When do we need these features?
• Large deployments
• Lots of writes, statistics, and analysis
• Geographical distribution
• Evolving applications
Who is using Cassandra?
eBay Data Infrastructure*

• 10+ clusters
• 100+ nodes
• > 250 TB provisioned (local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day

• Thousands of nodes
• The world's largest cluster with 2K+ nodes

• Thousands of nodes
• > 2K sharded logical hosts
• > 16K tables
• > 27K indexes
• > 140 billion SQLs/day
• > 5 PB provisioned

• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day

• Hundreds of nodes
• > 50 TB
• > 2 billion ops/day

Not replacing RDBMS but complementing!
* by Jay Patel, Cassandra Summit June 2013 San Francisco
Cassandra Use Case at eBay
Application/Use Case
• Time-series data and real-time insights
• Fraud detection & prevention
• Quality Click Pricing for affiliates
• Order & Shipment Tracking
• …
• Server metrics collection
• Taste graph-based next-gen recommendation system
• Social Signals on eBay Product & Item pages
Why Cassandra?
• Multi-Datacenter (active-active)
• No SPOF
• Easy to scale
• Write performance
• Distributed Counters
Cassandra/Hadoop Deployment
Summary
• History
• Key features of Cassandra
  • Distributed and decentralized
  • Elastic scalability
  • High availability and fault tolerance
  • Tunable consistency
  • Column-oriented key-value store
  • CQL interface
  • High performance
• eBay use case
Community portal: http://planetcassandra.org
Documentation: http://www.datastax.com/docs
Apache project: http://cassandra.apache.org