Cassandra

Post on 14-Dec-2014

APACHE CASSANDRA

Wide Column Store for Big Data

Kai Spichale

Outline

Motivation

Introduction to Cassandra

Big Data Solution

„Must Haves“ for Big Data?

What do modern businesses need for big data?

A scalable, high-performance database that is easy to use and cost-effective:

Scalable

Performance

Cost Effective

Operational Ease

„Must Haves“ for Big Data?

“Modern businesses need to be able to manage large volumes of realtime data and run analytic and enterprise search operations on that same data as quickly as possible to make business decisions.”

Real-Time Databases

Data Movement (ETL Process)

Analytic/Search Databases

Legacy RDBMS ≠ Big Data

“Big data is comprised of (1) Velocity – how fast the data is coming in; (2) Variety – all types are now being captured; (3) Volume – TBs to PBs of data; (4) Complexity – multi-location, data center, etc.”

“Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”

Trends & Challenges in Data Mngt.

Exponential Data Growth

More Connected Data

Semi-Structured Data

Cloud

Key Value

Graph

Document

Wide Column

Trends & Challenges in Data Mngt.

Exponential Data Growth

More Connected Data

Semi-Structured Data

Cloud

Key Value

Graph

Document

Apache Cassandra

Apache Cassandra

A massively scalable, decentralized, structured data store (aka database).

Project history:

Cassandra is…

Ring of nodes: A, B, C, D, E, F, G, H

O(1) Distributed Hash Table

Sharding, Replication

Elastic

Fault tolerant

No Single Point of Failure

Durable

Node   Token
A      0
B      4
C      8
D      12
E      16
F      20
G      24
H      28
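
The node/token table above can be read as a lookup structure. The following is a minimal sketch of how a key might be mapped to its replica nodes on such a ring. It assumes a toy token space of 0 to 31 (chosen only to match the tokens on the slide), an MD5-based hash reduced modulo 32, and a simple "next nodes clockwise" replica placement. The names RING_SIZE, token_for and replicas_for_token are hypothetical and are not Cassandra APIs; real partitioners use a far larger token range.

```python
import bisect
import hashlib

# Node/token assignment taken from the slide.
RING = [("A", 0), ("B", 4), ("C", 8), ("D", 12),
        ("E", 16), ("F", 20), ("G", 24), ("H", 28)]
TOKENS = [token for _, token in RING]

# Assumption: a toy token space of 0..31 so the slide's tokens divide it evenly.
RING_SIZE = 32


def token_for(key: str) -> int:
    """Hash a row key onto the toy token space (illustrative hash only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for_token(token: int, replication_factor: int = 3) -> list:
    """Return the nodes holding a token.

    The first replica is the first node whose token is >= the key's token,
    wrapping around to node A past the last token. Further replicas are the
    next nodes clockwise on the ring.
    """
    start = bisect.bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][0] for i in range(replication_factor)]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "order:17"):
        t = token_for(key)
        print(key, "-> token", t, "-> replicas", replicas_for_token(t))
```

Finding the owner is a binary search over a small sorted list of tokens that every node can hold locally, so no central directory is needed. This is one way to read the "No Single Point of Failure" bullet.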

Cassandra is…

AP-System (CAP Theorem)

Eventual consistency

Tunable trade-offs: Consistency vs. Latency

Choose between synchronous or asynchronous replication for each update

CAP triangle:

C = Consistency

A = High Availability

P = Partition Tolerance
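
The consistency/latency trade-off can be made concrete with a little arithmetic. The sketch below assumes the common rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas acknowledging the write plus the number consulted by the read exceeds the replication factor. ONE, TWO, THREE, QUORUM and ALL are Cassandra consistency level names; the helper functions required_acks and reads_see_latest_write are hypothetical names used only for illustration.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond at a given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        # Majority of the replicas.
        needed = replication_factor // 2 + 1
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError("unknown consistency level: " + level)
    if needed > replication_factor:
        raise ValueError(level + " needs more replicas than the replication factor")
    return needed


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (W + R > N)."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return w + r > replication_factor


if __name__ == "__main__":
    for w, r in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
        print(w, r, reads_see_latest_write(w, r, replication_factor=3))
```

Waiting for fewer replicas lowers latency but leaves the remaining replicas to be updated asynchronously, which is where eventual consistency comes from.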

Cassandra is…

A BigTable Clone

No schema

Well suited for:

Semi-structured data

Sparse data

Data model (diagram):

Keyspace
  Column Family
    Key -> Row: Column, Column
    Key -> Row: Column
    Key -> Row: Column, Column, Column
  Column Family
    Row: SuperColumn, SuperColumn -> Column, Column, Column, Column
    Row: SuperColumn -> Column, Column, Column
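
To illustrate the keyspace / column family / row / column hierarchy, here is a minimal in-memory sketch using nested dictionaries. It is not how Cassandra stores data on disk. It only shows that each row can carry its own set of columns (sparse data) and that each column holds a value together with a timestamp, where the newest timestamp wins. The function names new_keyspace, insert and get_row, and the sample data, are hypothetical.

```python
from collections import defaultdict


def new_keyspace():
    """keyspace -> column family -> row key -> column name -> (value, timestamp)."""
    return defaultdict(lambda: defaultdict(dict))


def insert(keyspace, column_family, row_key, column, value, timestamp):
    """Write one column; an older timestamp never overwrites a newer one."""
    row = keyspace[column_family][row_key]
    current = row.get(column)
    if current is None or timestamp >= current[1]:
        row[column] = (value, timestamp)


def get_row(keyspace, column_family, row_key):
    """Return the row's columns as {name: value}, ordered by column name."""
    row = keyspace[column_family].get(row_key, {})
    return {name: value for name, (value, _ts) in sorted(row.items())}


if __name__ == "__main__":
    ks = new_keyspace()
    insert(ks, "users", "alice", "email", "alice@example.org", timestamp=1)
    insert(ks, "users", "alice", "city", "Berlin", timestamp=1)
    insert(ks, "users", "bob", "email", "bob@example.org", timestamp=1)  # no city column
    insert(ks, "users", "alice", "city", "Hamburg", timestamp=2)          # newer write wins
    print(get_row(ks, "users", "alice"))
    print(get_row(ks, "users", "bob"))
```

Rows "alice" and "bob" live in the same column family but hold different columns, which is the property that makes the model a fit for semi-structured and sparse data.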

Cassandra-based Big Data Solution

Real-time queries with Cassandra

Distributed Search with Solr

Analytics with Hadoop MapReduce

Cassandra Cluster (Replication)

Summary

Apache Cassandra is an elastically scalable, fault-tolerant data store

Tunable consistency levels

Wide Column: flexible data model without schema

Supports: real-time queries, analytics through Hadoop integration, Solr-based full-text search

Thank you!

Q&A