Date posted: 10-May-2015
Category: Technology
Uploaded by: robert-chen
Agenda
• SQL vs NoSQL
• Why Cassandra
• Cassandra introduction
• Our architecture and design
• Configuration best practice
• How we write data
• How we read data
• Demo
A highly scalable, eventually consistent, distributed, structured key-value store.
Cassandra™ is a highly scalable, high-performance distributed data infrastructure. Offering replication of data across multiple data centers and incremental scalability with no single point of failure, Cassandra is a logical choice when you need reliability without compromising performance. Cassandra is relied upon by leading companies such as Netflix, Twitter, Cisco, Rackspace, Ooyala, Openwave, and many more.
SQL vs NoSQL
• NoSQL
  • "Not only SQL"; schema-free
  • Built for big data
  • Can service heavy read/write workloads
  • Reads are probably not consistent in real time
• SQL
  • Can support complex join relationships
  • Consistent on every read
  • Oracle RAC as the big-data solution? Too expensive
  • Typical RDBMS implementations are tuned for small but frequent read/write transactions, or for large batch transactions with rare writes
  • RDBMSs (they say) have shown poor performance on data-intensive applications, including:
    • Indexing a large number of documents
    • Serving pages on high-traffic websites
    • Handling the volumes of social-networking data
    • Delivering streaming media
Why Cassandra
• To solve our central NetApp filer storage bottleneck
• Chose Cassandra instead of HBase
• No single point of failure
• Fast development
• Big data and a dynamically changing environment
• Good fit for a horizontally scaled production environment
• Low total cost of ownership
  • No special hardware needed, just some x86 boxes
Cassandra Design
• High availability ("a wily hare has three burrows")
• Eventual consistency
  • Trades strong consistency for high availability
  • Lets you choose strong consistency, or allow varying degrees of more relaxed consistency
• Incremental (linearly) scalable. Horizontal!
  • Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations
• Optimistic replication
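The tunable-consistency trade-off above can be reasoned about with the usual quorum rule: with N replicas, a write acknowledged by W nodes and a read asking R nodes overlap (and so the read must see the latest write) whenever R + W > N. A minimal sketch of that check — the helper name is ours for illustration, not a Cassandra API:

```python
def is_strongly_consistent(n_replicas, write_cl, read_cl):
    """True when read and write quorums must overlap, so a read
    is guaranteed to see the latest acknowledged write."""
    return read_cl + write_cl > n_replicas

# QUORUM writes + QUORUM reads on replication factor 3 overlap:
print(is_strongly_consistent(3, 2, 2))   # → True
# ONE/ONE trades that guarantee away for latency:
print(is_strongly_consistent(3, 1, 1))   # → False
```

This is why Cassandra lets you pick strong consistency per operation (e.g. QUORUM/QUORUM) or relax it (e.g. ONE/ONE) on the same cluster.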
Cassandra Design II
• All nodes are identical: decentralized/symmetric
• No master, no SPOF
• Adding nodes is simple
• Distributed, read/write-anywhere design
• Massively scalable peer-to-peer architecture
• Based on the best of Amazon Dynamo and Google BigTable
• Minimal administration
• Multi-datacenter replication
• No caching layer required
Cassandra Design III
• Very fast writes
• Fault tolerant; guaranteed data safety
• Automatic provisioning of new nodes
• Built for big data
• Transparent fault detection and recovery
  • Cassandra uses gossip protocols to detect machine failure and to recover when a machine is brought back into the cluster — all without your application noticing.
Write op
• Writes go to the commit log and to an in-memory table (memtable)
• Periodically the memtable is merged with the on-disk table (SSTable)

[Diagram: a Cassandra node — an update is appended to the log on disk and applied to the memtable in RAM; later the memtable is flushed to an SSTable file on disk.]
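The write path in the diagram can be sketched with a toy model — this is an illustration of the log/memtable/SSTable flow only, not Cassandra's real internals, and the class and threshold are our invention:

```python
# Toy sketch of the write path: every write is appended to a commit log,
# applied to an in-memory memtable, and the memtable is flushed to an
# immutable sorted SSTable once it grows past a threshold.

class ToyNode:
    def __init__(self, flush_threshold=3):
        self.commit_log = []       # append-only durability log (on disk in reality)
        self.memtable = {}         # recent writes, held in RAM
        self.sstables = []         # immutable sorted tables flushed to disk
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable log append
        self.memtable[key] = value             # 2. fast in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                      # 3. "(later)" merge to disk

    def _flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = ToyNode()
for i in range(4):
    node.write("host%d" % i, 0.1 * i)
print(len(node.sstables), len(node.memtable))  # → 1 1 (one flush, one key pending)
```

Because both steps are an append and a hash-map update, writes are sequential-I/O-only — the reason the deck can claim "very fast writes".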
Read op
• The client sends a query to the Cassandra cluster; the full read goes to the closest replica, while the other replicas receive digest queries
• The closest replica returns the result; the others return digest responses
• Read repair runs if the digests differ

[Diagram: client → Cassandra cluster — the query is served by the closest replica (Replica A), Replicas B and C answer digest queries with digest responses, and the result is returned to the client; read repair if digests differ.]
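The digest-query / read-repair flow above can be sketched as follows. This is a simplification we wrote for illustration: real Cassandra resolves conflicts by column timestamp, whereas here we just treat the closest replica as authoritative.

```python
import hashlib

def digest(value):
    """Hash of a value, standing in for Cassandra's digest response."""
    return hashlib.md5(value.encode()).hexdigest()

def read_with_repair(replicas, key):
    """replicas: list of dicts acting as replica stores; the first is 'closest'.
    Full read from the closest replica, digest queries to the rest,
    and read repair of any replica whose digest differs."""
    closest, others = replicas[0], replicas[1:]
    result = closest[key]                        # full data read
    expected = digest(result)
    for replica in others:                       # digest queries
        if digest(replica.get(key, "")) != expected:
            replica[key] = result                # read repair
    return result

a = {"host1": "0.04"}
b = {"host1": "0.04"}
c = {"host1": "0.02"}   # stale replica
print(read_with_repair([a, b, c], "host1"), c["host1"])  # → 0.04 0.04
```

Sending digests instead of full rows keeps the extra replica traffic cheap, while still letting the coordinator detect and heal stale copies.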
Configuration best practice
• Put the data files on good-performance RAID volumes
• Start with Sun JDK 1.6+
• Configure with the Java native libraries
• The clocks on each node must be synchronized to maintain insert-timestamp precision across the cluster
Data collection architecture
• Web UI (Highcharts / jQuery)
• ActiveMQ (message bus)
1. Collect data and send it to ActiveMQ
2. Consume the data and save it to Cassandra
3. Filter the data and show it on the plots
Data structure
• Keyspace: the top-level container, with settings such as the partitioner
• Column family: with settings such as the comparator and type (e.g. Std)
• Column: a (name, value, clock) triple
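The keyspace → column family → row → column nesting can be pictured with plain dicts — an illustration only (the keys mirror the model slides; the tuple layout is our choice):

```python
# keyspace -> column family -> row key -> {column name: (value, clock)}
keyspace = {
    "LoadAvg1": {                             # column family
        "host1_131696": {                     # row key
            "6449": ("0.04", 1316966449),     # column: (value, clock/timestamp)
            "5546": ("0.02", 1316965546),
        },
    },
}

name, (value, clock) = next(iter(keyspace["LoadAvg1"]["host1_131696"].items()))
print(name, value, clock)
```

The clock (a timestamp) on every column is what lets replicas decide which copy of a column is newest during read repair.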
Our Data Model
CoreMetrics (keyspace)
  LoadAvg1 (column family)
    host1_131696 (row)
      Column: 6449, value: 0.04
      Column: 5546, value: 0.02
    host2_131811 (row)
      Column: 8227, value: 0.46
      Column: 9792, value: 1.30
Our Data Model
CoreMetrics (keyspace)
  Primary (column family)
    host1:loadAvg1 (row)
      Column: 1316966449, value: 0.04
      Column: 1316965546, value: 0.02
    host2:loadAvg1 (row)
      Column: 1318118227, value: 0.46
      Column: 1318119792, value: 1.30
Our Meta Data Model
CoreMetrics (keyspace)
  PrimaryMeta (column family)
    host1.com (row)
      Column: loadAvg15:Total, value: 1
    host2 (row)
      Column: loadAvg15:Total, value: 1
Our HBase Data Model
  Primary (column family)
    host1:loadAvg1:1 (row key: host:metric:instance)
      Column: c:1316966449, value: 0.04
      Column: c:1316965546, value: 0.02
    host2:loadAvg1:1 (row key: host:metric:instance)
      Column: c:1318118227, value: 0.46
      Column: c:1318119792, value: 1.30
Our Data Model (II)
• Keyspace: CoreMetrics (the database name), one per application
• Column families: one per metric
  • loadAvg1
  • loadAvg5
  • etc. (about 80 server metrics)
• Rows and columns: inspired by the designs of HBase and OpenTSDB, we split the timestamp between the row key and the column key, which improves read performance tremendously
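The timestamp split can be sketched directly from the model slides: epoch 1316966449 becomes row key `host1_131696` and column name `6449`, i.e. the row key carries the high digits and the column the remainder, bucketing 10,000 seconds of data per row. The helper name is ours:

```python
BUCKET = 10_000  # the slides split 1316966449 into row 131696 / column 6449

def make_keys(host, epoch):
    """Split an epoch timestamp into a row key (host + high digits)
    and a column name (the remainder within the 10,000-second bucket)."""
    row_key = "%s_%d" % (host, epoch // BUCKET)
    column = epoch % BUCKET
    return row_key, column

print(make_keys("host1", 1316966449))  # → ('host1_131696', 6449)
```

A time-range read then touches only a handful of rows and slices contiguous columns inside each, instead of scanning one enormous per-host row.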
How we write to Cassandra
Multiple data loaders connect to the Cassandra nodes on the Thrift port (9160) and insert data like this:

$CLIENT = new Cassandra::CassandraClient($PROTOCOL);   # Thrift client over the open protocol
$CLIENT->set_keyspace($keyspace);                      # e.g. 'CoreMetrics'
$CLIENT->insert($rowkey, $column_parent, $column, $consistency_level);
How we read data from Cassandra
We use pycassa to multiget the rows, and aggregate if too many data points are returned:

get_coremetrics(metric_name, host, stime, etime, samples=1000)
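A hedged sketch of that read path: the pycassa call is shown only as a comment (it needs a live cluster), and the downsampling helper below is our own illustration of "aggregate if too many data points return" — it is not the talk's exact code.

```python
# In production the rows would come from pycassa, roughly:
#   pool = pycassa.ConnectionPool('CoreMetrics')
#   cf = pycassa.ColumnFamily(pool, 'Primary')
#   rows = cf.multiget(row_keys, column_start=stime, column_finish=etime)
# Below: if more than `samples` points come back, average fixed-size
# chunks of consecutive points down to roughly `samples` points.

def aggregate(points, samples=1000):
    """points: list of (timestamp, value) sorted by timestamp."""
    if len(points) <= samples:
        return points
    chunk = -(-len(points) // samples)           # ceiling division
    out = []
    for i in range(0, len(points), chunk):
        group = points[i:i + chunk]
        ts = group[0][0]                         # keep the chunk's first timestamp
        avg = sum(v for _, v in group) / len(group)
        out.append((ts, avg))
    return out

data = [(t, float(t % 7)) for t in range(2500)]
print(len(aggregate(data, samples=1000)))
```

Capping the result near `samples` keeps the Highcharts plots responsive no matter how wide a time range the UI requests.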
Demo: data model view
Demo: graphing the data
Cassandra monitoring
1. Nagios plugin for Cassandra
2. JMX
Thoughts and future
1. Migrate more applications to Cassandra
2. Livestat data (Bids/Listings…)
3. Help other teams to do data collection and graphing?
Reference URLs
• Thrift (12 language bindings!)
  • http://wiki.apache.org/cassandra/ThriftInterface
  • http://thrift.apache.org/download/
• Pycassa
  • http://pycassa.github.com/pycassa/tutorial.html