Cassandra in the Netflix Architecture
Cassandra EU, London, March 28th, 2012
Denis Sheahan
Agenda
• Netflix and the Cloud
• Why Cassandra
• Cassandra Deployment @ Netflix
• Cassandra Operations
• Cassandra Lessons Learned
• Scalability
• Open Source
Netflix and the Cloud
With more than 23 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV series.
Netflix.com is almost 100% deployed on the Amazon Cloud
Source: http://ir.netflix.com
Out-Growing the Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
[Chart: 37x growth in API requests from Jan 2010 to Jan 2011, far outstripping datacenter capacity]
Netflix Deployed on AWS
[Diagram: Netflix systems on AWS]
• Content: Video Masters, EC2, S3, CDNs
• Logs: S3, EMR Hadoop, Hive, Business Intelligence
• Play: DRM, CDN routing, Bookmarks, Logging
• WWW: Sign-Up, Search, Movie Choosing, Ratings
• API: Metadata, Device Config, TV Movie Choosing, Social Facebook
• CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
Netflix AWS Cost
Why Cassandra
Distributed Key-Value Store vs Central SQL Database
• Datacenter had a central database
• Schema changes required downtime
• The cloud, in contrast, has many key-value data stores
– Joins take place in Java code
– No schema to change, no scheduled downtime
Goals for moving from the Netflix DC to the Cloud
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
• Available
– Substantially higher robustness and availability than datacenter services
• Productive
– Optimize agility of a large development team with automation and tools
Cassandra
• Faster
– Low latency, low latency variance
• Scalable
– Supports running on Amazon EC2
– High and scalable read and write throughput
– Support for multi-region clusters
• Available
– We value Availability over Consistency; Cassandra is Eventually Consistent
– Supports Amazon Availability Zones
– Data integrity checks and repairs
– Online snapshot backup, restore/rollback
• Productive
– We want FOSS + support
Cassandra Deployment
Netflix Cassandra Use Cases
• Many different profiles
– Read-heavy environments with a strict SLA
– Batch-write environments (70 rows per batch) also serving low-latency reads
– Read-modify-write environments with large rows
– Write-only environments with rapidly increasing data sets
– And many more…
How much we use Cassandra
30       Number of production clusters
12       Number of multi-region clusters
3        Max regions, one cluster
65       Total TB of data across all clusters
472      Number of Cassandra nodes
72/28    Largest Cassandra cluster (nodes / data in TB)
6k/250k  Max reads/writes per second
Deployment Architecture
[Diagram: AWS deployment. Front End Load Balancer → API Proxy → Load Balancer → API (registered with the Discovery Service) → Component Services with memcached, backed by Cassandra on EC2 internal disks, with backups to S3]
High Availability Deployment
• Fact of life: EC2 instances die
• We store 3 local replicas in 3 different Cassandra nodes
– One replica per EC2 Availability Zone (AZ)
• Minimum cluster configuration is 6 nodes, 2 per AZ
– A single instance failure still leaves at least one node in each AZ
• Use Local Quorum for writes
• Use Consistency Level One for reads (configuration sketched below)
• Entire cluster replicated in multi-region deployments
• AWS Availability Zones
– Separate buildings
– Separate power etc.
– Fairly close together
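A minimal configuration sketch of these consistency choices, written against the Astyanax client described later in this deck. The cluster, keyspace, pool and seed names are placeholders, not Netflix's:

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class HaKeyspaceFactory {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("demo_cluster")        // placeholder
            .forKeyspace("demo_keyspace")      // placeholder
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                // Writes commit on 2 of the 3 zone replicas before returning
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                // Reads are served by any single replica for lowest latency
                .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_ONE))
            .withConnectionPoolConfiguration(
                new ConnectionPoolConfigurationImpl("demo_pool")
                    .setPort(9160)
                    .setMaxConnsPerHost(3)
                    .setSeeds("127.0.0.1:9160"))   // placeholder seed list
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getEntity();  // Astyanax 1.x; later versions use getClient()
    }
}
```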
Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zones, Token Aware
[Diagram: token-aware clients writing to six Cassandra nodes (with local disks), two in each of Zones A, B and C]
• Client writes to nodes and zones (a code sketch of one such write follows this slide)
• Nodes return ack to client
• Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
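To make the flow concrete, a hedged sketch of one token-aware write through Astyanax; the column family, row key and column are invented for the example:

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class BookmarkWriter {
    // Hypothetical column family: row key = customer id, column = bookmark name
    private static final ColumnFamily<String, String> CF_BOOKMARKS =
        new ColumnFamily<String, String>("bookmarks",
            StringSerializer.get(), StringSerializer.get());

    public static void writeBookmark(Keyspace keyspace) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        // Token awareness routes this straight to a replica that owns the row key
        m.withRow(CF_BOOKMARKS, "customer-1234")
         .putColumn("last_position", "s01e03@00:21:45", null);  // null = no TTL
        // With CL_LOCAL_QUORUM this returns once 2 of 3 zone replicas ack;
        // SSTable flushes and compactions happen later, asynchronously
        m.execute();
    }
}
```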
Extending to Multi-Region
In production for UK/Ireland support
• Create cluster in EU
• Backup US cluster to S3
• Restore backup in EU
• Local repair EU cluster
• Global repair/join
[Diagram: US clients writing to a six-node US cluster and EU clients to a six-node EU cluster, two Cassandra nodes per Zone A/B/C in each region, with the backup shipped between regions via S3]
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
• Client writes to local replicas
• Local write acks returned to client, which continues when 2 of 3 local nodes are committed
• Local coordinator writes to remote coordinator
• When data arrives, remote coordinator node acks
• Remote coordinator sends data to other remote zones
• Remote nodes ack to local coordinator
• Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
[Diagram: the same US (Local) and EU (Remote) six-node clusters, showing the local quorum write in the US region fanning out through the remote coordinator to the EU zones]
Priam – Cassandra Automation
Available at http://github.com/netflix
• Open source Tomcat code running as a sidecar on each Cassandra node; deployed as a separate rpm
• Zero-touch auto-configuration
• Token allocation and assignment, including multi-region (sketched below)
• Broken node replacement and ring expansion
• Full and incremental backup/restore to/from S3
• Metrics collection and forwarding via JMX
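A toy sketch of the evenly-spaced token math behind that allocation, for the RandomPartitioner's 0 to 2^127 ring. Priam's real implementation also offsets tokens per region so multi-region rings interleave; that detail is omitted here:

```java
import java.math.BigInteger;

public class TokenAllocator {
    // RandomPartitioner token space: 0 .. 2^127
    private static final BigInteger RING_SIZE = BigInteger.valueOf(2).pow(127);

    // Evenly space nodeCount tokens around the ring
    public static BigInteger tokenFor(int position, int nodeCount) {
        return RING_SIZE.multiply(BigInteger.valueOf(position))
                        .divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        int nodes = 6;  // the minimum cluster above: 2 nodes in each of 3 AZs
        for (int i = 0; i < nodes; i++) {
            System.out.println("node " + i + " initial_token: " + tokenFor(i, nodes));
        }
    }
}
```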
Cassandra Backup/Restore
[Diagram: every Cassandra node in the ring backs up its SSTables to S3]
• Full Backup
– Time-based snapshot
– SSTables compressed and copied to S3
• Incremental
– An SSTable write triggers a compressed copy to S3 (see the sketch below)
• Archive
– Copy cross-region
• Restore
– Full restore, or create a new ring from a backup
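A heavily hedged sketch of the incremental path: watch the directory where Cassandra hard-links freshly flushed SSTables (with incremental backups enabled in cassandra.yaml) and push new files to S3. Priam's real implementation compresses, batches and retries; the bucket name and data path here are made up:

```java
import java.nio.file.*;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class IncrementalBackupWatcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical keyspace data directory; Cassandra hard-links each
        // flushed SSTable into its backups/ subdirectory
        Path backups = Paths.get("/var/lib/cassandra/data/demo_ks/backups");
        AmazonS3 s3 = new AmazonS3Client();  // credentials from the environment
        WatchService watcher = FileSystems.getDefault().newWatchService();
        backups.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();  // block until a new SSTable appears
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = backups.resolve((Path) event.context());
                // Real code would compress before upload, as the slide notes
                s3.putObject("demo-backup-bucket",
                             "sstables/" + file.getFileName(), file.toFile());
            }
            key.reset();
        }
    }
}
```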
Cassandra Operations
Consoles, Monitors and Explorers
• Netflix Application Console (NAC)
– Primary AWS provisioning/config interface
• Epic counters
– Primary method for issue resolution
• Dashboards
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• AWS Usage Analyzer
– Breaks down costs by application and resource
Cloud Deployment Model
[Diagram: an Elastic Load Balancer in front of an Auto Scaling Group; the group's Launch Configuration ties a Security Group and an Amazon Machine Image to the running Instances]
NAC
• Netflix Application Console (NAC) is Netflix's primary tool, in both Prod and Test, to:
– Create and destroy applications
– Create and destroy Auto Scaling Groups (ASGs)
– Scale instances up and down within an ASG and manage auto-scaling
– Manage launch configs and AMIs
• http://www.slideshare.net/joesondow
NAC
• Kiosk mode: no alerting
• High-level cluster status (thrift, gossip)
• Warns on a small set of metrics
Cassandra Explorer
Epic
• Netflix-wide monitoring and alerting tool based on RRD
• Priam sends all JMX data to Epic
• Very useful for finding specific issues
Dashboards
• Next level of cluster details
• Throughput
• Latency, gossip status, maintenance operations
• Trouble indicators
• Useful for finding anomalies
• Most investigations start here
Things we monitor
• Cassandra
– Throughput, latency, compactions, repairs (polled over JMX; see the sketch after this list)
– Pending threads, dropped operations
– Backup failures
– Recent restarts
– Schema changes
• System
– Disk space, disk throughput, load average
• Errors and exceptions in Cassandra, System and Tomcat log files
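Most of those Cassandra counters are exposed over JMX, the same channel Priam forwards to Epic. A hedged polling sketch; the MBean name follows the Cassandra 0.8/1.0-era JMX layout, and 7199 is the default JMX port:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraJmxPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Pending compactions: one of the "trouble indicators" the dashboards watch
            Object pending = mbs.getAttribute(
                new ObjectName("org.apache.cassandra.db:type=CompactionManager"),
                "PendingTasks");
            System.out.println("Pending compactions: " + pending);
        }
    }
}
```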
Cassandra AWS Pain Points
• Compactions cause spikes, especially on read-heavy systems
– Affects clients (Hector, Astyanax)
– Throttling in newer Cassandra versions helps
• Repairs are toxic to performance
• Disk performance on cloud instances and its impact on SSTable count
• Memory requirements due to filesystem cache
• Compression unusable in our environment
• Multi-tenancy makes performance unpredictable
• Java heap size and OOM issues
Lessons learned
• In EC2 it is best to choose instances that are not multi-tenant
• Better to compact on our terms and not Cassandra's: take nodes out of service for major compactions
• Size memtable flushes to optimize compactions
– Helps when writes are uniformly distributed, making flush patterns easier to determine
– Best to trigger flushes based on memtable size, not time
– Makes minor compactions smoother
Lessons Learned (cont)
• Key and row caches
– Left unbounded, they can chew up JVM memory needed for normal work
– Latencies will spike as the JVM has to fight for memory
– Off-heap row cache still maintains data structures on-heap
• mmap() as in-memory cache
– When the process is terminated, mmap pages are added to the free list
Lessons Learned (cont)
• Sharding (a sketch of the idea follows)
– If a single row has many gets/mutates, the nodes holding it will become hot spots
– If a row grows too large, it won't fit into memory
• Problem for reads, compactions, and repairs
• Some of our indices ran afoul of this
• For more info see Jason Brown's slides, Cassandra from the trenches: slideshare.net/netflix
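A minimal, hypothetical sketch of the usual sharding remedy (not Netflix's actual scheme): suffix the logical key with a deterministic shard so one hot or oversized row is spread over several physical rows, and therefore several nodes:

```java
import java.util.ArrayList;
import java.util.List;

public final class RowSharder {
    private final int shardCount;

    public RowSharder(int shardCount) { this.shardCount = shardCount; }

    // A given column always lands in the same physical row, so point reads
    // and mutations touch exactly one shard
    public String physicalKey(String logicalKey, String columnName) {
        int shard = (columnName.hashCode() & 0x7fffffff) % shardCount;
        return logicalKey + ":" + shard;
    }

    // Reading the whole logical row means fanning out across every shard
    public List<String> allShards(String logicalKey) {
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i < shardCount; i++) {
            keys.add(logicalKey + ":" + i);
        }
        return keys;
    }
}
```

The trade-off: hot-spot and row-size pressure drop by roughly the shard count, while full-row reads become multi-key queries.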
Scalability
Scalability Testing
• Cloud-based testing: frictionless, elastic
– Create/destroy any sized cluster in minutes
– Many test scenarios run in parallel
• Test scenarios
– Internal app-specific tests
– Simple "stress" tool provided with Cassandra
• Scale test: keep making the cluster bigger
– Check that tooling and automation works…
– How many ten-column row writes/sec can we do?
Scale-Up Linearity
[Chart: Client Writes/s by node count, Replication Factor = 3. Y axis: transactions per second, 0 to 1,200,000; X axis: EC2 instances, 0 to 350. Measured points: 174,373 w/s at 48 nodes; 366,828 at 96; 537,172 at 144; 1,099,837 at 288]
Measured at the Cassandra Server
[Chart: Cassandra writes per second vs. elapsed time in seconds; throughput 3.3 million writes/sec]
[Chart: Cassandra response time vs. elapsed time in seconds; response time 0.014 ms]
Per Node Activity
Per Node 48 Nodes 96 Nodes 144 Nodes 288 Nodes
Per Server Writes/s 10,900 w/s 11,460 w/s 11,900 w/s 11,456 w/s
Mean Server Latency 0.0117 ms 0.0134 ms 0.0148 ms 0.0139 ms
Mean CPU %Busy 74.4 % 75.4 % 72.5 % 81.5 %
Disk Read 5,600 KB/s 4,590 KB/s 4,060 KB/s 4,280 KB/s
Disk Write 12,800 KB/s 11,590 KB/s 10,380 KB/s 10,080 KB/s
Network Read 22,460 KB/s 23,610 KB/s 21,390 KB/s 23,640 KB/s
Network Write 18,600 KB/s 19,600 KB/s 17,810 KB/s 19,770 KB/s
Node specification – Xen virtual images, AWS US East, three zones
• Cassandra 0.8.6, CentOS, Sun JDK 6
• AWS EC2 m1 Extra Large – standard price $0.68/hour
• 15 GB RAM, 4 cores, 1 Gbit network
• 4 internal disks (total 1.6 TB, striped together with md, XFS)
Time is Money
                       48 nodes     96 nodes     144 nodes    288 nodes
Writes Capacity        174,373 w/s  366,828 w/s  537,172 w/s  1,099,837 w/s
Storage Capacity       12.8 TB      25.6 TB      38.4 TB      76.8 TB
Nodes Cost/hr          $32.64       $65.28       $97.92       $195.84
Test Driver Instances  10           20           30           60
Test Driver Cost/hr    $20.00       $40.00       $60.00       $120.00
Cross AZ Traffic ¹     5 TB/hr      10 TB/hr     15 TB/hr     30 TB/hr
Traffic Cost/10min     $8.33        $16.66       $25.00       $50.00
Setup Duration         15 minutes   22 minutes   31 minutes   66 minutes ²
AWS Billed Duration    1 hr         1 hr         1 hr         2 hr
Total Test Cost        $60.97       $121.94      $182.92      $561.68
¹ Estimated at two thirds of total network traffic
² Workaround for a tooling bug slowed setup
Open Source
Open Source @ Netflix
• Source at http://netflix.github.com
• Binaries at Maven: https://issues.sonatype.org/browse/OSSRH-2116
Cassandra JMeter Plugin
• Netflix uses JMeter across the fleet for load testing
• The JMeter plugin provides a wide range of samplers for Get, Put, Delete and schema creation
• Used extensively to load data, run Cassandra stress tests, feature testing, etc.
• Described at https://github.com/Netflix/CassJMeter/wiki
Astyanax
Available at http://github.com/netflix
• Cassandra Java client
• API abstraction on top of the Thrift protocol
• "Fixed" connection pool abstraction (vs. Hector)
– Round robin with failover
– Retryable operations not tied to a connection
– Netflix PaaS Discovery service integration
– Host reconnect (fixed interval or exponential backoff)
– Token aware to save a network hop: lower latency
– Latency aware to avoid compacting/repairing nodes: lower variance
• Simplified use of serializers via method overloading (vs. Hector); see the read sketch below
• ConnectionPoolMonitor interface
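A hedged read sketch showing the serializer-typed column family definition those bullets refer to; the keyspace, column family and column names are illustrative:

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.OperationResult;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

public class BookmarkReader {
    // Row key and column name types are pinned once by the serializers
    private static final ColumnFamily<String, String> CF_BOOKMARKS =
        new ColumnFamily<String, String>("bookmarks",
            StringSerializer.get(),    // row key serializer
            StringSerializer.get());   // column name serializer

    public static String lastBookmark(Keyspace keyspace, String customerId)
            throws ConnectionException {
        OperationResult<ColumnList<String>> result = keyspace
            .prepareQuery(CF_BOOKMARKS)   // token aware by configuration
            .getKey(customerId)
            .execute();
        // getStringValue() is one of the overloaded typed getters;
        // getColumnByName() returns null for a missing column, so real code
        // should check before dereferencing
        return result.getResult()
                     .getColumnByName("last_position")
                     .getStringValue();
    }
}
```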