Cassandra in the Netflix Architecture
Cassandra EU, London, March 28th, 2012
Denis Sheahan
Agenda
• Netflix and the Cloud
• Why Cassandra
• Cassandra Deployment @ Netflix
• Cassandra Operations
• Cassandra Lessons Learned
• Scalability
• Open Source
Netflix and the Cloud
With more than 23 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV series.
Netflix.com is almost 100% deployed on the Amazon Cloud
Source: http://ir.netflix.com
Out-Growing the Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
[Chart: 37x growth in API requests from Jan 2010 to Jan 2011, far outstripping datacenter capacity]
Netflix Deployed on AWS
[Diagram: Netflix systems on AWS]
• Content: Video Masters, EC2, S3, CDNs
• Logs: S3, EMR Hadoop, Hive, Business Intelligence
• Play: DRM, CDN routing, Bookmarks, Logging
• WWW: Sign-Up, Search, Movie Choosing, Ratings
• API: Metadata, Device Config, TV Movie Choosing, Social Facebook
• CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
Netflix AWS Cost
Why Cassandra
Distributed Key-Value Store vs Central SQL Database
• Datacenter had a central database
• Schema changes required downtime
• The cloud, in contrast, has many key-value data stores
– Joins take place in Java code
– No schema to change, no scheduled downtime
Goals for moving from the Netflix DC to the Cloud
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
• Available
– Substantially higher robustness and availability than datacenter services
• Productive
– Optimize agility of a large development team with automation and tools
Cassandra
• Faster
– Low latency, low latency variance
• Scalable
– Supports running on Amazon EC2
– High and scalable read and write throughput
– Support for multi-region clusters
• Available
– We value Availability over Consistency; Cassandra is Eventually Consistent
– Supports Amazon Availability Zones
– Data integrity checks and repairs
– Online snapshot backup, restore/rollback
• Productive
– We want FOSS + support
Cassandra Deployment
Netflix Cassandra Use Cases
• Many different profiles
– Read-heavy environments with a strict SLA
– Batch-write environments (70 rows per batch) also serving low-latency reads
– Read-modify-write environments with large rows
– Write-only environments with rapidly increasing data sets
– And many more…
How much we use Cassandra
30       Number of production clusters
12       Number of multi-region clusters
3        Max regions, one cluster
65       Total TB of data across all clusters
472      Number of Cassandra nodes
72/28    Largest Cassandra cluster (nodes / data in TB)
6k/250k  Max reads/writes per second
Deployment Architecture
[Diagram: AWS deployment. Front End Load Balancer → API Proxy → Load Balancer → API (registered with the Discovery Service) → Component Services with memcached, backed by Cassandra on EC2 internal disks, with backups to S3]
High Availability Deployment
• Fact of life: EC2 instances die
• We store 3 local replicas in 3 different Cassandra nodes
– One replica per EC2 Availability Zone (AZ)
• Minimum cluster configuration is 6 nodes, 2 per AZ
– A single instance failure still leaves at least one node in each AZ
• Use Local Quorum for writes
• Use Consistency Level One for reads (configuration sketched below)
• Entire cluster replicated in multi-region deployments
• AWS Availability Zones
– Separate buildings
– Separate power etc.
– Fairly close together
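A minimal configuration sketch of these consistency choices, written against the Astyanax client described later in this deck. The cluster, keyspace, pool and seed names are placeholders, not Netflix's:

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class HaKeyspaceFactory {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("demo_cluster")        // placeholder
            .forKeyspace("demo_keyspace")      // placeholder
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                // Writes commit on 2 of the 3 zone replicas before returning
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                // Reads are served by any single replica for lowest latency
                .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_ONE))
            .withConnectionPoolConfiguration(
                new ConnectionPoolConfigurationImpl("demo_pool")
                    .setPort(9160)
                    .setMaxConnsPerHost(3)
                    .setSeeds("127.0.0.1:9160"))   // placeholder seed list
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getEntity();  // Astyanax 1.x; later versions use getClient()
    }
}
```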
Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zones, Token Aware
[Diagram: token-aware clients writing to six Cassandra nodes (with local disks), two in each of Zones A, B and C]
• Client writes to nodes and zones (a code sketch of one such write follows this slide)
• Nodes return ack to client
• Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
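To make the flow concrete, a hedged sketch of one token-aware write through Astyanax; the column family, row key and column are invented for the example:

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

public class BookmarkWriter {
    // Hypothetical column family: row key = customer id, column = bookmark name
    private static final ColumnFamily<String, String> CF_BOOKMARKS =
        new ColumnFamily<String, String>("bookmarks",
            StringSerializer.get(), StringSerializer.get());

    public static void writeBookmark(Keyspace keyspace) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        // Token awareness routes this straight to a replica that owns the row key
        m.withRow(CF_BOOKMARKS, "customer-1234")
         .putColumn("last_position", "s01e03@00:21:45", null);  // null = no TTL
        // With CL_LOCAL_QUORUM this returns once 2 of 3 zone replicas ack;
        // SSTable flushes and compactions happen later, asynchronously
        m.execute();
    }
}
```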
Extending to Multi-Region
In production for UK/Ireland support
• Create cluster in EU
• Backup US cluster to S3
• Restore backup in EU
• Local repair EU cluster
• Global repair/join
[Diagram: US clients writing to a six-node US cluster and EU clients to a six-node EU cluster, two Cassandra nodes per Zone A/B/C in each region, with the backup shipped between regions via S3]
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
• Client writes to local replicas
• Local write acks returned to client, which continues when 2 of 3 local nodes are committed
• Local coordinator writes to remote coordinator
• When data arrives, remote coordinator node acks
• Remote coordinator sends data to other remote zones
• Remote nodes ack to local coordinator
• Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
[Diagram: the same US (Local) and EU (Remote) six-node clusters, showing the local quorum write in the US region fanning out through the remote coordinator to the EU zones]
Priam – Cassandra Automation
Available at http://github.com/netflix
• Open source Tomcat code running as a sidecar on each Cassandra node; deployed as a separate rpm
• Zero-touch auto-configuration
• Token allocation and assignment, including multi-region (sketched below)
• Broken node replacement and ring expansion
• Full and incremental backup/restore to/from S3
• Metrics collection and forwarding via JMX
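A toy sketch of the evenly-spaced token math behind that allocation, for the RandomPartitioner's 0 to 2^127 ring. Priam's real implementation also offsets tokens per region so multi-region rings interleave; that detail is omitted here:

```java
import java.math.BigInteger;

public class TokenAllocator {
    // RandomPartitioner token space: 0 .. 2^127
    private static final BigInteger RING_SIZE = BigInteger.valueOf(2).pow(127);

    // Evenly space nodeCount tokens around the ring
    public static BigInteger tokenFor(int position, int nodeCount) {
        return RING_SIZE.multiply(BigInteger.valueOf(position))
                        .divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        int nodes = 6;  // the minimum cluster above: 2 nodes in each of 3 AZs
        for (int i = 0; i < nodes; i++) {
            System.out.println("node " + i + " initial_token: " + tokenFor(i, nodes));
        }
    }
}
```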
Cassandra Backup/Restore
[Diagram: every Cassandra node in the ring backs up its SSTables to S3]
• Full Backup
– Time-based snapshot
– SSTables compressed and copied to S3
• Incremental
– An SSTable write triggers a compressed copy to S3 (see the sketch below)
• Archive
– Copy cross-region
• Restore
– Full restore, or create a new ring from a backup
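A heavily hedged sketch of the incremental path: watch the directory where Cassandra hard-links freshly flushed SSTables (with incremental backups enabled in cassandra.yaml) and push new files to S3. Priam's real implementation compresses, batches and retries; the bucket name and data path here are made up:

```java
import java.nio.file.*;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class IncrementalBackupWatcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical keyspace data directory; Cassandra hard-links each
        // flushed SSTable into its backups/ subdirectory
        Path backups = Paths.get("/var/lib/cassandra/data/demo_ks/backups");
        AmazonS3 s3 = new AmazonS3Client();  // credentials from the environment
        WatchService watcher = FileSystems.getDefault().newWatchService();
        backups.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();  // block until a new SSTable appears
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = backups.resolve((Path) event.context());
                // Real code would compress before upload, as the slide notes
                s3.putObject("demo-backup-bucket",
                             "sstables/" + file.getFileName(), file.toFile());
            }
            key.reset();
        }
    }
}
```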
Cassandra Operations
Consoles, Monitors and Explorers
• Netflix Application Console (NAC)
– Primary AWS provisioning/config interface
• Epic counters
– Primary method for issue resolution
• Dashboards
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• AWS Usage Analyzer
– Breaks down costs by application and resource
Cloud Deployment Model
[Diagram: an Elastic Load Balancer in front of an Auto Scaling Group; the group's Launch Configuration ties a Security Group and an Amazon Machine Image to the running Instances]
NAC
• Netflix Application Console (NAC) is Netflix's primary tool, in both Prod and Test, to:
– Create and destroy applications
– Create and destroy Auto Scaling Groups (ASGs)
– Scale instances up and down within an ASG and manage auto-scaling
– Manage launch configs and AMIs
• http://www.slideshare.net/joesondow
NAC
• Kiosk mode: no alerting
• High-level cluster status (thrift, gossip)
• Warns on a small set of metrics
Cassandra Explorer
Epic
• Netflix-wide monitoring and alerting tool based on RRD
• Priam sends all JMX data to Epic
• Very useful for finding specific issues
Dashboards
• Next level of cluster details
• Throughput
• Latency, gossip status, maintenance operations
• Trouble indicators
• Useful for finding anomalies
• Most investigations start here
Things we monitor
• Cassandra
– Throughput, latency, compactions, repairs (polled over JMX; see the sketch after this list)
– Pending threads, dropped operations
– Backup failures
– Recent restarts
– Schema changes
• System
– Disk space, disk throughput, load average
• Errors and exceptions in Cassandra, System and Tomcat log files
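Most of those Cassandra counters are exposed over JMX, the same channel Priam forwards to Epic. A hedged polling sketch; the MBean name follows the Cassandra 0.8/1.0-era JMX layout, and 7199 is the default JMX port:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraJmxPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Pending compactions: one of the "trouble indicators" the dashboards watch
            Object pending = mbs.getAttribute(
                new ObjectName("org.apache.cassandra.db:type=CompactionManager"),
                "PendingTasks");
            System.out.println("Pending compactions: " + pending);
        }
    }
}
```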
Cassandra AWS Pain Points
• Compactions cause spikes, especially on read-heavy systems
– Affects clients (Hector, Astyanax)
– Throttling in newer Cassandra versions helps
• Repairs are toxic to performance
• Disk performance on cloud instances and its impact on SSTable count
• Memory requirements due to filesystem cache
• Compression unusable in our environment
• Multi-tenancy makes performance unpredictable
• Java heap size and OOM issues
Lessons learned
• In EC2 it is best to choose instances that are not multi-tenant
• Better to compact on our terms and not Cassandra's: take nodes out of service for major compactions
• Size memtable flushes to optimize compactions
– Helps when writes are uniformly distributed, making flush patterns easier to determine
– Best to trigger flushes based on memtable size, not time
– Makes minor compactions smoother
Lessons Learned (cont)
• Key and row caches
– Left unbounded, they can chew up JVM memory needed for normal work
– Latencies will spike as the JVM has to fight for memory
– Off-heap row cache still maintains data structures on-heap
• mmap() as in-memory cache
– When the process is terminated, mmap pages are added to the free list
Lessons Learned (cont)
• Sharding (a sketch of the idea follows)
– If a single row has many gets/mutates, the nodes holding it will become hot spots
– If a row grows too large, it won't fit into memory
• Problem for reads, compactions, and repairs
• Some of our indices ran afoul of this
• For more info see Jason Brown's slides, Cassandra from the trenches: slideshare.net/netflix
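A minimal, hypothetical sketch of the usual sharding remedy (not Netflix's actual scheme): suffix the logical key with a deterministic shard so one hot or oversized row is spread over several physical rows, and therefore several nodes:

```java
import java.util.ArrayList;
import java.util.List;

public final class RowSharder {
    private final int shardCount;

    public RowSharder(int shardCount) { this.shardCount = shardCount; }

    // A given column always lands in the same physical row, so point reads
    // and mutations touch exactly one shard
    public String physicalKey(String logicalKey, String columnName) {
        int shard = (columnName.hashCode() & 0x7fffffff) % shardCount;
        return logicalKey + ":" + shard;
    }

    // Reading the whole logical row means fanning out across every shard
    public List<String> allShards(String logicalKey) {
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i < shardCount; i++) {
            keys.add(logicalKey + ":" + i);
        }
        return keys;
    }
}
```

The trade-off: hot-spot and row-size pressure drop by roughly the shard count, while full-row reads become multi-key queries.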
Scalability
Scalability Testing
• Cloud-based testing: frictionless, elastic
– Create/destroy any sized cluster in minutes
– Many test scenarios run in parallel
• Test scenarios
– Internal app-specific tests
– Simple "stress" tool provided with Cassandra
• Scale test: keep making the cluster bigger
– Check that tooling and automation works…
– How many ten-column row writes/sec can we do?
Scale-Up Linearity
[Chart: Client Writes/s by node count, Replication Factor = 3. Y axis: transactions per second, 0 to 1,200,000; X axis: EC2 instances, 0 to 350. Measured points: 174,373 w/s at 48 nodes; 366,828 at 96; 537,172 at 144; 1,099,837 at 288]
Measured at the Cassandra Server
[Chart: Cassandra writes per second vs. elapsed time in seconds; throughput 3.3 million writes/sec]
[Chart: Cassandra response time vs. elapsed time in seconds; response time 0.014 ms]
Per Node Activity
Per Node 48 Nodes 96 Nodes 144 Nodes 288 Nodes
Per Server Writes/s 10,900 w/s 11,460 w/s 11,900 w/s 11,456 w/s
Mean Server Latency 0.0117 ms 0.0134 ms 0.0148 ms 0.0139 ms
Mean CPU %Busy 74.4 % 75.4 % 72.5 % 81.5 %
Disk Read 5,600 KB/s 4,590 KB/s 4,060 KB/s 4,280 KB/s
Disk Write 12,800 KB/s 11,590 KB/s 10,380 KB/s 10,080 KB/s
Network Read 22,460 KB/s 23,610 KB/s 21,390 KB/s 23,640 KB/s
Network Write 18,600 KB/s 19,600 KB/s 17,810 KB/s 19,770 KB/s
Node specification – Xen virtual images, AWS US East, three zones
• Cassandra 0.8.6, CentOS, Sun JDK 6
• AWS EC2 m1 Extra Large – standard price $0.68/hour
• 15 GB RAM, 4 cores, 1 Gbit network
• 4 internal disks (total 1.6 TB, striped together with md, XFS)
Time is Money
                       48 nodes     96 nodes     144 nodes    288 nodes
Writes Capacity        174,373 w/s  366,828 w/s  537,172 w/s  1,099,837 w/s
Storage Capacity       12.8 TB      25.6 TB      38.4 TB      76.8 TB
Nodes Cost/hr          $32.64       $65.28       $97.92       $195.84
Test Driver Instances  10           20           30           60
Test Driver Cost/hr    $20.00       $40.00       $60.00       $120.00
Cross AZ Traffic ¹     5 TB/hr      10 TB/hr     15 TB/hr     30 TB/hr
Traffic Cost/10min     $8.33        $16.66       $25.00       $50.00
Setup Duration         15 minutes   22 minutes   31 minutes   66 minutes ²
AWS Billed Duration    1 hr         1 hr         1 hr         2 hr
Total Test Cost        $60.97       $121.94      $182.92      $561.68
¹ Estimated at two thirds of total network traffic
² Workaround for a tooling bug slowed setup
Open Source
Open Source @ Netflix
• Source at http://netflix.github.com
• Binaries at Maven: https://issues.sonatype.org/browse/OSSRH-2116
Cassandra JMeter Plugin
• Netflix uses JMeter across the fleet for load testing
• The JMeter plugin provides a wide range of samplers for Get, Put, Delete and schema creation
• Used extensively to load data, run Cassandra stress tests, feature testing, etc.
• Described at https://github.com/Netflix/CassJMeter/wiki
Astyanax
Available at http://github.com/netflix
• Cassandra Java client
• API abstraction on top of the Thrift protocol
• "Fixed" connection pool abstraction (vs. Hector)
– Round robin with failover
– Retryable operations not tied to a connection
– Netflix PaaS Discovery service integration
– Host reconnect (fixed interval or exponential backoff)
– Token aware to save a network hop: lower latency
– Latency aware to avoid compacting/repairing nodes: lower variance
• Simplified use of serializers via method overloading (vs. Hector); see the read sketch below
• ConnectionPoolMonitor interface
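A hedged read sketch showing the serializer-typed column family definition those bullets refer to; the keyspace, column family and column names are illustrative:

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.OperationResult;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

public class BookmarkReader {
    // Row key and column name types are pinned once by the serializers
    private static final ColumnFamily<String, String> CF_BOOKMARKS =
        new ColumnFamily<String, String>("bookmarks",
            StringSerializer.get(),    // row key serializer
            StringSerializer.get());   // column name serializer

    public static String lastBookmark(Keyspace keyspace, String customerId)
            throws ConnectionException {
        OperationResult<ColumnList<String>> result = keyspace
            .prepareQuery(CF_BOOKMARKS)   // token aware by configuration
            .getKey(customerId)
            .execute();
        // getStringValue() is one of the overloaded typed getters;
        // getColumnByName() returns null for a missing column, so real code
        // should check before dereferencing
        return result.getResult()
                     .getColumnByName("last_position")
                     .getStringValue();
    }
}
```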