Date posted: 05-Dec-2014
Category: Technology
Uploaded by: planet-cassandra
CASSANDRA AT INSTAGRAM
Rick Branson, Infrastructure Engineer
@rbranson
2013 Cassandra Summit
#cassandra13
June 12, 2013
San Francisco, CA
September 2012
Redis fillin' up.
What sucks?
THE OBVIOUS
Memory is expensive.
LESS OBVIOUS:
In-memory "degrades" poorly
• Flat namespace. What's in there?
• Heap fragmentation
• Single threaded
BGSAVE
The Data
• Boils down to centralized logging
• VERY high skew of writes to reads (1,000:1)
• Ever growing data set
• Durability highly valued
The Setup
• Cassandra 1.1
• 3 EC2 m1.xlarge (2-core, 15GB RAM)
• RAIDed ephemerals (1.6TB of SATA)
• RF=3
• 6GB Heap, 200MB NewSize
• HSHA
It worked. Mostly.
The horrible/cool thing about Chef...
commit a1489a34d2aa69316b010146ab5254895f7b9141
Author: Rick Branson
Date: Thu Oct 18 20:05:16 2012 -0700
Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake
commit 41c96f3243a902dd6af4ea29ef6097351a16494a
Author: Rick Branson
Date: Tue Oct 30 17:12:00 2012 -0700
Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+
November 2012
Doubled to 6 nodes.
18,000 connections. Spread those more evenly.
commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800
Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.
commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
Author: Rick Branson
Date: Mon Dec 24 12:41:13 2012 -0800
Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap
1.2.1.
It went well.
Well... until...
commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800
Switch Cassandra from tokens to vnodes
commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800
We aren't ready for vnodes yet guys
TAKEAWAY
Let stupid/enterprising, experienced operators that will submit patches take the first few bullets on brand-new major versions.
commit acb02daea57dca889c2aa45963754a271fa51566
Author: Rick Branson
Date: Sun Feb 10 20:36:34 2013 -0800
Doubled C* cluster
commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
Author: Rick Branson
Date: Thu Mar 14 16:23:18 2013 -0700
Subtract token from C*ua7 to replace the node
pycassa exceptions (last 6 months)
• 3.4TB
• Will try vnode migration again soon...
TAKEAWAY
Adopt a technology by understanding what it's best at and letting it do that first, then expand...
• Sharded Redis
• 32x68GB (m2.4xlarge)
• Space (memory) bound
• Resharding sucks
• Let's get some better availability...
user_id: [ activity, activity, ...]
user_id: [ activity, activity, ...]
Thrift Serialized Activity
Bound the Size
user_id: [ activity1, activity2, ... activity100, activity101, ...]
LTRIM <user_id> 0 99
Undo
user_id: [ activity1, activity2, activity3, ...]
LREM <user_id> 0 <activity2>
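The two Redis operations above can be sketched without a server. This is a plain-Python stand-in that only mimics the semantics of LTRIM and LREM; the use of LPUSH for inserts, the key name `user:42`, and the helper names are assumptions made for illustration, not something the slides state.

```python
# Plain-Python stand-ins for the Redis list commands on the slides.
# No Redis server is involved; these only mimic the command semantics.

def lpush(store, key, value):
    """LPUSH (assumed insert path): add value at the head of the list."""
    store.setdefault(key, []).insert(0, value)

def ltrim(store, key, start, stop):
    """LTRIM: keep only elements start..stop (inclusive, non-negative indexes)."""
    store[key] = store.get(key, [])[start:stop + 1]

def lrem_all(store, key, value):
    """LREM key 0 value: remove every element equal to value."""
    store[key] = [item for item in store.get(key, []) if item != value]

inboxes = {}
for n in range(1, 151):                      # 150 activities arrive for one user
    lpush(inboxes, "user:42", f"activity{n}")
    ltrim(inboxes, "user:42", 0, 99)         # bound the size: keep the newest 100

lrem_all(inboxes, "user:42", "activity120")  # undo one activity
```

Note that the undo has to scan the list for a matching element, which is cheap only because the list is bounded.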
C* data model
user_id: [ TimeUUID1, TimeUUID2, ..., TimeUUID101 ]
user_id: [ <activity>, <activity>, ..., <activity> ]
Bound the Size
user_id: [ TimeUUID1, TimeUUID2, ..., TimeUUID101 ]
user_id: [ <activity>, <activity>, ..., <activity> ]
get(<user_id>)
delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])
The great destroyer of systems shows up. Tombstones abound.
user_id: [ TimeUUID1, TimeUUID2, ..., TimeUUID101 ]
user_id: [ <activity>, <activity>, ..., <activity> ]
user_id: [ timestamp1, timestamp2, ..., timestamp101 ]
TimeUUID = timestamp
user_id: [ TimeUUID1, TimeUUID2, ..., TimeUUID101 ]
user_id: [ <activity>, <activity>, ..., <activity> ]
user_id: [ timestamp1, timestamp2, ..., timestamp101 ]
delete(<user_id>, timestamp=<timestamp101>)
Row Delete
Deletes any data on a row with a timestamp value equal to or less than the timestamp provided in the delete operation.
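A minimal sketch of trimming with a timestamped row delete, using a toy in-memory row rather than a real Cassandra client. The dictionary layout and the function names are hypothetical; the only rule taken from the slide is that a row delete hides everything with a timestamp equal to or less than its own.

```python
# Toy model of one row: column name -> (value, write timestamp).
# "deleted_at" plays the role of the row tombstone's timestamp.

def new_row():
    return {"columns": {}, "deleted_at": -1}

def insert(row, name, value, timestamp):
    row["columns"][name] = (value, timestamp)

def row_delete(row, timestamp):
    """Record a row tombstone; the highest timestamp seen wins."""
    row["deleted_at"] = max(row["deleted_at"], timestamp)

def live_columns(row):
    """Columns written strictly after the row tombstone."""
    return {name: value
            for name, (value, ts) in row["columns"].items()
            if ts > row["deleted_at"]}

row = new_row()
for ts in range(1, 131):                       # 130 activities, timestamp == time
    insert(row, f"TimeUUID@{ts}", f"activity{ts}", ts)

# Find the 101st newest entry and delete the row "as of" its timestamp.
newest_first = sorted(live_columns(row),
                      key=lambda name: row["columns"][name][1],
                      reverse=True)
cutoff_ts = row["columns"][newest_first[100]][1]
row_delete(row, cutoff_ts)

# A late write carrying an old timestamp stays hidden behind the tombstone.
insert(row, "late", "old activity", 10)
```

One tombstone replaces the per-column deletes from the earlier approach, which is the point of setting each column's timestamp from its TimeUUID.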
Optimizes Reads
SSTable (max_ts=100)
SSTable (max_ts=200)
SSTable (max_ts=300)
SSTable (max_ts=400)
SSTable (max_ts=500)
SSTable (max_ts=600)
SSTable (max_ts=700)
SSTable (max_ts=800)
Contains row tombstone with timestamp 350
Safely ignored using in-memory metadata
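The read optimization can be illustrated with a short sketch. It assumes each SSTable exposes only a maximum timestamp, as on the slide; `SSTableMeta` and `sstables_to_read` are hypothetical names, and the real read path involves more than this single check.

```python
from dataclasses import dataclass

@dataclass
class SSTableMeta:
    name: str
    max_ts: int   # newest write timestamp contained in the file

def sstables_to_read(sstables, row_tombstone_ts):
    """Keep only files that can hold data newer than the row tombstone."""
    return [t for t in sstables if t.max_ts > row_tombstone_ts]

tables = [SSTableMeta(f"sstable-{i}", max_ts=i * 100) for i in range(1, 9)]
needed = sstables_to_read(tables, row_tombstone_ts=350)
skipped = [t.name for t in tables if t not in needed]
```

With a tombstone at 350, the files whose max_ts is 100, 200, or 300 cannot contain anything the tombstone has not already deleted, so they never need to be touched.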
~10% of actions are undos.
Undo Support
user_id: [ TimeUUID1, TimeUUID2, ..., TimeUUID101 ]
user_id: [ <activity>, <activity>, ..., <activity> ]
get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])
Simple Race Condition
The state of the row may have changed between these two operations.
💩
Diverging Replicas
Replicas: [A, B], [A], [A, B]
Writer ("like Z"): insert B
Writer (undo "like Z"): read [A]
Outcomes shown: OK, FAIL
SuperColumn = Old/Busted
AntiColumn = New/Hotness
user_id: [ (0, <TimeUUID>), (1, <TimeUUID>), (1, <TimeUUID>) ]
user_id: [ anti-column, activity, activity ]
"Anti-Column"
Contains an MD5 hash of the activity data it is marking as deleted.
Composite Column
First component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.
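A minimal in-memory sketch of the anti-column idea, assuming a row is a mapping from composite column name to value. The constants, function names, and the sort order within each list are illustrative assumptions; only the (0 or 1, TimeUUID) naming and the MD5-of-activity payload come from the slides.

```python
import hashlib
import uuid

ANTI, ACTIVITY = 0, 1   # first component of the composite column name

def sort_key(name):
    """Order columns by (component, time), so anti-columns come first."""
    flag, time_uuid = name
    return (flag, time_uuid.time, time_uuid.bytes)

def add_activity(row, payload):
    row[(ACTIVITY, uuid.uuid1())] = payload

def undo_activity(row, payload):
    # Write-only undo: no read of the row is needed first.
    row[(ANTI, uuid.uuid1())] = hashlib.md5(payload).digest()

def read_inbox(row):
    """Return activities that are not cancelled by any anti-column."""
    deleted, visible = set(), []
    for name in sorted(row, key=sort_key):
        if name[0] == ANTI:
            deleted.add(row[name])
        elif hashlib.md5(row[name]).digest() not in deleted:
            visible.append(row[name])
    return visible

row = {}
undo_activity(row, b"like Z")      # the undo can land before the insert
add_activity(row, b"like Z")
add_activity(row, b"comment Y")
inbox = read_inbox(row)
```

Because the undo is just another insert, the order in which the two writes reach a replica no longer matters: the reader filters at read time.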
Diverging Replicas: Solved
Replicas: [A, B, C], [A, C], [A, B, C]
Writer ("like Z"): insert B
Writer (undo "like Z"): insert C
Outcomes shown: OK, FAIL, OK
TAKEAWAY
Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.
• Keep 30% "buffer" for trims.
• Undo without read. (thumbsup)
• Large lists suck for this. (thumbsdown)
• CASSANDRA-5527
Built in two days.
Experience pays.
Reusability is key to rapid rollout.
Great documentation eases concerns.
• C* 1.2.3
• vnodes, LeveledCompactionStrategy
• 12 hi1.4xlarge (8-core, 60GB, SSD)
• 3 AZs, RF=3, W=2, R=1
• 8GB heap, 800MB NewSize
Rollout
1. Dial up Double Writes
2. Test with "Shadow" Reads
3. Dial up "Real" Reads
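The three rollout steps can be sketched as percentage dials. This is an illustrative model only: the `Rollout` class, the CRC32 bucketing, and the dictionary-backed stores are assumptions, not the mechanism described in the talk.

```python
import zlib

def bucket(user_id):
    """Map a user to a stable bucket in 0..99 so dial changes are sticky."""
    return zlib.crc32(str(user_id).encode("utf-8")) % 100

class Rollout:
    def __init__(self, double_write_pct=0, shadow_read_pct=0, real_read_pct=0):
        self.double_write_pct = double_write_pct
        self.shadow_read_pct = shadow_read_pct
        self.real_read_pct = real_read_pct
        self.mismatches = 0

    def write(self, old, new, user_id, activity):
        old.setdefault(user_id, []).append(activity)       # old store always written
        if bucket(user_id) < self.double_write_pct:        # step 1: double writes
            new.setdefault(user_id, []).append(activity)

    def read(self, old, new, user_id):
        b = bucket(user_id)
        if b < self.real_read_pct:                         # step 3: real reads
            return new.get(user_id, [])
        result = old.get(user_id, [])
        if b < self.shadow_read_pct and new.get(user_id, []) != result:
            self.mismatches += 1                           # step 2: compare, serve old
        return result

old_store, new_store = {}, {}
rollout = Rollout(double_write_pct=100, shadow_read_pct=100)
for uid in range(1000):
    rollout.write(old_store, new_store, uid, "like")
    rollout.read(old_store, new_store, uid)
```

Shadow reads exercise the new store under production load while users still see results from the old one, so mismatches can be counted before any real traffic depends on it.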
commit 1c3d99a9e337f9383b093009dba074b8ade20768
Author: Rick Branson
Date: Mon May 6 14:58:54 2013 -0700
Bump C* inbox heap size 8G -> 10G, seeing heap pressure
Bootstrapping sucked because compacting 10,000 SSTables takes forever.
sstable_size_in_mb: 5 => 25
Came in on Monday: one of the nodes was unable to flush and had built up 8,000+ commit log segments.
"Normal" Rebuild Process
1. /etc/init.d/cassandra stop
2. mv /data/cassandra /data/cassandra.old
3. /etc/init.d/cassandra start
For "non-vnode" clusters, best practice is to set the initial_token in cassandra.yaml.
For vnode clusters, multiple tokens are selected randomly when a node is bootstrapped.
IP address is effectively the "primary key" for nodes in a ring.
What had happened was.
1. Rebuilding node generated entirely new tokens and joined cluster.
2. Rest of cluster dropped the previously stored token data associated with the rebuilding node's IP address.
3. Token ranges shifted massively.
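A toy simulation of why fresh random tokens move ownership around. It models only the primary owner of each key on a small token ring; the ring size, tokens per node, node names, and seed are arbitrary assumptions, and replication is ignored.

```python
import bisect
import random

RING_SIZE = 2 ** 32        # toy token space, not Cassandra's real range
TOKENS_PER_NODE = 16       # illustrative vnode count

def random_tokens(rng):
    return [rng.randrange(RING_SIZE) for _ in range(TOKENS_PER_NODE)]

def build_ring(node_tokens):
    """Sorted list of (token, node) pairs."""
    return sorted((token, node)
                  for node, tokens in node_tokens.items()
                  for token in tokens)

def owners(ring, key_tokens):
    """Primary owner of each key: the node holding the next token clockwise."""
    tokens = [token for token, _ in ring]
    return [ring[bisect.bisect_left(tokens, k) % len(ring)][1]
            for k in key_tokens]

rng = random.Random(0)
node_tokens = {f"node{i}": random_tokens(rng) for i in range(1, 7)}
keys = [rng.randrange(RING_SIZE) for _ in range(10_000)]

before = owners(build_ring(node_tokens), keys)
node_tokens["node3"] = random_tokens(rng)   # rebuilt node picks brand-new tokens
after = owners(build_ring(node_tokens), keys)

moved = [(b, a) for b, a in zip(before, after) if b != a]
print(f"{len(moved) / len(keys):.1%} of keys changed primary owner")
```

Every moved key either left the rebuilt node or arrived on it, and the arrivals are ranges for which the freshly wiped node holds no data.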
UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 1.0;
stats.inbox.empty
Kicked off "nodetool repair" and waited... and waited...
LeveledCompactionStrategy + vnodes = tragedy.
kill -3 <cassandra>
"AntiEntropyStage:1"
  java.lang.Thread.State: RUNNABLE
  <...>
  at io.sstable.SSTableReader.decodeKey(SSTableReader.java:1014)
  at io.sstable.SSTableReader.getPosition(SSTableReader.java:802)
  at io.sstable.SSTableReader.getPosition(SSTableReader.java:717)
  at io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:664)
  at streaming.StreamOut.createPendingFiles(StreamOut.java:155)
  at streaming.StreamOut.transferSSTables(StreamOut.java:140)
  at streaming.StreamingRepairTask.initiateStreaming(StreamingRepairTask.java:133)
  at streaming.StreamingRepairTask.run(StreamingRepairTask.java:115)
  <...>
Every repair task was scanning every SSTable file to find ranges to repair.
Scan all the things.
• Standard Compaction: Only a few dozen SSTables.
• Non-VNodes: Repair is done once per token, and there is only one token.
~20X increase in repair performance.
TAKEAWAY
If you want to use VNodes and LeveledCompactionStrategy, wait until the 1.2.6 release when CASSANDRA-5569 is merged in.
Where were we?
It was a bad thing to not know data was inconsistent until we saw an increase in user reported problems.
CASSANDRA-5618
$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name Active Pending Completed
Commands n/a 0 1837765727
Responses n/a 1 1750784545
UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01;
99.63% consistent
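The consistency figure appears to follow from the read repair statistics in the netstats output; that derivation is an assumption here, since the slides do not spell it out. A quick check of the arithmetic:

```python
# Values copied from the "Read Repair Statistics" section of nodetool netstats.
attempted = 3192520
mismatch_blocking = 0
mismatch_background = 11584

# Assumed definition: share of attempted read repairs that found no mismatch.
mismatches = mismatch_blocking + mismatch_background
consistent = 1 - mismatches / attempted
print(f"{consistent:.4%} of read repairs found replicas in agreement")
```

This comes out at roughly 99.6%, in line with the slide.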
TAKEAWAY
The way to rebuild a box in a vnode cluster is to build a brand new node, then remove the old one with "nodetool removenode."
Fetch & Deserialize Time (measured from app)
Mean vs P90 (ms), trough-to-peak
Column Family: InboxActivitiesByUserID
SSTable count: 3264
SSTables in each level: [1, 10, 105/100, 1053/1000, 2095, 0, 0]
Space used (live): 80114509324
Space used (total): 80444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020
Thank you!
We're hiring!