Date post: | 02-Jul-2015 |
Category: |
Software |
Upload: | tomer-gabel |
Put Your Thinking CAP On
Tomer Gabel, Wix
JDay Lviv, 2015
Credits
Originally a talk by
Yoav Abrahami (Wix)
Based on “Call Me Maybe” by
Kyle “Aphyr” Kingsbury
Brewer’s CAP Theorem
Partition Tolerance
Consistency
Availability
By Example
• I want this book!
– I add it to the cart
– Then continue browsing
• There’s only one copy in stock!
• … and someone else just bought it.
Consistency
Consistency: Defined
• In a consistent system:
All participants see the same value at the same time
• “Do you have this book in stock?”
Consistency: Defined
• If our book store is an inconsistent system:
– Two customers may buy the book
– But there’s only one item in inventory!
• We’ve just violated a business constraint.
Availability
Availability: Defined
• An available system:
– Is reachable
– Responds to requests (within SLA)
• Availability does not guarantee success!
– The operation may fail
– “This book is no longer available”
Availability: Defined
• What if the system is unavailable?
– I complete the checkout
– And click on “Pay”
– And wait
– And wait some more
– And…
• Did I purchase the book or not?!
Partition Tolerance
Partition Tolerance: Defined
• Partition: one or more nodes are unreachable
• No practical system runs on a single node
• So all systems are susceptible!
[Diagram: five nodes, A through E]
“The Network is Reliable”
• Messages can be dropped, delayed, duplicated or reordered; all four happen in an IP network
• To a client, delays and drops are the same
• Perfect failure detection is provably impossible¹!
[Diagram: messages between A and B over time, illustrating drop, delay, duplicate and reorder]
1 “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson
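To make "delays and drops are the same" concrete, here is a minimal sketch of a timeout-based failure detector. The class name, the one-second timeout and the explicit `now` parameter are assumptions chosen for illustration; real detectors are more elaborate, but they face the same ambiguity.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutFailureDetector:
    """Suspects a peer once no heartbeat has arrived within `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time at which a heartbeat from `peer` arrived.
        self.last_seen[peer] = now

    def suspects(self, peer: str, now: float) -> bool:
        # Suspect a peer we never heard from, or one that has been silent too long.
        last = self.last_seen.get(peer)
        return last is None or now - last > self.timeout


detector = TimeoutFailureDetector(timeout=1.0)
detector.heartbeat("B", now=0.0)   # B is healthy, but its next heartbeat is delayed
detector.heartbeat("C", now=0.0)   # C crashes right after this heartbeat

# At t=1.5 both peers look identical: "slow" cannot be told apart from "dead".
slow_suspected = detector.suspects("B", now=1.5)
dead_suspected = detector.suspects("C", now=1.5)

detector.heartbeat("B", now=2.0)   # B's delayed heartbeat finally arrives
recovered = not detector.suspects("B", now=2.5)
```

Raising the timeout reduces false suspicions of slow peers but delays the detection of real failures; no setting removes the ambiguity.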
Partition Tolerance: Reified
• External causes:
– Bad network config
– Faulty equipment
– Scheduled maintenance
• Even software causes partitions:
– Bad network config.
– GC pauses
– Overloaded servers
• Plenty of war stories!
– Netflix
– Twilio
– GitHub
– Wix :-)
• Some hard numbers¹:
– 5.2 failed devices/day
– 59K lost packets/day
– Adding redundancy reduces the median impact of failures by only 40%
1 “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al
“Proving” CAP
In Pictures
• Let’s consider a simple system:
– Service A writes values
– Service B reads values
– Values are replicated between nodes
• These are “ideal” systems
– Bug-free, predictable
[Diagram: A writes to Node 1, B reads from Node 2; both nodes hold V0]
In Pictures
• “Sunny day scenario”:
– A writes a new value V1
– The value is replicated to node 2
– B reads the new value
[Diagram: A writes V1 to Node 1, V1 is replicated to Node 2, B reads V1]
In Pictures
• What happens if the network drops?
– A writes a new value V1
– Replication fails
– B still sees the old value
– The system is inconsistent
[Diagram: A writes V1 to Node 1, replication to Node 2 fails, B still reads V0]
In Pictures
• Possible mitigation is synchronous replication
– A writes a new value V1
– Cannot replicate, so write is rejected
– Both A and B still see V0
– The system is logically unavailable
[Diagram: A’s write of V1 is rejected; Node 1 and Node 2 both still hold V0]
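The two partition scenarios above can be sketched in a few lines. This is a toy model under stated assumptions: `ReplicatedPair`, its `synchronous` flag and its `partitioned` switch are hypothetical names, and replication is modelled as a plain assignment.

```python
class Node:
    def __init__(self, name, value="V0"):
        self.name = name
        self.value = value


class ReplicatedPair:
    """Two nodes; writes go to node 1 and are replicated to node 2."""

    def __init__(self, synchronous):
        self.node1 = Node("Node 1")
        self.node2 = Node("Node 2")
        self.synchronous = synchronous
        self.partitioned = False

    def write(self, value):
        """Service A writes through node 1. Returns True if the write is accepted."""
        if self.partitioned:
            if self.synchronous:
                return False          # cannot replicate, so the write is rejected
            self.node1.value = value  # accepted locally; the replica is left behind
            return True
        self.node1.value = value
        self.node2.value = value
        return True

    def read(self):
        """Service B reads from node 2."""
        return self.node2.value


# Asynchronous replication: stays available, becomes inconsistent.
ap = ReplicatedPair(synchronous=False)
ap.partitioned = True
ap_accepted = ap.write("V1")
ap_views = (ap.node1.value, ap.read())

# Synchronous replication: stays consistent, becomes unavailable for writes.
cp = ReplicatedPair(synchronous=True)
cp.partitioned = True
cp_accepted = cp.write("V1")
cp_views = (cp.node1.value, cp.read())
```

In the first case A sees V1 while B sees V0; in the second both see V0, because the write never happened.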
What does it all mean?
The network is not reliable
• Distributed systems must handle partitions
• Any modern system runs on >1 nodes…
• … and is therefore distributed
• Ergo, you have to choose:
– Consistency over availability
– Availability over consistency
Granularity
• Real systems comprise many operations
– “Add book to cart”
– “Pay for the book”
• Each has different properties
• It’s a spectrum, not a binary choice!
[Diagram: a spectrum from Consistency to Availability, with Shopping Cart and Checkout placed at different points]
CAP IN THE REAL WORLD
Kyle “Aphyr” Kingsbury
Breaking consistency guarantees since 2013
PostgreSQL
• Traditional RDBMS
– Transactional
– ACID compliant
• Primarily a CP system
– Writes against a master node
• “Not a distributed system”
– Except with a client at play!
PostgreSQL
• Writes are a simplified 2PC:
– Client votes to commit
– Server validates transaction
– Server stores changes
– Server acknowledges commit
– Client receives acknowledgement
[Diagram: commit flow between Client, Server and Store]
PostgreSQL
• But what if the ack is never received?
• The commit is already stored…
• … but the client has no indication!
• The system is in an inconsistent state
[Diagram: Client, Server and Store; the acknowledgement to the client is lost]
PostgreSQL
• Let’s experiment!
• 5 clients write to a PostgreSQL instance
• We then drop the server from the network
• Results:
– 1000 writes
– 950 acknowledged
– 952 survivors
So what can we do?
1. Accept false-negatives
– May not be acceptable for your use case!
2. Use idempotent operations
3. Apply unique transaction IDs
– Query state after partition is resolved
• These strategies apply to any RDBMS
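Strategies 2 and 3 can be combined, as in the following rough sketch. It uses an in-memory stand-in rather than a real database driver: `Server`, `LostAck`, `drop_next_ack` and `write_with_retry` are hypothetical names, and the lost acknowledgement is simulated by raising an exception after the commit has been stored.

```python
import uuid


class LostAck(Exception):
    """Simulates the client never receiving the acknowledgement."""


class Server:
    def __init__(self):
        self.committed = {}          # txn_id -> payload
        self.drop_next_ack = False   # simulates a partition right after the commit

    def commit(self, txn_id, payload):
        # Idempotent: committing the same txn_id twice stores it only once.
        self.committed.setdefault(txn_id, payload)
        if self.drop_next_ack:
            self.drop_next_ack = False
            raise LostAck(txn_id)
        return "ack"

    def is_committed(self, txn_id):
        return txn_id in self.committed


def write_with_retry(server, payload, attempts=3):
    """Reuse one client-generated ID so a lost ack cannot cause a duplicate."""
    txn_id = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            server.commit(txn_id, payload)
            return txn_id
        except LostAck:
            # Outcome unknown: query the state once the server is reachable again.
            if server.is_committed(txn_id):
                return txn_id
    raise RuntimeError("write outcome still unknown")


server = Server()
server.drop_next_ack = True
txn = write_with_retry(server, {"book": "only copy", "qty": 1})
```

Because the ID is generated by the client before the first attempt, both the status query and a blind resend are safe.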
MongoDB
• A document-oriented database
• Availability/scale via replica sets
– Client writes to a master node
– Master replicates writes to n replicas
• User-selectable consistency guarantees
MongoDB
• When a partition occurs:
– If the master is in the minority, it is demoted
– The majority promotes a new master…
– … selected by the highest optime
MongoDB
• The cluster “heals” after partition resolution:
– The “old” master rejoins the cluster
– Acknowledged minority writes are reverted!
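The demote, promote and revert sequence can be illustrated with a deliberately simplified model. This is not MongoDB's actual election or rollback protocol: `Member`, `elect` and `rejoin` are hypothetical, and the optime is approximated by the length of each member's write log.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    log: list = field(default_factory=list)   # ordered writes

    @property
    def optime(self):
        # Stand-in for the real optime: simply the number of writes applied.
        return len(self.log)


def elect(partition, cluster_size):
    """Return the new master for a partition, or None if it lacks a majority."""
    if len(partition) <= cluster_size // 2:
        return None
    return max(partition, key=lambda m: m.optime)


def rejoin(old_master, new_master):
    """Discard writes on the old master that the new master never saw."""
    common = 0
    for mine, theirs in zip(old_master.log, new_master.log):
        if mine != theirs:
            break
        common += 1
    rolled_back = old_master.log[common:]
    old_master.log = list(new_master.log)
    return rolled_back


a, b, c, d, e = [Member(n, ["w1", "w2"]) for n in "ABCDE"]
minority, majority = [a, b], [c, d, e]

a.log.append("w3-minority")        # old master A accepts a write during the partition
demoted = elect(minority, 5) is None
new_master = elect(majority, 5)
new_master.log.append("w3-majority")
lost = rejoin(a, new_master)       # the partition heals and A rejoins
```

Here `lost` holds the write that A accepted while cut off from the majority, which is the data-loss window the experiments below measure.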
MongoDB
• Let’s experiment!
• Set up a 5-node MongoDB cluster
• 5 clients write to the cluster
• We then partition the cluster
• … and restore it to see what happens
MongoDB
• With write concern unacknowledged:
– Server does not ack writes (except TCP)
– The default prior to November 2012
• Results:
– 6000 writes
– 5700 acknowledged
– 3319 survivors
– 42% data loss!
MongoDB
• With write concern acknowledged:
– Server acknowledges writes (after store)
– The default guarantee
• Results:
– 6000 writes
– 5900 acknowledged
– 3692 survivors
– 37% data loss!
MongoDB
• With write concern replica acknowledged:
– Client specifies minimum replicas
– Server acks after writes to replicas
• Results:
– 6000 writes
– 5695 acknowledged
– 3768 survivors
– 33% data loss!
MongoDB
• With write concern majority:
– For an n-node cluster, requires acknowledgement from a majority of nodes (more than n/2)
– Also called “quorum”
• Results:
– 6000 writes
– 5700 acknowledged
– 5701 survivors
– No data loss
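The reason a majority write concern avoids the loss seen earlier is arithmetic: at most one side of a partition can hold a majority, so a write acknowledged by a majority is always known to the side that elects the next master. A small sketch, with hypothetical helper names:

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of an n-node cluster."""
    return n // 2 + 1


def write_is_durable(acks: int, n: int) -> bool:
    """Under a majority write concern, a write counts only once a majority has it."""
    return acks >= majority(n)


def sides_with_quorum(sizes):
    """Given the sizes of the sides of a partition, return those that keep a quorum."""
    n = sum(sizes)
    return [size for size in sizes if size >= majority(n)]


quorum_of_five = majority(5)                   # 3
two_three_split = sides_with_quorum([2, 3])    # only the 3-node side keeps a quorum
three_way_split = sides_with_quorum([2, 2, 1]) # no side keeps a quorum
```

The price is latency and availability: every write waits for a majority, and a cluster split three ways as above accepts no writes at all.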
So what can we do?
1. Keep calm and carry on
– As Aphyr puts it, “not all applications need consistency”
– Have a reliable backup strategy
– … and make sure you drill restores!
2. Use write concern majority
– And take the performance hit
The prime suspects
• Aphyr’s Jepsen tests include:
– Redis
– Riak
– Zookeeper
– Kafka
– Cassandra
– RabbitMQ
– etcd (and consul)
– ElasticSearch
• If you’re considering them, go read his posts
• In fact, go read his posts regardless
http://aphyr.com/tags/jepsen
STRATEGIES FOR DISTRIBUTED SYSTEMS
Immutable Data
• Immutable (adj.): “Unchanging over time or unable to be changed.”
• Meaning:
– No deletes
– No updates
– No merge conflicts
– Replication is trivial
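A minimal sketch of why replication becomes trivial: if records are only ever added, replicating is a set union and can be repeated safely. `Event` and `EventStore` are hypothetical names, and the sketch assumes each event carries a globally unique ID assigned by its writer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str     # globally unique, assigned once by the writer (assumption)
    payload: str


class EventStore:
    """Append-only store: events are added, never updated or deleted."""

    def __init__(self):
        self._events = {}

    def append(self, event):
        # Re-appending a known event is a no-op, so replication can be retried freely.
        self._events.setdefault(event.event_id, event)

    def replicate_from(self, other):
        for event in other._events.values():
            self.append(event)

    def ids(self):
        return set(self._events)


left, right = EventStore(), EventStore()
left.append(Event("e1", "book added to cart"))
right.append(Event("e2", "book viewed"))

left.replicate_from(right)
right.replicate_from(left)
right.replicate_from(left)   # repeating replication changes nothing
```

Both stores end up with the same two events regardless of the order or number of replication rounds.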
Idempotence
• An idempotent operation:
– Can be applied one or more times with the same effect
• Enables retries
• Not always possible
– Side-effects are key
– Consider: payments
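The difference shows up as soon as a client retries. In this illustrative sketch (all function names are hypothetical), "set the quantity to 1" survives retries while "add one more" does not:

```python
def set_quantity(cart, item, qty):
    """Idempotent: applying it once or many times leaves the same cart."""
    updated = dict(cart)
    updated[item] = qty
    return updated


def add_one(cart, item):
    """Not idempotent: every retry adds another copy."""
    updated = dict(cart)
    updated[item] = updated.get(item, 0) + 1
    return updated


def apply_with_retries(operation, state, times):
    # Models a client that resends the same request `times` times after timeouts.
    for _ in range(times):
        state = operation(state)
    return state


once = apply_with_retries(lambda c: set_quantity(c, "book", 1), {}, 1)
thrice = apply_with_retries(lambda c: set_quantity(c, "book", 1), {}, 3)
duplicated = apply_with_retries(lambda c: add_one(c, "book"), {}, 3)
```

`once` and `thrice` are identical, whereas `duplicated` contains three books. A payment behaves like `add_one`: its side effect must not be repeated.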
Eventual Consistency
• A design which prefers availability
• … but guarantees that clients will eventually see consistent reads
• Consider git:
– Always available locally
– Converges via push/pull
– Human conflict resolution
Eventual Consistency
• The system expects data to diverge
• … and includes mechanisms to regain convergence
– Partial ordering to minimize conflicts
– A merge function to resolve conflicts
Vector Clocks
• A technique for partial ordering
• Each node has a logical clock
– The clock increases on every write
– Track the last observed clocks for each item
– Include this vector on replication
• When neither the observed nor the inbound vector descends from the other, we have a conflict
• This lets us know when history diverged
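A compact sketch of the comparison, representing each vector clock as a dictionary from node name to counter. The function names and the string results are illustrative choices, not taken from any particular database.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one write."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen everything that clock `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"      # concurrent writes: history diverged


def merge(a, b):
    """Element-wise maximum, used once a conflict has been resolved."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


base = increment({}, "n1")        # {"n1": 1}
on_n1 = increment(base, "n1")     # {"n1": 2}
on_n2 = increment(base, "n2")     # {"n1": 1, "n2": 1}

ordered = compare(on_n1, base)    # "a is newer"
diverged = compare(on_n1, on_n2)  # "conflict"
resolved = merge(on_n1, on_n2)    # {"n1": 2, "n2": 1}
```

The clocks only detect the divergence; deciding which value wins is left to the merge function or to a human.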
CRDTs
• Commutative Replicated Data Types¹
• A CRDT is a data structure that:
– Eventually converges to a consistent state
– Guarantees no conflicts on replication
1 “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al
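As a small illustration of both properties, here is a sketch of a grow-only counter, one of the simplest CRDTs. The class layout is an assumption made for this example; the merge rule (element-wise maximum of per-node counts) is the standard one.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise maximum."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # max is commutative, associative and idempotent, so merge order does not matter.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()
b.increment()

a.merge(b)
b.merge(a)
b.merge(a)   # merging again is harmless
```

After the merges both replicas report 3, whatever the order or number of merges.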
CRDTs
• CRDTs provide specialized semantics:
– G-Counter: Monotonically increasing counter
– PN-Counter: Also supports decrements
– G-Set: A set that only supports adds
– 2P-Set: Supports removals, but a removed element can never be added again
• OR-Sets are particularly useful
– Keeps track of both additions and removals
– Can be used for shopping carts
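A rough sketch of an OR-Set used as a shopping cart follows. It tags every addition with a unique ID and lets a removal cancel only the tags it has observed, so a concurrent re-add survives the merge. The `ORSet` class and the "phone" and "laptop" replicas are hypothetical, and tombstones are never garbage-collected here.

```python
import uuid


class ORSet:
    """Observed-remove set: a remove cancels only the add-tags it has seen."""

    def __init__(self):
        self.added = {}        # element -> set of unique add-tags
        self.removed = set()   # tombstoned tags

    def add(self, element):
        self.added.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Tombstone only locally observed tags; concurrent adds elsewhere survive.
        self.removed |= self.added.get(element, set())

    def contains(self, element):
        return bool(self.added.get(element, set()) - self.removed)

    def elements(self):
        return {e for e in self.added if self.contains(e)}

    def merge(self, other):
        for element, tags in other.added.items():
            self.added.setdefault(element, set()).update(tags)
        self.removed |= other.removed


phone, laptop = ORSet(), ORSet()
phone.add("book")
phone.add("mug")
laptop.merge(phone)       # both replicas now see the book and the mug

laptop.remove("book")     # removed on the laptop...
laptop.remove("mug")
phone.add("book")         # ...while the book is re-added concurrently on the phone

phone.merge(laptop)
laptop.merge(phone)
```

Both carts converge to containing only the book: the mug's removal was observed everywhere, while the concurrent re-add of the book wins over its removal.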
Questions?
Complaints?
WE’RE DONE HERE!
Thank you for listening
@tomerg
http://il.linkedin.com/in/tomergabel
Aphyr’s “Call Me Maybe” blog posts:
http://aphyr.com/tags/jepsen