Date post: | 16-Apr-2017 |
Category: | Technology |
Upload: | sargun-dhillon |
Internet Traffic vs. Penetration
[Chart: IP Traffic (PB/mo, axis 0 to 40,000) and Global Penetration (%, axis 0 to 100), by year, 2000 to 2012]
Biggest AWS Database
• vCPUs: 32
• Memory: 244 GB
• Storage: 3 TB
• IOPS: 30,000
• Networking: 10 Gigabit
• Resiliency: Multi-AZ
• SLA: 99.95%
• Backend: PostgreSQL
-F1: A Distributed SQL Database That Scales, Google
“Because the data is synchronously replicated across multiple datacenters, and because
we’ve chosen widely distributed datacenters, the commit latencies are relatively high (50-150
ms).”
-Kohavi and Longbotham 2007
“Every 100 ms increase in load time of Amazon.com decreased sales by 1%.”
(~$120M of losses per 100 ms)
“Average partition duration ranged from 6 minutes for software-related failures to more than 8.2 hours for
hardware-related failures (median 2.7 and 32 minutes; 95th percentile of 19.9 minutes and 3.7 days,
respectively).” -The Network is Reliable
WANs Fail
-F1: A Distributed SQL Database That Scales, Google
“We also have a lot of experience with eventual consistency systems at Google. In all such
systems, we find developers spend a significant fraction of their time building
extremely complex and error-prone mechanisms to cope with eventual consistency
and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems
should be solved at the database level.”
“A shared-data system can have at most two of the three following properties:
Consistency, Availability, and tolerance to network Partitions.”
-Dr. Eric Brewer
On Consistency
• ACID Consistency: Any transaction or operation will bring the database from one valid state to another
• CAP Consistency: All nodes see the same data at the same time (synchrony)
On Partition Tolerance
• The network will be allowed to lose arbitrarily many messages sent from one node to another.
• Database systems, in order to be useful, must communicate over the network
• Clients count
There is no such thing as a 100% reliable network:
Can’t choose CA
http://codahale.com/you-cant-sacrifice-partition-tolerance
PNUTS
• Paper released by Yahoo! Research in 2008
• Operations:
• Read-Any
• Read-Critical(Required-Version)*
• Read-Latest
• Write
• Test-and-set-write(Required-Version)
* Will fall back to CP operation
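The difference between these calls can be sketched with a toy single-record model. This is only an illustration under stated assumptions: the `Record` class, its method names, and the single "local replica" are hypothetical and are not the PNUTS API. The sketch assumes each record has one master copy and that a read needing a fresher version than the local replica holds falls back to the master.

```python
class StaleVersionError(Exception):
    """Raised when a test-and-set write names a version that is no longer current."""


class Record:
    """Hypothetical model: one master copy plus one possibly stale local replica."""

    def __init__(self, value):
        self.master = (1, value)  # (version, value) held by the record's master
        self.local = (1, value)   # (version, value) held by the nearby replica

    def read_any(self):
        # Cheapest read: whatever the local replica has, possibly stale.
        return self.local

    def read_critical(self, required_version):
        # Serve locally if the replica is fresh enough, otherwise ask the master.
        if self.local[0] >= required_version:
            return self.local
        return self.read_latest()

    def read_latest(self):
        # Always consult the master copy.
        return self.master

    def write(self, value):
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def test_and_set_write(self, required_version, value):
        # Write only if the record is still at the version the caller read.
        if self.master[0] != required_version:
            raise StaleVersionError(required_version)
        return self.write(value)

    def replicate(self):
        # Asynchronous replication catching up, modelled as one explicit step.
        self.local = self.master


rec = Record("a")
new_version = rec.write("b")              # the master moves to version 2
stale = rec.read_any()                    # the local replica still has version 1
fresh = rec.read_critical(new_version)    # falls back to the master
```

In this model only `read_any` can return stale data, which is why it is the one call that stays cheap and available when the master is far away or unreachable.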
“This is a specific form of weak consistency; the storage system
guarantees that if no new updates are made to the object,
eventually all accesses will return the last updated value.”
Definition of “Eventual Consistency” from “Eventually Consistent - Revisited” - Werner Vogels
State
*****:test> SELECT * FROM users;
 user_name | friends  | posts
-----------+----------+-------
    sargun | {'BOSS'} |  null

State at DC2 & DC3
*****:test> SELECT * FROM users;
 user_name | friends  | posts
-----------+----------+-------
    sargun | {'BOSS'} |  null

State at DC2 & DC3
*****:test> SELECT * FROM users;
 user_name | friends  | posts
-----------+----------+-----------
    sargun | {'BOSS'} | {'PARTY'}
Strong Eventual Consistency
“Any set of nodes that have received the same (unordered) set of updates
will be in the same state.”
Vector Clocks
• Extension of Lamport Clocks
• Used to detect cause and effect in distributed systems
• Can determine concurrency of events, and causality violations
• Preserves happened-before (h.b.) relationships
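A minimal sketch of how a vector clock detects ordering and concurrency, assuming clocks are plain dictionaries from node name to counter. The function names and the `dc1`/`dc2`/`dc3` node names are illustrative and not taken from any particular database.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s entry advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Pointwise maximum: the clock of a node that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clocks a and b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither saw the other: a conflict to resolve


base = increment({}, "dc1")        # a write at dc1
left = increment(base, "dc2")      # dc2 updates after seeing dc1's write
right = increment(base, "dc3")     # dc3 updates without seeing dc2's write

print(compare(base, left))         # before
print(compare(left, right))        # concurrent
```

Two clocks are concurrent exactly when each has an entry the other has not caught up to, which is the case a Lamport clock alone cannot distinguish from a real ordering.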
CRDTs
• CRDTs:
• Convergent Replicated Data Types
• Commutative Replicated Data Types
• Enable data structures to stay writable on both sides of a partition, and to replay updates after the partition heals
• Enable distributed computation across monotonic functions
• Two Types:
• CvRDTs
• CmRDTs
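As an illustration of the state-based (CvRDT) flavour, here is a sketch of a grow-only counter. The `GCounter` class and the replica names are hypothetical. Each replica keeps one slot per replica, and `merge` takes the pointwise maximum, which is commutative, associative, and idempotent, so replicas that have exchanged state in any order end up equal.

```python
class GCounter:
    """State-based grow-only counter: one monotonically growing slot per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Pointwise max: safe to apply repeatedly and in any order.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)


a = GCounter("dc1")
b = GCounter("dc2")
a.increment(2)      # both sides accept writes during a partition
b.increment(3)

a.merge(b)          # healing: exchange full state in both directions
b.merge(a)
a.merge(b)          # a duplicate merge changes nothing

print(a.value(), b.value())   # 5 5
```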
CmRDTs
• Op / method based CRDTs
• Size grows monotonically
• Uses version vectors to determine order of operations
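An operation-based counterpart can be sketched as follows, assuming operations from a given replica are delivered in the order they were issued, possibly more than once. The `OpCounter` class and the `(origin, sequence, delta)` operation format are illustrative assumptions. The version vector records the last sequence number applied per origin, so a replayed operation is recognised and ignored.

```python
class OpCounter:
    """Op-based counter: replicas ship operations instead of full state."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.total = 0
        self.seen = {}  # version vector: origin replica -> last sequence applied

    def local_update(self, delta):
        """Apply a local change and return the operation to broadcast."""
        seq = self.seen.get(self.replica_id, 0) + 1
        op = (self.replica_id, seq, delta)
        self.apply(op)
        return op

    def apply(self, op):
        """Apply an operation exactly once; return False for a duplicate."""
        origin, seq, delta = op
        last = self.seen.get(origin, 0)
        if seq <= last:
            return False                      # already applied, ignore replay
        if seq != last + 1:
            # Assumption of this sketch: per-origin in-order delivery.
            raise ValueError("operation delivered out of order")
        self.seen[origin] = seq
        self.total += delta                   # addition commutes across origins
        return True


a = OpCounter("dc1")
b = OpCounter("dc2")
op_a = a.local_update(5)     # issued on one side of a partition
op_b = b.local_update(-2)    # issued on the other side

b.apply(op_a)                # after healing, replay the missed operations
a.apply(op_b)
duplicate = a.apply(op_b)    # redelivery is detected via the version vector

print(a.total, b.total)      # 3 3
```

The version vector only ever grows, which matches the slide's note that the size of a CmRDT's metadata grows monotonically.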
CRDTs in the Wild
• Sets
• Observe-remove set
• Grow-only sets
• Counters
• Grow-only counters
• PN-Counters
• Flags
• Maps
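To make one of these concrete, here is a sketch of a state-based observe-remove set. It assumes every add is tagged with a fresh unique identifier and that removes are kept as tombstones; the `ORSet` class is hypothetical and simpler than what a production database would use. A remove only cancels the adds it has observed, so an add made concurrently on another replica survives the merge.

```python
import uuid


class ORSet:
    """Observe-remove set: a remove only cancels the add tags it has seen."""

    def __init__(self):
        self.adds = {}        # element -> set of unique add tags
        self.removed = set()  # tombstones: add tags that have been removed

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Tombstone only the tags this replica currently knows about.
        self.removed |= self.adds.get(element, set())

    def contains(self, element):
        return bool(self.adds.get(element, set()) - self.removed)

    def value(self):
        return {e for e in self.adds if self.contains(e)}

    def merge(self, other):
        # Union of adds and union of tombstones; both only ever grow.
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        self.removed |= other.removed


a, b = ORSet(), ORSet()
a.add("Boss")
b.merge(a)           # both replicas now contain "Boss"

b.remove("Boss")     # during a partition: b removes the element it observed
a.add("Boss")        # while a concurrently adds it again with a new tag

a.merge(b)
b.merge(a)
print(a.value(), b.value())   # {'Boss'} {'Boss'}
```

The tombstone set never shrinks in this sketch, which is one reason garbage collection for CRDTs is hard.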
Data structures that are CRDTs
• Probabilistic, convergent data structures
• HyperLogLog
• Bloom filter
• Co-recursive folding functions
• Maximum-counter
• Running Average
• Operational Transform
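Two of the simpler entries can be sketched directly. The code below is one possible reading of "maximum-counter" and "running average", not the speaker's implementation, and the class names are hypothetical. The maximum merges with `max`. The running average keeps a per-replica `(count, total)` pair so that merging never double-counts an observation.

```python
class MaxRegister:
    """Converges to the largest value observed on any replica."""

    def __init__(self):
        self.value = float("-inf")

    def observe(self, x):
        self.value = max(self.value, x)

    def merge(self, other):
        self.value = max(self.value, other.value)


class RunningAverage:
    """Average built from per-replica (count, total) pairs."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.parts = {}  # replica -> (count, total)

    def observe(self, x):
        count, total = self.parts.get(self.replica_id, (0, 0.0))
        self.parts[self.replica_id] = (count + 1, total + x)

    def merge(self, other):
        for replica, part in other.parts.items():
            # A replica's count only grows, so the larger count is the newer pair.
            if part[0] > self.parts.get(replica, (0, 0.0))[0]:
                self.parts[replica] = part

    def value(self):
        count = sum(c for c, _ in self.parts.values())
        total = sum(t for _, t in self.parts.values())
        return total / count if count else None


high_a, high_b = MaxRegister(), MaxRegister()
high_a.observe(7)
high_b.observe(42)
high_a.merge(high_b)

avg_a, avg_b = RunningAverage("dc1"), RunningAverage("dc2")
avg_a.observe(10)
avg_a.observe(20)
avg_b.observe(60)
avg_a.merge(avg_b)
avg_b.merge(avg_a)
avg_a.merge(avg_b)   # repeated merges do not double-count

print(high_a.value, avg_a.value(), avg_b.value())   # 42 30.0 30.0
```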
CRDTs
• Incredibly powerful primitive
• Not only useful for in-database manipulation but also for client-database interaction
• You can compose them, and build your own
• Garbage collection is tricky
Model
curl -s http://localhost:8098/types/test/buckets/test/datatypes/sargun | python -mjson.tool
{
    "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq",
    "type": "map",
    "value": {
        "friends_set": [
            "Boss"
        ],
        "posts_set": []
    }
}

“Primary Key”
curl -s http://localhost:8098/types/test/buckets/test/datatypes/sargun | python -mjson.tool
{
    "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq",
    "type": "map",
    "value": {
        "friends_set": [
            "Boss"
        ],
        "posts_set": []
    }
}

Causal Context
curl -s http://localhost:8098/types/test/buckets/test/datatypes/sargun | python -mjson.tool
{
    "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq",
    "type": "map",
    "value": {
        "friends_set": [
            "Boss"
        ],
        "posts_set": []
    }
}
Update
curl -XPOST http://localhost:8098/types/test/buckets/test/datatypes/sargun \
  -H "Content-Type: application/json" \
  -H "X-Riak-Vclock: g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq" \
  -d '
  {
    "update": {
      "friends_set": {
        "remove": "Boss"
      }
    }
  }'
Updated Entries (during partition)
{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq", "type": "map", "value": { "friends_set": [ "Boss" ], "posts_set": [] } }
{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQtq", "type": "map", "value": { "friends_set": [], "posts_set": [] } }
Update
curl -XPOST http://localhost:8098/types/test/buckets/test/datatypes/sargun \
  -H "Content-Type: application/json" \
  -H "X-Riak-Vclock: g2wAAAABaAJtAAAACBjtDYuvG6A4YQtq" \
  -d '
  {
    "update": {
      "posts_set": {
        "add": "Party"
      }
    }
  }'
Updated Entries (After Healing)
{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQ5q", "type": "map", "value": { "friends_set": [], "posts_set": [ "Party" ] } }
{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQ5q", "type": "map", "value": { "friends_set": [], "posts_set": [ "Party" ] } }
Operations Requiring Weak Consistency vs. Strong Consistency
Invariant          | Operation | AP / CP
-------------------+-----------+--------
Specify unique ID  | Any       | CP
Generate unique ID | Any       | AP
>                  | INCREMENT | AP
>                  | DECREMENT | CP
<                  | INCREMENT | CP
<                  | DECREMENT | AP
Secondary Index    | Any       | AP
Materialized View  | Any       | AP
AUTO_INCREMENT     | INSERT    | CP
Linearizability    | CAS       | CP
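The `>` rows can be illustrated with a small sketch, assuming the invariant is "the value must stay above zero" and the counter is a PN-Counter replicated across two sites. The `PNCounter` class and the `guarded_decrement` helper are hypothetical. Each replica checks the invariant against its local view only, which is all an AP system can do during a partition.

```python
class PNCounter:
    """Counter built from two grow-only maps: increments and decrements."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.incs = {}
        self.decs = {}

    def increment(self, n=1):
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

    def decrement(self, n=1):
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        for mine, theirs in ((self.incs, other.incs), (self.decs, other.decs)):
            for replica, count in theirs.items():
                mine[replica] = max(mine.get(replica, 0), count)


def guarded_decrement(counter, n, floor=0):
    """Decrement only if the LOCAL view would stay above `floor`."""
    if counter.value() - n > floor:
        counter.decrement(n)
        return True
    return False


a, b = PNCounter("dc1"), PNCounter("dc2")
a.increment(3)
b.merge(a)                       # both replicas agree the value is 3

ok_a = guarded_decrement(a, 2)   # during a partition each side sees 3 - 2 = 1
ok_b = guarded_decrement(b, 2)   # so both local checks pass

a.merge(b)                       # after healing the decrements add up
b.merge(a)
print(a.value())                 # -1
```

Concurrent increments can never push the value below the bound, so they are safe without coordination. Concurrent decrements can each look safe locally and still break the invariant once merged, which is why that row is marked CP.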
BASE not ACID
• Basically Available: There will be a response per request (failure or success)
• Soft State: Any two reads against the system may yield different data (when measured against time)
• Eventually Consistent: The system will eventually become consistent when all failures have healed, and time goes to infinity
Technology Timeline
• 1996 - Log-structured merge tree
• 2000 - CAP Theorem
• 2007 - Amazon Dynamo Paper
• 2011 - INRIA CRDT Technical Report
• 2014 - Riak DT map: a composable, convergent replicated dictionary
Further Reading
• Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
• PNUTS: Yahoo!’s Hosted Data Serving Platform
• F1: A Distributed SQL Database That Scales
• Spanner: Google's Globally-Distributed Database
• The Network is Reliable: An informal survey of real-world communications failures
• A comprehensive study of Convergent and Commutative Replicated Data Types
• Riak DT Map: A Composable, Convergent Replicated Dictionary
Get in Touch
• If you’re interested in cheating the speed of light
• Come use our software
• If you’re interested in solving today’s computer science problems
• Come work for us
• If you’d like to learn more about distributed systems at scale
• Maybe you have a better idea