Post on 28-Mar-2015
Transactional storage for geo-replicated systems
Yair Sovran, Russell Power, Marcos K. Aguilera, Jinyang Li
NYU and MSR SVC
Life in a web startup
Web apps need geo-replicated storage
Geo-replicated transactional storage
Consistency vs. performance: existing tradeoffs
Eventual Consistency
Less coordination, more anomalies
More coordination, fewer anomalies
Serializability
• Maximize multi-site performance
• Have few anomalies
Snapshot Isolation
Our contribution
1. New semantics: Parallel Snapshot Isolation (PSI)
2. Walter: implementing PSI efficiently
– Preferred site
– Counting set
3. Application experience
Snapshot isolation
Timeline of storage state
Read-X Write-X Commit
Read-Y Write-Y Commit
• Snapshot isolation’s guarantees
1. Read snapshots from global timeline
2. Prohibit write-write conflict
3. Preserve causality
T1
T2
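The three guarantees above can be illustrated with a toy single-site, in-memory store. This is a minimal sketch under stated assumptions, not Walter's implementation: `SIStore`, `Tx`, `Conflict`, and all their methods are hypothetical names invented here. The check in `commit` is a simple first-committer-wins rule.

```python
class Conflict(Exception):
    """Raised when a transaction loses a write-write conflict."""


class SIStore:
    """Toy store with one global timeline of committed versions (hypothetical)."""

    def __init__(self):
        self.clock = 0        # current position on the global timeline
        self.versions = {}    # key -> list of (commit_ts, value), oldest first

    def begin(self):
        # A transaction's snapshot is the timeline position at start.
        return Tx(self, self.clock)

    def read_at(self, key, ts):
        # Latest version committed at or before the snapshot timestamp.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def last_commit_ts(self, key):
        history = self.versions.get(key, [])
        return history[-1][0] if history else 0


class Tx:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}      # buffered writes, invisible until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store.read_at(key, self.start_ts)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort if a written key was committed by another transaction
        # after this transaction took its snapshot.
        for key in self.writes:
            if self.store.last_commit_ts(key) > self.start_ts:
                raise Conflict(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))


store = SIStore()

# T1 and T2 write different objects, so both commit.
t1, t2 = store.begin(), store.begin()
t1.write("X", 1)
t2.write("Y", 2)
t1.commit()
t2.commit()

# T3 and T4 both write X from the same snapshot: the second committer aborts.
t3, t4 = store.begin(), store.begin()
t3.write("X", 10)
t4.write("X", 20)
t3.commit()
try:
    t4.commit()
    t4_aborted = False
except Conflict:
    t4_aborted = True
```

Because T2 reads from the snapshot it took at start, it never observes T1's write to X even after T1 commits.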
PSI avoids global transaction ordering
Site1
Site2
Site1 timeline
Site2 timeline
Read-X Write-X Commit
Read-Y Write-Y Commit
A transaction commits locally first, then propagates to remote sites.
T1
T2
Walter achieves this efficiently
• Parallel snapshot isolation’s guarantees
1. Read snapshots from per-site timeline (snapshot isolation: global timeline)
2. Prohibit write-write conflict
3. Preserve causality
PSI has few anomalies
Anomaly               Serializability   Snapshot Isolation   PSI   Eventual
dirty read            No                No                   No    Yes
non-repeatable read   No                No                   No    Yes
lost update           No                No                   No    Yes
short fork            No                Yes                  Yes   Yes
long fork             No                No                   Yes   Yes
conflicting fork      No                No                   No    Yes
PSI’s anomaly
T1
T2
Short fork
(allowed by snapshot isolation)
T1 commits
T2 commits
Long fork
(disallowed by snapshot isolation)
T1
T2
T1 commits
T2 commits
T1 and T2 propagate to both sites
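A long fork can be reproduced in a toy two-site model where each site keeps its own timeline. This is only a sketch: `Site`, `apply`, and `snapshot` are hypothetical names, and propagation is modeled as a direct call rather than a network protocol.

```python
class Site:
    """Toy replica: its own timeline plus the state that timeline produces."""

    def __init__(self, name):
        self.name = name
        self.timeline = []    # transaction ids in the order this site applied them
        self.state = {}

    def apply(self, tx_id, writes):
        self.timeline.append(tx_id)
        self.state.update(writes)

    def snapshot(self):
        return dict(self.state)


site1, site2 = Site("Site1"), Site("Site2")

# T1 commits locally at Site1 and T2 commits locally at Site2.
# Their write sets are disjoint, so there is no write-write conflict.
t1 = ("T1", {"X": 1})
t2 = ("T2", {"Y": 1})
site1.apply(*t1)
site2.apply(*t2)

# Before propagation, readers at the two sites observe different forks.
seen_at_site1 = site1.snapshot()    # contains T1's write but not T2's
seen_at_site2 = site2.snapshot()    # contains T2's write but not T1's

# T1 and T2 then propagate to the other site.
site1.apply(*t2)
site2.apply(*t1)
```

The two timelines end up ordered differently (T1, T2 at Site1 and T2, T1 at Site2), yet the states converge because the writes do not overlap. With a single global timeline, the two intermediate snapshots could not both be observed.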
Walter overview
Client (C) operations:
• Start_TX
• Commit_TX
• Read
• Write
Walter servers (Site1, Site2):
• Replicate data
• Coordinate for PSI
• Main challenge: avoid write-write conflict across sites
• Walter’s solution
1. Preferred site
2. Counting set
Technique #1: preferred site
• Associate each user’s data with a preferred site
– Common case: write at preferred site → fast commit
– Rare case: write at non-preferred site → cross-site 2-phase commit
[Figure: Site1 and Site2 each hold Alice’s photos and Bob’s photos; Alice and Bob write through clients (C). A write at the data’s preferred site is a fast commit; a write at the other site is a slow commit.]
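The choice between the two commit paths can be sketched as a lookup on the objects a transaction writes. This is an illustration, not Walter's code: `commit_path` is a hypothetical helper, the per-object mapping is an assumption (the slide speaks of each user's data), and the site placements below are made up.

```python
def commit_path(preferred_site, written_objects, local_site):
    """Return "fast" if every written object is preferred at local_site, else "slow"."""
    for obj in written_objects:
        if preferred_site[obj] != local_site:
            # At least one object belongs to another site:
            # this would need a cross-site two-phase commit.
            return "slow"
    # All written objects are preferred here: commit locally.
    return "fast"


# Assumed placement, for illustration only.
preferred_site = {"alice/photos": "Site1", "bob/photos": "Site2"}

alice_at_preferred = commit_path(preferred_site, ["alice/photos"], "Site1")
alice_elsewhere = commit_path(preferred_site, ["alice/photos"], "Site2")
mixed_write_set = commit_path(preferred_site, ["alice/photos", "bob/photos"], "Site1")
```

In this sketch a single non-preferred object in the write set is enough to force the slow path.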
Technique #2: counting set
• Problem: some objects are modified from many sites
• Counting set: a data type free of write-write conflict
[Figure: Alice (Site 1) and Bob (Site 2) each befriend Eve through clients (C), so both sites write to Eve’s friendlist.]
Technique #2: counting set
• Add/del operations commute → no need to check for write-write conflict
• Caveat: application developers must deal with counts
[Figure: Alice (Site1) and Bob (Site2) each befriend Eve. One site applies add(“Alice”) and the other add(“Bob”) to its copy of Eve’s friendlist; after the adds propagate, the friendlist holds Alice 1, Bob 1.]
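A toy counting set shows why add and del commute and where the counts caveat comes from. This is a sketch under assumptions: `CountingSet` and its methods are hypothetical names, and treating "count greater than zero" as membership is one possible application policy, not something the slides prescribe.

```python
class CountingSet:
    """Toy counting set: maps each element to an integer count."""

    def __init__(self):
        self.counts = {}

    def add(self, elem):
        self.counts[elem] = self.counts.get(elem, 0) + 1

    def delete(self, elem):
        # No existence check: the count is simply decremented.
        self.counts[elem] = self.counts.get(elem, 0) - 1

    def apply(self, op):
        kind, elem = op
        if kind == "add":
            self.add(elem)
        elif kind == "del":
            self.delete(elem)
        else:
            raise ValueError(f"unknown operation: {kind}")

    def members(self):
        # Assumed application policy: an element is present if its count is positive.
        return {e for e, c in self.counts.items() if c > 0}


# Each site issues one operation on Eve's friendlist.
ops_from_site1 = [("add", "Alice")]
ops_from_site2 = [("add", "Bob")]

# Each replica applies its local operation first, then the remote one.
replica1, replica2 = CountingSet(), CountingSet()
for op in ops_from_site1 + ops_from_site2:
    replica1.apply(op)
for op in ops_from_site2 + ops_from_site1:
    replica2.apply(op)

# The caveat: two adds followed by one del leave the element with count 1.
dup = CountingSet()
dup.add("Bob")
dup.add("Bob")
dup.delete("Bob")
```

Both replicas reach the same counts regardless of application order, which is why no write-write conflict check is needed. The `dup` case shows what the application must handle: an element added twice is still present after one delete.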
Site failure
• Two options to handle a site failure
– Conservative: block writes whose preferred site failed
– Aggressive: re-assign preferred site elsewhere
Warning: Committed but not-yet-replicated transactions may be lost
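The two options can be sketched as a routing decision for a write. This is a hypothetical illustration: `route_write`, the policy strings, and the `fallback_site` parameter are invented here, and the sketch says nothing about how Walter detects a failure or chooses the new site.

```python
def route_write(obj, preferred_site, failed_sites, policy, fallback_site=None):
    """Return the site that should handle a write to obj, or None if the write is blocked."""
    site = preferred_site[obj]
    if site not in failed_sites:
        return site
    if policy == "conservative":
        # Block the write while its preferred site is down.
        return None
    if policy == "aggressive":
        # Re-assign the preferred site; transactions that committed at the
        # failed site but were not yet replicated may be lost.
        preferred_site[obj] = fallback_site
        return fallback_site
    raise ValueError(f"unknown policy: {policy}")


placement = {"alice/photos": "Site1", "bob/photos": "Site2"}
failed = {"Site1"}

healthy = route_write("bob/photos", placement, failed, "conservative")
blocked = route_write("alice/photos", placement, failed, "conservative")
moved = route_write("alice/photos", placement, failed, "aggressive", fallback_site="Site2")
```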
Application #1: WaltSocial
Wall and Friendlist are counting sets
Meow says: Meow Meow Meow
Bob-cat says: I saw a mouse
Bob-cat says: I saw a mouse
Peanut says: awldaiwdliawd
Meow says: I think I ate too much catnip last night. Meow.
Befriend transaction
A ← read Alice’s profile
B ← read Bob’s profile
Add A.uid to B.friendlist
Add B.uid to A.friendlist
Add “Alice is now friends with Bob” to A.wall
Add “Bob is now friends with Alice” to B.wall
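The befriend transaction can be written against a toy in-memory client to make the steps concrete. `ToyTx`, `read`, `set_add`, `commit`, and the key naming scheme are all hypothetical stand-ins, not Walter's actual API. The sketch buffers the counting-set additions and applies them together at commit.

```python
class ToyTx:
    """Hypothetical stand-in for a transactional client; buffers updates until commit."""

    def __init__(self, db):
        self.db = db
        self.pending = []    # buffered (cset_name, element) additions

    def read(self, key):
        return self.db["objects"][key]

    def set_add(self, cset, elem):
        self.pending.append((cset, elem))

    def commit(self):
        # Apply every buffered addition as a count increment.
        for cset, elem in self.pending:
            counts = self.db["csets"].setdefault(cset, {})
            counts[elem] = counts.get(elem, 0) + 1


def befriend(db, name_a, name_b):
    tx = ToyTx(db)
    a = tx.read(f"profile:{name_a}")
    b = tx.read(f"profile:{name_b}")
    tx.set_add(f"friendlist:{b['uid']}", a["uid"])
    tx.set_add(f"friendlist:{a['uid']}", b["uid"])
    tx.set_add(f"wall:{a['uid']}", f"{name_a} is now friends with {name_b}")
    tx.set_add(f"wall:{b['uid']}", f"{name_b} is now friends with {name_a}")
    tx.commit()


db = {
    "objects": {"profile:Alice": {"uid": 1}, "profile:Bob": {"uid": 2}},
    "csets": {},
}
befriend(db, "Alice", "Bob")
```

The transaction only reads the two profiles; every update goes to a counting set, so two befriend transactions touching the same friendlist from different sites would not conflict.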
Application #2: Twitter clone
• Third party app in PHP
• Our port: switch storage backend from Redis to Walter
Each user’s timeline is a counting set
Post-status transaction
write status to new object O
foreach f in user’s followers
    add O to f’s timeline_cset
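The fan-out in the post-status transaction can be sketched over plain dictionaries. This is an illustration only: `post_status`, `read_timeline`, the `store` layout, and the id scheme are hypothetical, and the toy applies updates directly instead of inside a real transaction.

```python
import itertools

# Hypothetical id generator for new status objects.
_next_id = itertools.count(1)


def post_status(store, user, text):
    """Write the status to a new object, then add its id to each follower's timeline cset."""
    oid = f"status:{next(_next_id)}"
    store["objects"][oid] = {"author": user, "text": text}
    for follower in store["followers"].get(user, []):
        timeline = store["timeline_cset"].setdefault(follower, {})
        timeline[oid] = timeline.get(oid, 0) + 1
    return oid


def read_timeline(store, user):
    """Return the texts of statuses whose count in the user's timeline cset is positive."""
    counts = store["timeline_cset"].get(user, {})
    return [store["objects"][oid]["text"] for oid, c in sorted(counts.items()) if c > 0]


store = {
    "objects": {},
    "followers": {"alice": ["bob", "eve"]},
    "timeline_cset": {},
}
oid = post_status(store, "alice", "hello")
```

Each follower's timeline only receives an add, so posts from users at different sites can land on the same timeline without a write-write conflict.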
Evaluation
• Walter prototype
– Implemented in C++ with PHP binding
– Custom RPC library with Protocol Buffers
• Testbed: Amazon EC2
– Extra-large instance
– Up to 4 sites (Virginia, California, Ireland, Singapore)
• Full replication across sites
Walter scales
• Read/write a 100-byte object
• Reads’ working set fits in memory
[Figure: Read and Write results]
WaltSocial achieves low latency
A post-on-wall transaction reads 2 objects, writes 2 objects, and updates 2 counting sets
Walter lets ReTwis scale to >1 sites
[Figure: Read Timeline, Post status, and Follow user operations, compared across Redis, Walter (1-site), and Walter (2-site)]
Related work
• Cloud storage systems
– Single-site: Bigtable, Sinfonia, Percolator
– No/limited transaction: Dynamo, COPS, PNUTS
– Synchronous replication: Megastore, Scatter
• Replicated database systems
– Eager vs. lazy replication
– Escrow transactions: for numeric data
• Conflict-free replicated data types
– Inspired counting sets
Conclusion
• PSI is a good tradeoff for geo-replicated storage
– Allows fast commit with asynchronous replication
– Prohibits write-write conflict and preserves causality
• Walter realizes PSI efficiently
– Preferred site
– Conflict-free counting set