Post on 16-Jul-2020
transcript
Marc Shapiro, INRIA & LIP6
Nuno Preguiça, U. Nova de Lisboa
Carlos Baquero, U. Minho
Marek Zawirski, INRIA & UPMC
Conflict-free Replicated Data Types (CRDTs)
for collaborative environments
Conflict-free Replicated Data Types
Conflict-free objects for large-scale distribution
Shared mutable data
• Read ⇒ replicate
• Updates?
Novel, principled approach: conflict-free objects
• Can we design useful object types without any synchronisation whatsoever?
• Can we build practical systems from such objects?
• Large, dynamic graph
• Incremental, parallel, asynchronous: updates, processing
Replication for beginners
Replicated data
Share data ⇒ Replicate at many locations
• Performance: local reads
• Availability: immune from network failure
• Fault-tolerance: replicate computation
• Scalability: load balancing
Updates
• Push to all replicas
• Conflicts: Consistency?
• Fault tolerance
• and parallelism too?
• Conflict!!
Strong consistency
Preclude conflicts
• All replicas execute updates in same total order
• Any deterministic object
Consensus
• Serialisation bottleneck
• Tolerates < n/2 faults
• Very general
• Correct
• Doesn't scale
• Simultaneous N-way agreement
Eventual Consistency
Update local + propagate
• No foreground synch
• Expose tentative state
• Eventual, reliable delivery
On conflict
• Arbitrate
• Roll back
Consensus moved to background
• Availability ++
• Parallelism ++
• Latency --
• Complexity ++
Strong Eventual Consistency
Update local + propagate
• No synch
• Expose intermediate state
• Eventual, reliable delivery
No conflict
• Deterministic outcome of concurrent updates
No consensus: ≤ n-1 faults
Not universal
Solves the CAP problem
• Available, responsive
• More parallelism
• No conflicts
• No rollback
The challenge: What interesting objects can we design with no synchronisation whatsoever?
Portfolio of CRDTs
Register
• Last-Writer Wins
• Multi-Value
Set
• Grow-Only
• 2P
• Observed-Remove
Map
• Set of Registers
Counter
• Unlimited
• Non-negative
Graphs
• Directed
• Monotonic DAG
• Edit graph
Sequence
• Edit sequence
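As one concrete entry from this portfolio, a Last-Writer-Wins register can be sketched in Python. This is a minimal state-based sketch, not the authors' code: the class name `LWWRegister`, the `(timestamp, replica_id)` stamp used to break ties, and the method names are all assumptions made for illustration, and stamps are assumed to be unique per write.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class LWWRegister:
    """State-based Last-Writer-Wins register (illustrative sketch)."""
    value: Any = None
    # (timestamp, replica_id): the replica id breaks ties between equal timestamps.
    stamp: Tuple[int, str] = (0, "")

    def assign(self, value, timestamp, replica_id):
        """Local update: the new value wins only if its stamp is larger."""
        return self.merge(LWWRegister(value, (timestamp, replica_id)))

    def merge(self, other):
        """Keep the state with the larger stamp (commutative, idempotent)."""
        return self if self.stamp >= other.stamp else other


a = LWWRegister().assign("x", 1, "s1")
b = LWWRegister().assign("y", 2, "s2")
merged_ab = a.merge(b)
merged_ba = b.merge(a)
```

Both merge orders yield the same register, which is the property every type in the portfolio has to provide.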
Set design alternatives
Sequential specification:
• {true} add(e) {e ∈ S}
• {true} remove(e) {e ∉ S}
{true} add(e) || remove(e) {????}
• linearisable?
• add wins?
• remove wins?
• last writer wins?
• error state?
• linearisable: sequential order equivalent to real-time order
• Requires consensus
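The open question on this slide can be made concrete with plain Python sets. In the sketch below (helper names are illustrative only), two replicas apply the same concurrent add(e) and remove(e) in opposite orders and end in different states, so a design has to pick an outcome.

```python
def add(state, e):
    """Sequential add: postcondition e in S."""
    return state | {e}


def remove(state, e):
    """Sequential remove: postcondition e not in S."""
    return state - {e}


# Two replicas receive the same concurrent operations in opposite orders.
replica_1 = remove(add(frozenset(), "a"), "a")   # add, then remove
replica_2 = add(remove(frozenset(), "a"), "a")   # remove, then add

diverged = replica_1 != replica_2
```

Because the two operations do not commute on a plain set, the replicas diverge unless a rule such as "add wins" or "remove wins" is built into the data type.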
Observed-Remove Set
• Payload: added, removed (element, unique-token)
• add(e) = A ≔ A ∪ {(e, α)}
• Remove: all unique elements observed
• remove(e) = R ≔ R ∪ {(e, –) ∈ A}
• lookup(e) = ∃ (e, –) ∈ A \ R
• merge(S, S') = (A ∪ A', R ∪ R')
• {true} add(e) || remove(e) {e ∈ S}
• Can never remove more tokens than exist
• Op order ⇒ removed tokens have been previously added
[Figure: replicas s1, s2, s3 start from {}. Two concurrent add(a) operations create tokens aα and aβ; rmv(a) removes only the observed token aα; after merges (M) the replicas hold {aβ}.]
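The payload and operations above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ORSet` and the use of `uuid4` to produce unique tokens are choices made here.

```python
import uuid


class ORSet:
    """State-based Observed-Remove Set (sketch of the slide's payload)."""

    def __init__(self):
        self.added = set()     # A: (element, token) pairs
        self.removed = set()   # R: pairs that were observed at remove time

    def add(self, e):
        self.added.add((e, uuid.uuid4().hex))   # fresh unique token

    def remove(self, e):
        # Remove only the tokens this replica has observed for e.
        self.removed |= {(x, t) for (x, t) in self.added if x == e}

    def lookup(self, e):
        return any(x == e for (x, _) in self.added - self.removed)

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed


s1, s2 = ORSet(), ORSet()
s1.add("a")        # token alpha
s2.add("a")        # token beta, concurrent with the add at s1
s1.remove("a")     # covers alpha only; beta has not been observed
s1.merge(s2)
s2.merge(s1)
```

After the merges both replicas still contain "a" through the token that the remove never observed, which matches {true} add(e) || remove(e) {e ∈ S}.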
OR-Set
Set: solves Dynamo Shopping Cart anomaly
Optimisations
• Just mark tombstones
• Garbage-collect tombstones
• Operation-based approach
Graph design alternatives
Graph = (V, E) where E ⊆ V × V
Sequential specification:
• {v,v' ∈ V} addEdge(v,v') {…}
• {∄(v,v') ∈ E} removeVertex(v) {…}
Concurrent: removeVertex(v') || addEdge(v,v')
• linearisable?
• addEdge wins?
• removeVertex wins?
• etc.
• for our Web Search Engine application, removeVertex wins
• Do not check precondition at add/remove
Graph
Payload = OR-Set V, OR-Set E
Updates add/remove to V, E
• addVertex(v), removeVertex(v)
• addEdge(v,v'), removeEdge(v,v')
Do not enforce invariant a priori
• lookupEdge(v,v') = (v,v') ∈ E ∧ v ∈ V ∧ v' ∈ V
removeVertex(v') || addEdge(v,v')
• remove wins
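A graph built from two OR-Sets can be sketched as follows. This is an illustrative Python sketch only: the compact `ORSet` helper, the `(replica, counter)` tokens and all method names are assumptions, chosen to show how checking the invariant at lookup time makes a concurrent removeVertex win over addEdge.

```python
import itertools


class ORSet:
    """Minimal state-based OR-Set; tokens are (replica, counter) pairs."""

    def __init__(self, replica):
        self.replica, self.counter = replica, itertools.count()
        self.added, self.removed = set(), set()

    def add(self, e):
        self.added.add((e, (self.replica, next(self.counter))))

    def remove(self, e):
        self.removed |= {p for p in self.added if p[0] == e}

    def contains(self, e):
        return any(p[0] == e for p in self.added - self.removed)

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed


class Graph:
    def __init__(self, replica):
        self.V, self.E = ORSet(replica), ORSet(replica)

    def add_vertex(self, v):
        self.V.add(v)

    def remove_vertex(self, v):
        self.V.remove(v)

    def add_edge(self, v, w):
        self.E.add((v, w))          # precondition deliberately not checked

    def lookup_edge(self, v, w):
        # The invariant E ⊆ V × V is applied when reading, not when updating.
        return self.E.contains((v, w)) and self.V.contains(v) and self.V.contains(w)

    def merge(self, other):
        self.V.merge(other.V)
        self.E.merge(other.E)


g1, g2 = Graph("s1"), Graph("s2")
g1.add_vertex("v")
g1.add_vertex("w")
g2.merge(g1)
g1.remove_vertex("w")               # concurrent with the add_edge at g2
g2.add_edge("v", "w")
seen_before_merge = g2.lookup_edge("v", "w")
g1.merge(g2)
g2.merge(g1)
```

The edge is visible at g2 before the merge and hidden at both replicas afterwards, because its endpoint "w" is no longer in V.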
Co-operative editing
• Deep (internal) view: DAG representing insert order
• Surface view: summarises total order
• strong order takes precedence over weak
• add: between already-ordered elements
• begin and end sentinels
• ensure consistent ordering at all replicas
{x, z ∈ graph_i ∧ x < z} add-between_i(x, y, z) {y ∈ graph_i ∧ x < y < z}
• Local constraint implies globally acyclic
• This spec is implemented directly by WOOT
• Clean
[Figure: deep view, a DAG between sentinels ⊢ and ⊣ with nodes Iα, Nβ, Rγ, Iδ, Aε; surface views I N R I A (⊢ α β γ δ ε ⊣) and I N R A (⊢ α β γ ε ⊣).]
Continuum
Assign each element a unique real number
• position
Real numbers not appropriate
• approximate by tree
[Figure: elements placed between sentinels ⊢ and ⊣ at positions L -1.01, ’ -1.00, I 0, N 100, R 100.25, I 100.5, A 101.]
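The idea of giving each element a position in a dense order can be sketched with exact rationals. This is only an illustration: the names `Sequence`, `position_between`, and the sentinel positions 0 and 1 are assumptions, and the sketch ignores concurrent inserts into the same gap, which would pick the same midpoint and therefore need an extra disambiguator such as a site identifier.

```python
from fractions import Fraction

BEGIN, END = Fraction(0), Fraction(1)   # sentinel positions (assumed values)


def position_between(left, right):
    """Return a position strictly between two existing positions."""
    return (left + right) / 2


class Sequence:
    def __init__(self):
        self.atoms = {}   # position -> character

    def insert_between(self, left, right, atom):
        pos = position_between(left, right)
        self.atoms[pos] = atom
        return pos        # the position never changes afterwards

    def text(self):
        return "".join(self.atoms[p] for p in sorted(self.atoms))


doc = Sequence()
i = doc.insert_between(BEGIN, END, "I")
a = doc.insert_between(i, END, "A")
n = doc.insert_between(i, a, "N")
r = doc.insert_between(n, a, "R")
doc.insert_between(r, a, "I")
result = doc.text()
```

Each insert roughly doubles the denominator, so identifiers grow without bound in heavily edited regions; this is one way to read "real numbers not appropriate" and the move to a tree of identifiers.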
Treedoc binary tree
Binary naming tree:
• compact, self-adjusting
• Logarithmic properties
add appends leaf ⇒ non-destructive, IDs don't change
remove: tombstone, IDs don't change
• logarithmic: assuming tree
• Compact, low-arity tree
• In the following slides, will
• Low arity: binary, quaternary
[Figure: binary tree with edges labelled 0 and 1 and nodes L, ’, I, N, R, I, A; the tree is read in infix order.]
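A binary naming tree can be sketched in Python with identifiers as tuples of bits read in infix order. This is a single-replica sketch under stated assumptions: the names `Treedoc`, `infix_key` and `new_id_between` are invented here, and the disambiguators needed for concurrent inserts at the same place are left out.

```python
def infix_key(path):
    """Sort key for infix order: left subtree < node < right subtree."""
    return tuple(-1 if bit == 0 else 1 for bit in path) + (0,)


def is_ancestor(p, q):
    return len(p) < len(q) and q[:len(p)] == p


def new_id_between(left, right):
    """Pick a free leaf between two adjacent identifiers (None = no neighbour)."""
    if left is None and right is None:
        return ()                      # root
    if left is None:
        return right + (0,)            # left child of the first node
    if right is None or not is_ancestor(left, right):
        return left + (1,)             # right child of the left neighbour
    return right + (0,)                # left child of the right neighbour


class Treedoc:
    def __init__(self):
        self.nodes = {}                # path -> [atom, removed?]

    def ordered_paths(self):
        return sorted(self.nodes, key=infix_key)

    def insert(self, index, atom):
        """Insert before the node at `index` (tombstones are counted)."""
        paths = self.ordered_paths()
        left = paths[index - 1] if index > 0 else None
        right = paths[index] if index < len(paths) else None
        path = new_id_between(left, right)
        self.nodes[path] = [atom, False]
        return path

    def remove(self, path):
        self.nodes[path][1] = True     # tombstone: the node keeps its ID

    def text(self):
        return "".join(self.nodes[p][0] for p in self.ordered_paths()
                       if not self.nodes[p][1])


doc = Treedoc()
for position, char in enumerate("INRA"):
    doc.insert(position, char)
before = dict((p, v[0]) for p, v in doc.nodes.items())
new_i = doc.insert(3, "I")             # between R and A
first = doc.insert(0, "L")
with_l = doc.text()
doc.remove(first)
```

Existing identifiers are never rewritten by an insert or a remove, which is the property the slide calls non-destructive.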
Layered Treedoc
• Edit: Binary tree
• Concurrency: Sparse tree
[Figure: a layer labelled "binary tree" combined with a layer labelled "sparse 864--ary tree"; nodes are tagged by site (Site 22, Site 34, Site 66, Site 79).]
The theory: Two simple conditions for correctness without synchronisation
Query
Local at source replica
• Client's choice
• Example: Amazon shopping cart is replicated
• unspecified client, e.g., Web front-end
• One or more
• load-balancer, failures may direct client to different replicas
[Figure: replicas s1, s2, s3 and a client; the client issues queries s1.q(a) and s2.q(b) at the source replicas.]
State-based replication
Local at source s1.u(a), s2.u(b), …
• Compute
• Update local payload
Convergence:
• Episodically: send s_i payload
• On delivery: merge payloads m
• merge two valid states
• produce valid state
• no historical info available
• Inefficient if payload is large
[Figure: replicas s1, s2, s3. Source updates s1.u(a) and s2.u(b); payloads are then merged by s2.m(s1), s3.m(s2) and s1.m(s2).]
State-based: monotonic semi-lattice ⇒ CRDT
If
• payload type forms a semi-lattice
• updates are increasing
• merge computes Least Upper Bound
then replicas converge to LUB of last values
Example: Payload = int, merge = max
• no reference to history
• ⊔ = Least Upper Bound LUB = merge
[Figure: replicas s1, s2, s3. Source updates s1.u(a) and s2.u(b); merges s2.m(s1), s3.m(s2) and s1.m(s2).]
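The slide's example, an integer payload merged with max, can be written out in a few lines of Python. The class name `MaxInt` and the starting value 0 are assumptions for this sketch; the merge sequence follows the one shown in the slide's diagram.

```python
class MaxInt:
    """State-based CRDT: payload is an int, merge is max (the LUB)."""

    def __init__(self, value=0):
        self.value = value

    def update(self, candidate):
        # Updates are increasing: the payload never goes down.
        self.value = max(self.value, candidate)

    def merge(self, other):
        self.value = max(self.value, other.value)


s1, s2, s3 = MaxInt(), MaxInt(), MaxInt()
s1.update(3)       # s1.u(a)
s2.update(7)       # s2.u(b)
s2.merge(s1)       # s2.m(s1)
s3.merge(s2)       # s3.m(s2)
s1.merge(s2)       # s1.m(s2)
s3.merge(s2)       # a repeated delivery is harmless: merge is idempotent
```

Since max is commutative, associative and idempotent, the replicas reach the same value regardless of the order or number of merges.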
Operation-based replication
At source:
• prepare
• broadcast to all replicas
Eventually, at all replicas:
• update local replica
• push to all replicas eventually
• push small updates: more efficient than state-based
[Figure: replicas s1, s2, s3. s1.u(a) and s2.u(b) execute at source (S); a is delivered (D) as s2.u(a) and s3.u(a), b as s1.u(b) and s3.u(b).]
Op-based: commute ⇒ CRDT
If:
• (Liveness) all replicas execute all operations in delivery order
• (Safety) concurrent operations all commute
Then: replicas converge
• Delivery order ≃ ensures downstream precondition
• happened-before or weaker
[Figure: replicas s1, s2, s3. s1.u(a) and s2.u(b) execute at source (S) and are delivered (D) as s2.u(a), s3.u(a), s1.u(b), s3.u(b); s3 receives b before a.]
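The two conditions can be illustrated with a counter whose operations commute. The Python sketch below is hypothetical (class and method names are not from the slides) and adds one assumption of its own: duplicate deliveries are filtered by operation id.

```python
class OpCounter:
    """Operation-based counter: increments and decrements commute."""

    def __init__(self):
        self.value = 0
        self.delivered = set()   # ids of operations already applied

    def prepare(self, op_id, delta):
        """At source: build the message that is broadcast to all replicas."""
        return (op_id, delta)

    def deliver(self, message):
        """Downstream: apply each operation exactly once."""
        op_id, delta = message
        if op_id not in self.delivered:
            self.delivered.add(op_id)
            self.value += delta


s1, s2, s3 = OpCounter(), OpCounter(), OpCounter()
op_a = s1.prepare("s1:1", +5)
op_b = s2.prepare("s2:1", -2)

# Every replica executes every operation, but in different delivery orders.
for replica, order in ((s1, (op_a, op_b)), (s2, (op_b, op_a)), (s3, (op_b, op_a))):
    for message in order:
        replica.deliver(message)
```

All three replicas end with the same value because addition commutes, so no agreement on delivery order is needed for these concurrent operations.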
Monotonic semi-lattice ⇔ commutative
1. A state-based object can emulate an operation-based object, and vice-versa
2. State-based emulation of a CmRDT is a CvRDT
3. Operation-based emulation of a CvRDT is a CmRDT
• Systematic transformation
• Inefficient
• ⇒ Hand-crafted op-based implementation
Operation-based OR-Set
Payload: S = { (e, α), (e, β), (e', γ), … } where α, β, … unique
Operations:
• lookup(e) = ∃ α: (e, α) ∈ S
• add(e) = S ≔ S ∪ {(e, α)} where α fresh
• remove(e) =
  (at source) R = {(e, α) ∈ S}
  (downstream) S ≔ S \ R
No tombstones
{true} add(e) || remove(e) {e ∈ S}
• Set of IDs associated with a in the old state
• As observed by source!
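The source/downstream split of the operations above can be sketched in Python. This is an illustration only: the names `OpORSet`, `prepare_add`, `prepare_remove` and `apply` are assumptions, tokens come from `uuid4`, and the delivery layer is assumed to deliver an add before any remove that observed it.

```python
import uuid


class OpORSet:
    """Operation-based OR-Set: the payload holds live (element, token) pairs only."""

    def __init__(self):
        self.pairs = set()   # S

    def lookup(self, e):
        return any(x == e for (x, _) in self.pairs)

    # At source: compute the operation to broadcast.
    def prepare_add(self, e):
        return ("add", (e, uuid.uuid4().hex))            # fresh token

    def prepare_remove(self, e):
        observed = frozenset(p for p in self.pairs if p[0] == e)
        return ("remove", observed)                      # R, as observed by source

    # Downstream: executed at every replica, including the source.
    def apply(self, op):
        kind, arg = op
        if kind == "add":
            self.pairs.add(arg)
        else:
            self.pairs -= arg                            # S := S \ R, no tombstones


s1, s2 = OpORSet(), OpORSet()
first_add = s1.prepare_add("a")
s1.apply(first_add)
s2.apply(first_add)

rm = s1.prepare_remove("a")       # observes the first token only
second_add = s2.prepare_add("a")  # concurrent with the remove

for op in (rm, second_add):       # s1 sees remove, then add
    s1.apply(op)
for op in (second_add, rm):       # s2 sees add, then remove
    s2.apply(op)
```

Both replicas end with the single pair created by the concurrent add, so the element stays in the set and no removed pair needs to be remembered.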
Ongoing work
CRDTs for P2P & Cloud Computing
ConcoRDanT: ANR 2010–2013
Systematic study of conflict-free design space
• Theory and practice
• Characterise invariants
• Library of data types
Not universal
• Conflict-free vs. conflict semantics
• Move consensus off critical path, non-critical ops
CRDT + dataflow
Incremental, asynchronous processing
• Replicate, shard CRDTs near the edge
• Propagate updates ≈ dataflow
• Throttle according to QoS metrics (freshness, availability, cost, etc.)
Scale: sharded
Synchronous processing: snapshot, at centre
[Figure: dataflow pipeline with inputs Web site 1, Web site 2, Web site 3 and Whitelist; stages crawl, extract links, extract words and spam detector; data Content DB (Map URL to last 2 versions), Graph (Graph), Words (Map word to Set of URLs) and a Set.]
OR-Set + Snapshot
Read consistent snapshot
• Despite concurrent, incremental updates
Unique token = time (vector clock)
• α = Lamport (process i, counter t)
• UIDs identify snapshot version
• Snapshot: vector clock value
• Retain tombstones until not needed
lookup(e, t) = ∃ (e, i, t') ∈ A : t' > t ∧ ∄ (e, i, t') ∈ R : t' > t
Sharded OR-Set
Very large objects
• Independent shards
• Static: hash
• (Dynamic: requires consensus to rebalance)
Statically-Sharded CRDT
• Each shard is a CRDT
• Update: single shard
• No cross-object invariants
• The combination remains a CRDT
Statically Sharded OR-Set
• Combination of smaller OR-Sets
• Snapshot: clock across shards
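Static sharding can be sketched as a fixed array of independent CRDTs addressed by a deterministic hash. To keep the example short, the Python sketch below uses grow-only sets as stand-ins for the OR-Set shards; the names `GSet` and `ShardedSet`, the shard count of 4 and the use of `zlib.crc32` on string elements are all assumptions.

```python
import zlib


class GSet:
    """Grow-only set, used here as a stand-in for an OR-Set shard."""

    def __init__(self):
        self.items = set()

    def add(self, e):
        self.items.add(e)

    def merge(self, other):
        self.items |= other.items


class ShardedSet:
    def __init__(self, shard_count=4):
        self.shards = [GSet() for _ in range(shard_count)]

    def shard_for(self, e):
        # Static placement: the same hash at every replica, never rebalanced.
        return self.shards[zlib.crc32(e.encode("utf-8")) % len(self.shards)]

    def add(self, e):
        self.shard_for(e).add(e)          # an update touches a single shard

    def lookup(self, e):
        return e in self.shard_for(e).items

    def merge(self, other):
        # Shard-wise merge: each shard converges independently.
        for mine, theirs in zip(self.shards, other.shards):
            mine.merge(theirs)


r1, r2 = ShardedSet(), ShardedSet()
r1.add("http://a.example")
r2.add("http://b.example")
r1.merge(r2)
r2.merge(r1)
```

Because every update and every merge stays within one shard and there are no cross-shard invariants, the combination converges whenever each shard does.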
Take aways
Principled approach
• Strong Eventual Consistency
Two sufficient conditions:
• State: monotonic semi-lattice
• Operation: commutativity
Useful CRDTs
• Register, Counter, Set, Map (KVS), Graph, Monotonic DAG, Sequence
Future work
• Snapshot, sharding, dataflow
• A wee bit of synchronisation