CS 347 Notes06 1
CS 347: Parallel and Distributed Data Management
Notes06: Reliable Distributed Database Management
Hector Garcia-Molina
Reliable distributed database management
• Reliability
• Failure models
• Scenarios
Reliability
• Correctness
  – Serializability
  – Atomicity
  – Persistence
• Availability
Types of failures
• Processor failures
  – Halt, delay, restart, berserk, ...
• Storage failures
  – Volatile, non-volatile, atomic write, transient errors, spontaneous failures
• Network failures
  – Lost message, out-of-order messages, partitions, bounded delay
More types of failures
• Malevolent failures
• Multiple failures
• Detectable failures
Failure models
• Cannot protect against everything
• Unlikely failures (e.g., flooding in the Sahara)
• Failures too expensive to protect against (e.g., earthquake)
• Failures we know how to protect against (e.g., with message sequence numbers; stable storage)
Failure model:
[Figure: events classified as desired or undesired, and as expected or unexpected]
Node models
(1) Fail-stop nodes
[Figure: timeline of a node: perfect, halted, recovery, perfect]
• Volatile memory lost
• Stable storage ok
Node models
(2) Byzantine nodes
[Figure: nodes A, B, C; timeline of a node: perfect, arbitrary failure, recovery, perfect]
At any given time, at most some fraction f of nodes failed (typically f < 1/2 or f < 1/3)
Network models
(1) Reliable network
  - In-order messages
  - No spontaneous messages
  - Timeout TD
I.e., no lost messages, except for node failures
If no ack in TD sec. -> destination down (not paused)
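A minimal Python sketch of the timeout rule for the reliable network model. The names `TD_SECONDS` and `destination_is_down` and the value chosen for TD are assumptions made for illustration; timestamps are plain numbers so the example runs without a network.

```python
TD_SECONDS = 2.0  # assumed value for the timeout TD


def destination_is_down(sent_at, acked, now, td=TD_SECONDS):
    """Reliable-network rule: no ack within TD seconds means the destination is down."""
    if acked:
        return False           # an ack arrived, so the destination is up
    return now - sent_at > td  # still waiting: declare it down only after TD elapses
```

Under this model a sender never has to guess whether the destination is merely slow: once TD has passed without an ack, the node has failed.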
Variation of reliable net:
• Persistent messages
  – If destination down, net will eventually deliver message
  – Simplifies node recovery, but leads to inefficiencies (hides too much)
  – Not considered here
Network models
(2) Partitionable network
  - In-order messages
  - No spontaneous messages
  - No timeout; nodes can have different views of failures
Scenarios
• Reliable network
  – Fail-stop nodes
    – No data replication (1)
    – Data replication (2)
• Partitionable network
  – Fail-stop nodes (3)
No Data Replication
• Reliable network, fail-stop nodes
• Basic idea: node P controls X
[Figure: node P, item X, and the network]
- Single control point simplifies concurrency control, recovery
- Not an availability hit: if P down, X unavailable too!
“P controls X” means
  - P does concurrency control for X
  - P does recovery for X
Say transaction T wants to access X:
PT is the process that represents T at this node
[Figure: PT sends a request (req) to the local DBMS, which holds the lock manager, the log, and X]
Process models
(A) Cohorts
[Figure: USER and cohort processes T1, T2, T3, each with its own local DBMS; arrows denote spawn process, communication, and data access]
Process models
(B) Transaction servers (manager)
[Figure: USER and three nodes, each with a transaction manager (Trans MGR) and a local DBMS]
• Cohorts: application code responsible for remote access
• Transaction manager: “system” handles distribution, remote access
Distributed commit problem
[Figure: transaction T runs actions a1, a2 at one node, a3 at another, and a4, a5 at a third]
Centralized two-phase commit
Coordinator (states I, W, C, A)
  I -> W: go / exec*
  W -> C: ok* / commit*
  W -> A: nok / abort*
Participant (states I, W, C, A)
  I -> W: exec / ok
  I -> A: exec / nok
  W -> C: commit / -
  W -> A: abort / -
• Notation: incoming message / outgoing message (* = everyone)
• When participant enters “W” state:
  – it must have acquired all resources
  – it can only abort or commit if so instructed by a coordinator
• Coordinator only enters “C” state if all participants are in “W”, i.e., it is certain that all will eventually commit
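The failure-free flow of centralized two-phase commit can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: there are no messages, logs, or failures, and the function name `two_phase_commit` and the `votes` dictionary are hypothetical.

```python
def two_phase_commit(votes):
    """votes maps participant name -> True (replies ok) or False (replies nok).

    Returns (coordinator_state, participant_states) after a failure-free run.
    """
    # Phase 1: coordinator moves I -> W and sends exec*; each participant votes.
    participants = {}
    for name, can_commit in votes.items():
        participants[name] = "W" if can_commit else "A"  # exec / ok, or exec / nok

    # Phase 2: coordinator enters C only if every participant is in W.
    if all(state == "W" for state in participants.values()):
        coordinator = "C"                                # ok* / commit*
        participants = {name: "C" for name in participants}
    else:
        coordinator = "A"                                # nok / abort*
        participants = {name: "A" for name in participants}
    return coordinator, participants
```

For example, `two_phase_commit({"p1": True, "p2": False})` ends with the coordinator and both participants in state A, since a single nok forces a global abort.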
Handling node failures
• Coordinator and participant logs are used to reconstruct state before failure
-> Example: after participant fails, its log contains:
   [T1, X, undo/redo info] [T1, Y, info] ... [T1, “W” state]
At recovery:
• T1 is in “W” state
• Obtain X, Y write locks (no read locks!)
• Wait for message from coordinator (or ask coordinator for outcome)
Other examples:
• No “W” record on log -> abort T1
• See “C” record on log -> finish T1
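The recovery cases for a participant can be written as one decision function. This is a sketch, not a full recovery manager: the log layout (a list of `(transaction, record)` pairs) and the name `recover_participant` are assumptions made for illustration.

```python
def recover_participant(log, txn):
    """Decide what a recovering participant does with txn, based on its log.

    log is a list of (txn, record) pairs in write order; a record is "W", "C",
    "A", or an undo/redo entry such as ("X", old_value, new_value).
    """
    records = [record for t, record in log if t == txn]
    if "C" in records:
        return "finish"  # commit was decided: finish (redo) the transaction
    if "A" in records or "W" not in records:
        return "abort"   # never reached W, so aborting unilaterally is safe
    return "wait"        # in doubt: re-acquire write locks, wait for or ask the coordinator
```

With the log `[T1 X] [T1 Y] [T1 "W"]` the function returns `"wait"`; without the “W” record it returns `"abort"`.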
• Add timeouts to cope with messages lost during crashes
• Add finish (“F”) state for coordinator – all done, can forget outcome
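Coordinator (states I, W, C, A, F)
t = timeout; cping = coord. ping
  I -> W: go / exec*
  W -> W: ping / -
  W -> C: ok* / commit*
  W -> A: nok / abort*
  W -> A: t / abort*
  C -> C: t / cping
  C -> C: ping / commit
  A -> A: t / cping
  A -> A: ping / abort
  C -> F: c-ok* / -
  A -> F: nok* / -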
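Participant (states I, W, C, A)
  I -> W: exec / ok
  I -> A: exec / nok
  W -> W: cping, t / ping
  W -> C: commit / c-ok
  W -> A: abort / nok
  C -> C: cping / done
  A -> A: cping / done
Note in figure: equivalent to finish state
“done” message counts as either c-ok or nok for coordinator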
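The participant's behavior with timeouts and pings can be sketched as a single step function. The function name `participant_step`, the `can_commit` flag, and the choice to ignore any message not shown in the diagram are assumptions for illustration.

```python
def participant_step(state, msg, can_commit=True):
    """Return (next_state, reply) for a participant; msg "t" is a local timeout."""
    if state == "I" and msg == "exec":
        return ("W", "ok") if can_commit else ("A", "nok")
    if state == "W":
        if msg == "commit":
            return ("C", "c-ok")
        if msg == "abort":
            return ("A", "nok")
        if msg in ("cping", "t"):
            return ("W", "ping")  # in doubt: ask the coordinator for the outcome
    if state in ("C", "A") and msg == "cping":
        return (state, "done")    # outcome already applied here
    return (state, None)          # assumption: any other message is ignored
```

A participant stuck in W keeps replying `ping` until the coordinator tells it the outcome; once in C or A it answers every coordinator ping with `done`.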
Presumed abort protocol
• “F” and “A” states combined in coordinator
• Saves persistent space (forget quicker)
• Presumed commit is analogous
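Presumed abort: coordinator (participant unchanged)
Coordinator (states I, W, C, A/F)
  I -> W: go / exec*
  W -> W: ping / -
  W -> C: ok* / commit*
  W -> A/F: nok, t / abort*
  C -> C: t / cping
  C -> C: ping / commit
  C -> A/F: c-ok* / -
  A/F -> A/F: ping / abort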
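Because “A” and “F” are combined under presumed abort, a coordinator that has forgotten a transaction answers a participant's ping the same way as one that aborted it. A small Python sketch of that reply rule follows; the dictionary `coordinator_states` and the name `answer_ping` are hypothetical.

```python
def answer_ping(coordinator_states, txn):
    """Reply to a participant's ping under presumed abort.

    coordinator_states maps txn -> "W" or "C"; aborted, finished, and forgotten
    transactions are simply absent (the combined A/F state).
    """
    state = coordinator_states.get(txn)
    if state == "C":
        return "commit"
    if state == "W":
        return None   # no decision yet: send nothing
    return "abort"    # A/F or no record at all: presume abort
```

The space saving comes from the last line: nothing has to be kept on stable storage for a transaction whose answer is abort.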
Remember: all state transitions must be logged
Example: tracking who has sent “OK” msgs
Log at coord: [T1 start, part={a,b}] [T1 OK RCV from a] ...
• After failure, we know we are still waiting for OK from node b
• Alternative: do not log receipts of “OK”s; after a failure, abort T1
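A sketch of how a recovering coordinator could work out which “OK” messages are still missing from its log. The record layout (tuples tagged `"start"` and `"ok_rcv"`) and the name `pending_oks` are assumptions for illustration.

```python
def pending_oks(log, txn):
    """Return the participants of txn whose OK has not been logged yet."""
    participants, received = set(), set()
    for record in log:
        if record[0] != txn:
            continue
        if record[1] == "start":
            participants = set(record[2])  # e.g. ("T1", "start", ["a", "b"])
        elif record[1] == "ok_rcv":
            received.add(record[2])        # e.g. ("T1", "ok_rcv", "a")
    return participants - received
```

For the log in the example, `pending_oks(log, "T1")` yields `{"b"}`: the coordinator is still waiting for node b.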
Example: logging receipt of C-OK messages
• If logged, can recover state
• If not logged:
  – resend commit*
  – participants reply “done” if duplicate
2PC is blocking
Sample scenario:
[Figure: Coord and participants P1, P2, P3, P4; P2, P3, and P4 are in state W]
Case I: P1 “W”; coordinator sent commits -> P1 “C”
Case II: P1 NOK -> P1 “A”
=> P2, P3, P4 (surviving participants) cannot safely abort or commit transaction
[Figure: coord and P1 are down; P2, P3, P4 are in state W]
Variants of 2PC
• Linear
[Figure: chain of nodes including Coord, with ok messages and commit messages passed from node to node]
• Hierarchical
Variants of 2PC
• Distributed
  – Nodes broadcast all messages
  – Every node knows when to commit
3PC = non-blocking commit
• Assume: failed node is down forever
• Key idea: before committing, coordinator tells participants everyone is ok
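3PC
Coordinator (states I, W, P, C, A)
  I -> W: go / exec*
  W -> P: ok* / pre*
  W -> A: nok / abort*
  P -> C: ack** / commit*
Participant (states I, W, P, C, A)
  I -> W: exec / ok
  I -> A: exec / nok
  W -> P: pre / ack
  W -> A: abort / -
  P -> C: commit / -
** means all non-failed nodes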
3PC recovery rules: termination protocol
• Survivors try to complete transaction, based on their current states
• Goal:
  – If dead nodes committed or aborted, then survivors should not contradict!
  – Else, survivors can do as they please...
• Let {S1, S2, …, Sn} be survivor sites
• If one or more Si = COMMIT -> COMMIT T
• If one or more Si = ABORT -> ABORT T
• If one or more Si = PREPARE -> T could not have aborted -> COMMIT T
• If no Si = PREPARE (or COMMIT) -> T could not have committed -> ABORT T
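The termination rules translate directly into a small decision function. This is a sketch: the function name `terminate_3pc` and the one-letter state codes ("I", "W", "P", "C", "A") are assumptions, and the rules are checked in the order they are listed.

```python
def terminate_3pc(survivor_states):
    """Decide the outcome of T from the states of the surviving sites."""
    states = set(survivor_states)
    if "C" in states:
        return "commit"  # someone already committed
    if "A" in states:
        return "abort"   # someone already aborted
    if "P" in states:
        return "commit"  # T could not have aborted
    return "abort"       # no P or C among survivors: T could not have committed
```

For instance, survivors in states P, W, W decide to commit, while survivors in states I, W, W decide to abort.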
Examples (? = failed node, state unknown):
  Example: node states P, W, W, ?, ?
  Example: node states I, W, W, ?, ?
  Example: node states P, P, C, ?, ?
  Example: node states P, W, A, ?, ?
Once survivors make decision, they must select new coordinator to continue 3PC
[Figure: states of three survivors over time, after they decide to commit]
          Time1  Time2  Time3  Time4
  node    P      P      C      C
  node    W      P      C      C
  node    W      P      P      C
Note: when survivors continue 3PC, failed nodes do not count
E.g., “ack**” when acks received from all non-failed nodes
[Figure: transition P -> C on ack** / commit*]
Note: 3PC unsafe with partitions!
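[Figure: a partition separates nodes in states W, W, W from nodes in states P, P; one side decides abort, the other commit]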
Node recovery:
• After node N recovers from failure:
  – do not participate in termination protocol (why?)
[Figure: node states W, W, W, ?, P, A; “later on...”]
Node recovery:
• After node N recovers from failure:
  – do not participate in termination protocol (why?)
  – wait until it hears commit or abort decision from operational node
[Figure: recovered node (?) pings other nodes; a reply carries “C” (or “A”)]
• Waiting for commit/abort decision from other node is ok, unless all fail:
[Figure: five nodes, all failed (?)]
Two options for all-failed problem:
(A) Wait for all to recover
(B) Majority commit
Option A
• Recovering node waits for either:
  (1) commit/abort outcome for T from other node
  (2) all nodes that participated in T are up and recovering: then 3PC can continue (no danger that a failed node could have aborted or committed)
Option B
• Want a “gang” of failed but recovered nodes to be able to terminate a transaction even when rest are failed...
[Figure: nodes P1, P2, P3, P4, P5]
Option B
• Nodes are assigned votes, total is V
• Majority is ⌈(V+1)/2⌉, e.g., V=5: Maj=3; V=6: Maj=4
• To make state transitions, coordinator requires messages from nodes with a majority of votes
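The majority threshold is easy to check numerically. A short Python sketch follows; the function names `majority` and `has_majority` are assumptions, and integer division is used because `V // 2 + 1` equals ⌈(V+1)/2⌉ for every non-negative integer V.

```python
def majority(total_votes):
    """Smallest number of votes that is more than half of total_votes."""
    return total_votes // 2 + 1


def has_majority(votes_held, total_votes):
    """True if a group holding votes_held votes may make a state transition."""
    return votes_held >= majority(total_votes)
```

`majority(5)` is 3 and `majority(6)` is 4, matching the values quoted for V=5 and V=6. Two groups that each hold a majority must share at least one vote, which is what lets a later group learn of an earlier decision.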
Example (1):
[Figure: Coord, P1, P2, P3, P4; P2, P3, and P4 are in state W]
• Nodes P2, P3, P4 enter “W” state and fail
• When they recover, coord. and P1 are down
• Each node has 1 vote, V=5, Maj=3
• Since P2, P3, P4 have majority, they know coord. could not have gone to “P” without at least one of their votes
• Therefore, T can be aborted!
Example (2):
[Figure: Coord, P1, P2, P3, P4; P3 is in state “P”, P4 is in state “W”]
• Each node has 1 vote; V=5, Maj=3
• Nodes fail after entering states shown; P3, P4 recover
• Termination rule says we can try to commit, but P3, P4 do not have enough votes, so they do nothing!
• P3, P4 doing nothing is good because later on, coord., P1, P2 could abort T
Summary: Majority rule ensures that any decision (e.g., preparing, committing) will be known to any future group making a decision
[Figure: groups of nodes labeled 1 and 2, making decision #1 and decision #2]
Important Detail for Majority 3PC
• Example:
[Figure: node states W, W, W, ?, P, A; later, P -> C]
Need “Prepare To Abort” State
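Coordinator (states I, W, PC, PA, C, A)
  I -> W: go / exec*
  W -> PC: ok* / preC*
  W -> PA: nok / preA*
  PC -> C: ackC** / commit*
  PA -> A: ackA** / abort*
Participant (states I, W, PC, PA, C, A)
  I -> W: exec / ok
  I -> A: exec / nok
  W -> PC: preC / ackC
  W -> PA: preA / ackA
  PC -> C: commit / -
  PA -> A: abort / -
  (figure also shows a second preA / ackA transition)
** means participants with majority votes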
Example Revisited
[Figure: node states W, W, W, ?, PC, PA; later, PC -> C]
OK to commit since transaction could not have aborted
Example Revisited - II
[Figure: node states W, W, W, ?, PC, and several PA]
No decision: transaction could have aborted or could have committed... Block!
3PC with Majority Voting
• If survivors have majority and all states W -> try to abort
• If survivors have majority and states in {W, PC, C} -> try to commit
• If survivors have majority and states in {W, PA, A} -> try to abort
• Otherwise block
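These four rules can be sketched as one Python function. The function name, the dictionaries `states` and `votes`, and the returned strings are assumptions made for illustration; only the decision is modeled, not the message exchange that follows it.

```python
def terminate_majority_3pc(states, votes, total_votes):
    """Decide what a group of communicating nodes may attempt.

    states maps each node in the group to its state ("W", "PC", "PA", "C", "A");
    votes maps every node to its vote count; total_votes is V.
    """
    held = sum(votes[node] for node in states)
    if held < total_votes // 2 + 1:
        return "block"                 # no majority: do nothing
    seen = set(states.values())
    if seen == {"W"}:
        return "try abort"
    if seen <= {"W", "PC", "C"}:
        return "try commit"
    if seen <= {"W", "PA", "A"}:
        return "try abort"
    return "block"                     # e.g. both PC and PA present
```

With one vote per node and V=5, the group {P2: W, P3: W, P4: W} may try to abort, while the two-node group {P3: PC, P4: W} blocks because it holds only two votes.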
Comparison
Option A: only nodes that have not failed participate in 3PC
• Any size group can terminate(even one node)
• If all nodes fail, must wait for all to recover
Comparison
Option B: Majority voting
• A group of failed+recovering nodes can terminate transaction (with majority of votes)
• Need majority for every commit -> blocking protocol!
Reminder
• When node recovers, it uses its log in a normal fashion to determine status of transactions:
  – if commit found in log -> redo if necessary
  – if abort found (or no “W” record) -> rollback if necessary
Reminder - Continued
  – if in “W” state (or “P” state):
    • reclaim locks held by T before crash
    • try to terminate T (with other nodes)
  – after locks claimed for “in doubt” transactions, start normal processing
Final note
• If nodes use 2P locking, global deadlocks possible
[Figure: two local WFGs over T1 and T2; each local WFG has no cycles]
• Need to “combine” WFGs to discover global deadlock
[Figure: the local WFGs over T1 and T2 are combined, e.g., at a central detection node]
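A sketch of what a central detection node could do: merge the local wait-for graphs and look for a cycle. The representation (a dictionary mapping each waiting transaction to the set of transactions it waits for) and the names `combine` and `has_cycle` are assumptions for illustration.

```python
def combine(*local_wfgs):
    """Union the edges of several local wait-for graphs."""
    merged = {}
    for wfg in local_wfgs:
        for waiter, holders in wfg.items():
            merged.setdefault(waiter, set()).update(holders)
    return merged


def has_cycle(wfg):
    """Depth-first search for a cycle in a wait-for graph."""
    WHITE, GREY, BLACK = 0, 1, 2       # unvisited, on current path, finished
    color = {}

    def visit(txn):
        color[txn] = GREY
        for other in wfg.get(txn, ()):
            c = color.get(other, WHITE)
            if c == GREY or (c == WHITE and visit(other)):
                return True            # reached a transaction already on the path
        color[txn] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in list(wfg))
```

If one site reports `{"T1": {"T2"}}` and another reports `{"T2": {"T1"}}`, neither local graph has a cycle, but the combined graph does.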
Problem: False deadlocks
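[Figure: WFGs over T1 and T2 at Time1, Time2, and Time3; info is sent to the central site at different times, and the graph assembled at the central site shows a deadlock between T1 and T2]
Note that T2 is not 2PL; it releases lock, then asks for another lock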
Exercise
• Assume all waits are due to transaction lock requests
• Assume transactions well formed and 2PL; scheduler legal
• Show that false deadlocks are not possible
• Many deadlock solutions
  – Distributed vs. centralized
  – Detection vs. prevention
    • timeouts
    • wait-die
    • wound-wait
• Covered in CS245
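As a brief reminder of the two timestamp-based prevention schemes named in the list, here is a sketch in Python. It assumes each transaction carries a start timestamp and that a smaller timestamp means an older transaction; the function names and return strings are hypothetical.

```python
def wait_die(requester_ts, holder_ts):
    """Wait-die: an older requester waits; a younger requester dies (aborts)."""
    return "wait" if requester_ts < holder_ts else "die"


def wound_wait(requester_ts, holder_ts):
    """Wound-wait: an older requester wounds (aborts) the holder; a younger one waits."""
    return "wound holder" if requester_ts < holder_ts else "wait"
```

In both schemes waits only ever go in one direction with respect to age, so no cycle of waiting transactions can form, locally or globally.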