1
Paxos Commit
Jim Gray, Leslie Lamport
Microsoft Research
Preview of a paper in preparation
Presented Microsoft Research Techfest 3 March 2004, Redmond, WA
Article MSR-TR-2003-96
Consensus on Transaction Commit
http://research.microsoft.com/research/pubs/view.aspx?tr_id=701
2
Commit is Common
• Marriage ceremony: Do you? I do. I now pronounce you…
• Theater: Ready on the set? Ready! Action!
• Contract law: Offer. Signature. Deal / lawsuit.
3
The Common Picture
[Diagram: the director asks each actor "Ready?"; each actor answers "Ready"; the director then calls "Action!" to all of them.]
4
All or Nothing: If any actor says no the deal is off.
[Diagram: the director asks each actor "Ready?"; one actor answers "No!" (or times out) while the others answer "Ready"; the director then tells every actor "No deal!".]
5
The Database Version
TM: Transaction Manager
RM: Resource Manager
[Diagram: the director becomes the TM and the actors become RMs. Participants shown: client, TM, RM. Messages shown: Ready?, Ready, Commit.]
6
Two Phase Commit
• N Resource Managers (RMs)
• Want all RMs to commit or all abort.
• Coordinated by Transaction Manager (TM)
• TM sends Prepare, Commit-Abort
• RM responds Prepared, Aborted
• 3N+1 messages
• N+1 stable writes
• Delay
– 4 message delays
– 2 stable-write delays
• Blocking: if TM fails, Commit-Abort stalls
[State diagrams. Transaction Manager states: working, committed, aborted. Resource Manager states: working, prepared, committed, aborted.]
[Message flow shown: Request Commit, Prepare, Prepared, Commit.]
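The two phases on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not the slide's own code: the `ResourceManager` class, its `can_prepare` flag, and `two_phase_commit` are hypothetical names, and direct method calls stand in for messages and stable writes.

```python
from enum import Enum


class RMState(Enum):
    WORKING = "working"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"


class ResourceManager:
    """Hypothetical in-memory RM; `can_prepare` simulates its vote."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = RMState.WORKING

    def prepare(self):
        # Phase 1: move to prepared (a stable write in a real system) or abort.
        self.state = RMState.PREPARED if self.can_prepare else RMState.ABORTED
        return self.state is RMState.PREPARED

    def finish(self, commit):
        # Phase 2: apply the TM's decision.
        self.state = RMState.COMMITTED if commit else RMState.ABORTED


def two_phase_commit(rms):
    """Play the TM role for both phases; return True if committed."""
    votes = [rm.prepare() for rm in rms]  # TM sends Prepare to every RM
    decision = all(votes)                 # commit only if every RM prepared
    for rm in rms:                        # TM sends Commit or Abort to every RM
        rm.finish(decision)
    return decision
```

The sketch has no failure handling. If the TM stopped between the two loops, every prepared RM would be left waiting, which is the blocking case named in the last bullet.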
7
The Problem With 2PC
• Atomicity – all or nothing
• Consistency – does right thing
• Isolation – no concurrency anomalies
• Durability / Reliability – state survives failures
• Availability: always up
Blocks if TM fails
8
Problem Statement
• ACID Transactions make error handling easy.
• One fault can make 2-Phase Commit block.
• Goal: ACID and Available. Non-blocking despite F faults.
9
Fault-Tolerant Two Phase Commit
If the 2PC Transaction Manager (TM) fails, the transaction blocks.
Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit)
[Diagram: client, two TMs, and RMs exchanging Request Commit, Prepare, and Prepared messages.]
10
Fault-Tolerant Two Phase Commit
If the 2PC Transaction Manager (TM) fails, the transaction blocks.
Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit)
But… What if…?
[Diagram: client, two TMs, and RMs exchanging Request Commit, Prepare, and Prepared messages; both commit and abort messages appear.]
Inconsistent! Now What?
The complexity is a mess.
11
Fault Tolerant 2PC
• Several workarounds proposed in database community:
• Often called "3-phase" or "non-blocking" commit.
• None with complete algorithm and correctness proof.
12
“Reaching Agreement in the Presence of Faults”
• 25 years of theory
• Now called the Consensus problem
• N processes want to agree on a value, even if F of them have failed.
Shostak, Pease, & Lamport, JACM, 1980
13
Consensus
• collects proposed values
• Picks one proposed value
• remembers it forever
[Diagram: clients send "Propose X" and "Propose W" to the consensus box; the box reports "W Chosen" to every client.]
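The three properties of the consensus box can be sketched as a single-process stand-in. The `ConsensusBox` class is hypothetical, and "first proposal wins" is an assumption made only for this sketch. A real consensus service may choose any proposed value and must survive failures.

```python
class ConsensusBox:
    """Hypothetical single-process stand-in for a consensus service."""

    _UNSET = object()  # sentinel meaning "nothing chosen yet"

    def __init__(self):
        self._chosen = self._UNSET

    def propose(self, value):
        # The first proposal to arrive is chosen; later ones cannot change it.
        if self._chosen is self._UNSET:
            self._chosen = value
        return self._chosen

    def chosen(self):
        # Returns None while nothing has been chosen yet.
        return None if self._chosen is self._UNSET else self._chosen
```

For example, after `box.propose("W")` a later `box.propose("X")` still returns `"W"`, matching the "W Chosen" replies in the diagram.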
14
Consensus for Commit
The Obvious Approach
• Get consensus on TM’s decision.
• TM just learns consensus value.
• TM is “stateless”
[Diagram: client, two TMs, RMs, and a consensus box. Messages shown: Request Commit, Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.]
15
Consensus for Commit
The Paxos Commit Approach
• Get consensus on each RM’s choice.
• TM just combines consensus values.
• TM is “stateless”
[Diagram: client, two TMs, RMs, and two consensus boxes. Messages shown: Request Commit, Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit.]
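"TM just combines consensus values" can be sketched as a pure function over the per-RM outcomes. This is an illustration under assumptions: `combine` and the string constants are hypothetical names, and `None` stands for a consensus instance that has not chosen a value yet.

```python
PREPARED = "Prepared"
ABORTED = "Aborted"


def combine(chosen_by_rm):
    """Return "Commit", "Abort", or None if the outcome is not yet known.

    chosen_by_rm maps each RM name to the value chosen by that RM's
    consensus instance, or None if nothing has been chosen yet.
    """
    values = list(chosen_by_rm.values())
    if any(v == ABORTED for v in values):
        return "Abort"   # one Aborted instance is enough to abort
    if all(v == PREPARED for v in values):
        return "Commit"  # every instance chose Prepared
    return None          # still waiting on at least one RM
```

The function keeps no state of its own, so any TM that can read the chosen values computes the same answer. That is one way to read the "stateless" bullet.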
16
[Diagram, The Obvious Approach: Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.]
[Diagram, Paxos Commit: Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit.]
One fewer message delay
17
Consensus in Action
• The normal (failure-free) case
• Two message delays
• Can optimize
[Diagram: an RM, two TMs, and three acceptors inside the consensus box. Messages shown: Propose RM Prepared (three times), Vote RM Prepared (three times), RM Prepared Chosen.]
18
Consensus in Action
If a majority of acceptors are working, a TM can always learn what was chosen, or get Aborted chosen if nothing has been chosen yet.
[Diagram: an RM, TMs, and three acceptors inside the consensus box.]
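One way to see why a working majority is enough: assuming 2F+1 acceptors (the count given later in the deck), any two majorities of F+1 share at least one acceptor, so a TM that hears from a majority reaches an acceptor that took part in any earlier choice. The brute-force check below is only an illustration, and `majorities_intersect` is a hypothetical name.

```python
from itertools import combinations


def majorities_intersect(f):
    """Check that any two majorities of 2f+1 acceptors share a member."""
    acceptors = range(2 * f + 1)
    # Every subset of size f+1 is a majority of the 2f+1 acceptors.
    quorums = [set(q) for q in combinations(acceptors, f + 1)]
    return all(a & b for a in quorums for b in quorums)
```

The check enumerates every pair of quorums, so it is practical only for small F.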
19
The Complete Algorithm
• Subtle.
• More weird cases than most people imagine.
• Proved correct.
20
Paxos Commit
• N RMs
• 2F+1 acceptors (~2F+1 TMs)
• If F+1 acceptors see all RMs prepared, then transaction committed.
• 2F(N+1) + 3N + 1 messages, 5 message delays, 2 stable-write delays.
[Diagram: message flow among Client, TM, RM 1…N, and Acceptors 0…2F: request commit, prepare, prepared, all prepared, commit.]
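The commit rule in the third bullet can be written as a small predicate. This is a sketch under assumptions: `transaction_committed` and the shape of `acceptor_views` (each acceptor mapped to the set of RMs it has seen prepared) are hypothetical, chosen only to make the rule concrete.

```python
def transaction_committed(acceptor_views, rms, f):
    """Return True if at least f+1 acceptors have seen every RM prepared.

    acceptor_views maps each acceptor name to the set of RM names that
    the acceptor has seen prepared; rms lists all RM names.
    """
    all_rms = set(rms)
    # Count acceptors whose view covers every RM in the transaction.
    full_views = sum(
        1 for seen in acceptor_views.values() if all_rms <= set(seen)
    )
    return full_views >= f + 1
```

With F=1 and three acceptors, two complete views are enough, so one slow or failed acceptor does not hold up the commit.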
21
Two-Phase Commit              Paxos Commit (tolerates F faults)
• 3N+1 messages               • 3N + 2F(N+1) + 1 messages
• N+1 stable writes           • N+2F+1 stable writes
• 4 message delays            • 5 message delays
• 2 stable-write delays       • 2 stable-write delays
Same algorithm when F=0 and TM = Acceptor
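The cost formulas on this slide are easy to evaluate for concrete N and F. The helper names below are hypothetical; the formulas are copied from the slide.

```python
def two_phase_commit_cost(n):
    """Costs of 2PC with n resource managers, per the slide."""
    return {
        "messages": 3 * n + 1,
        "stable_writes": n + 1,
        "message_delays": 4,
        "stable_write_delays": 2,
    }


def paxos_commit_cost(n, f):
    """Costs of Paxos Commit with n RMs tolerating f faults, per the slide."""
    return {
        "messages": 3 * n + 2 * f * (n + 1) + 1,
        "stable_writes": n + 2 * f + 1,
        "message_delays": 5,
        "stable_write_delays": 2,
    }
```

For N=5 and F=1 this gives 16 messages and 6 stable writes for 2PC against 28 messages and 8 stable writes for Paxos Commit. Setting F=0 makes the message and stable-write counts equal.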
22
Summary
• Commit is common
• Two Phase commit is good but… It is the un-availability protocol
• Paxos commit is non-blocking if there are at most F faults.
• When F=0 (no fault-tolerance), Paxos Commit == 2PC
23
24
Paxos Consensus
• Group has a leader known to all
– leader election is a subroutine
• Process proposes a value v to leader.
• Leader sends proposal (phase 2) (ballot, value) to all acceptors
• Acceptors respond with: max (ballot, value) they have seen
• If leader gets no higher ballot, and gets at least F+1 responses, then leader can announce (ballot, value)
• Full protocol 3-phase
• Phase 1
– Leader starts new ballot
• Phase 2
– Leader proposes value
• Phase 3
– If value accepted by F+1 then value is accepted.
– If not, leader tries to get majority value accepted.
6F+4 messages, 2F+1 stable writes
4 message delays and 2 stable write delays
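The ballot steps above can be sketched as follows. This is the usual textbook single-decree Paxos, not necessarily the exact message layout on the slide. The names `Acceptor`, `on_prepare`, `on_accept`, and `run_ballot` are hypothetical, direct method calls stand in for messages, and leader election and stable writes are left out.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot this acceptor has promised
        self.accepted = None  # (ballot, value) last accepted, if any

    def on_prepare(self, ballot):
        # Phase 1: promise to ignore lower ballots; report any accepted value.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def on_accept(self, ballot, value):
        # Phase 2: accept unless a higher ballot has been promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def run_ballot(acceptors, ballot, value):
    """Leader logic for one ballot; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1  # F+1 out of 2F+1 acceptors

    replies = [a.on_prepare(ballot) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None  # a higher ballot is already in progress

    # If any acceptor already accepted a value, adopt the one with the
    # highest ballot instead of the leader's own proposal.
    earlier = [accepted for accepted in promises if accepted is not None]
    if earlier:
        value = max(earlier, key=lambda bv: bv[0])[1]

    # Phase 3: the value is chosen once a majority accepts it.
    votes = sum(a.on_accept(ballot, value) for a in acceptors)
    return value if votes >= majority else None
```

With three fresh acceptors, a first ballot proposing "Prepared" gets it chosen. A later ballot proposing "Aborted" still returns "Prepared", because the leader must adopt the value already accepted.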
25
Using Consensus
Have a consensus for each RM
[Diagram: client, two TMs, RMs, and two consensus boxes. Messages shown: Request Commit, Prepare, Prepared, Commit.]
26
[Diagram: an RM sends "Propose X" and a TM sends "Propose W" to the consensus box; "X Chosen" is reported three times.]
27
Paxos Commit (success case)
[State diagrams for the Acceptors, Resource Managers, and Commit Leader. States shown: working, prepared, AllPrepared, committed, aborted. Messages shown: Request Commit, Prepare, Prepared, All Prepared, Commit.]
28
Consensus
• The distributed systems theory community has thought about this a lot.
• They call it Consensus: N processes want to agree on a value
• Want to tolerate F faults
– Tolerate F processes stopping
– Tolerate F messages delayed or lost
• If there are at most F faults in a window, then consensus is achieved.
• Byzantine faults need 3F+1 “acceptors”
• Benign faults need 2F+1 “acceptors”
• Stalls but safe if more than F faults