

1

Paxos Commit

Jim Gray, Leslie Lamport

Microsoft Research

Preview of a paper in preparation

Presented at Microsoft Research Techfest, 3 March 2004, Redmond, WA

Article MSR-TR-2003-96: Consensus on Transaction Commit

http://research.microsoft.com/research/pubs/view.aspx?tr_id=701

2

Commit is Common

• Marriage ceremony: "Do you?" "I do." "I now pronounce you…"

• Theater: "Ready on the set?" "Ready!" "Action!"

• Contract law: Offer, Signature, Deal / lawsuit

3

The Common Picture

[Diagram: a director and several actors. The director asks each actor "Ready?", each actor answers "Ready", and the director then calls "Action!"]

4

All or Nothing: If any actor says no, the deal is off.

[Diagram: the director asks each actor "Ready?". One actor answers "No!" (or times out) while the others answer "Ready", and the director tells every actor "No deal!"]

5

The Database Version

TM: Transaction Manager

RM: Resource Manager

[Diagram: the director/actors picture relabeled with a client, a TM, and RMs; messages shown: Commit, Ready?, Ready.]

6

Two Phase Commit

• N Resource Managers (RMs)

• Want all RMs to commit or all abort.

• Coordinated by Transaction Manager (TM)

• TM sends Prepare, Commit-Abort

• RM responds Prepared, Aborted

• 3N+1 messages

• N+1 stable writes

• Delay

– 4 messages

– 2 stable writes

• Blocking: if TM fails, Commit-Abort stalls

[State diagrams: Transaction Manager with states working, committed, aborted; Resource Manager with states working, prepared, committed, aborted. Message diagram: RequestCommit, Prepare, Prepared, Commit.]
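The message count on the Two Phase Commit slide can be checked with a small simulation. The following is a minimal sketch of a failure-free round, assuming one RequestCommit from the client plus one Prepare, one reply, and one decision message per RM (one accounting that matches the slide's 3N+1). The class name, the `can_prepare` flag, and the function name are illustrative and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    name: str
    can_prepare: bool = True   # hypothetical flag: False models an RM that refuses
    state: str = "working"

    def on_prepare(self) -> str:
        # A real RM would force its vote to stable storage before replying.
        self.state = "prepared" if self.can_prepare else "aborted"
        return "Prepared" if self.can_prepare else "Aborted"

    def on_decision(self, decision: str) -> None:
        self.state = "committed" if decision == "Commit" else "aborted"


def two_phase_commit(rms):
    """Run one failure-free 2PC round; return (decision, message_count)."""
    messages = 1                          # client -> TM: RequestCommit
    votes = []
    for rm in rms:
        messages += 1                     # TM -> RM: Prepare
        votes.append(rm.on_prepare())
        messages += 1                     # RM -> TM: Prepared or Aborted
    decision = "Commit" if all(v == "Prepared" for v in votes) else "Abort"
    for rm in rms:
        messages += 1                     # TM -> RM: Commit or Abort
        rm.on_decision(decision)
    return decision, messages


if __name__ == "__main__":
    rms = [ResourceManager(f"rm{i}") for i in range(4)]
    print(two_phase_commit(rms))          # N = 4, so 3N+1 = 13 messages
```

The sketch has no timeouts or crashes, so it cannot show the blocking case: if the TM stopped after collecting the votes, every prepared RM would have to wait for it.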

7

The Problem With 2PC

• Atomicity – all or nothing

• Consistency – does right thing

• Isolation – no concurrency anomalies

• Durability / Reliability – state survives failures

• Availability: always up (2PC blocks if TM fails)

8

Problem Statement

• ACID Transactions make error handling easy.

• One fault can make 2-Phase Commit block.

• Goal: ACID and Available. Non-blocking despite F faults.

9

Fault-Tolerant Two Phase Commit

If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: add a “spare” transaction manager (non-blocking commit, 3-phase commit).

[Diagram: a client, two TMs, and RMs; messages shown: RequestCommit, Prepare, Prepared.]

10

Fault-Tolerant Two Phase Commit

But… what if…?

[Diagram: a client, TMs, and RMs; messages shown: RequestCommit, Prepare, Prepared, and both commit and abort.]

Inconsistent! Now what?

The complexity is a mess.

11

Fault Tolerant 2PC

• Several workarounds proposed in database community:

• Often called "3-phase" or "non-blocking" commit.

• None with complete algorithm and correctness proof.

12

“Reaching Agreement in the Presence of Faults”

• 25 years of theory

• Now called the Consensus problem

• N processes want to agree on a value, even if F of them have failed.

Shostak, Pease, & Lamport, JACM, 1980

13

Consensus

• Collects proposed values

• Picks one proposed value

• Remembers it forever

[Diagram: clients send "Propose X" and "Propose W" to the consensus box; the clients are told "W Chosen".]
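The three properties of the consensus box can be written down as a tiny interface. This is only a single-process stand-in, assuming the first proposal wins; it has no replication or fault tolerance, and the class and method names are hypothetical.

```python
class ConsensusBox:
    """In-memory stand-in for a consensus service (not fault tolerant)."""

    def __init__(self):
        self._chosen = None

    def propose(self, value):
        """Offer a value; return whichever value ends up chosen."""
        if value is None:
            raise ValueError("None is reserved for 'nothing chosen yet'")
        if self._chosen is None:
            self._chosen = value      # pick one proposed value
        return self._chosen           # and never change it afterwards

    def chosen(self):
        """Return the chosen value, or None if nothing has been chosen."""
        return self._chosen


if __name__ == "__main__":
    box = ConsensusBox()
    print(box.propose("W"))   # W
    print(box.propose("X"))   # W: the earlier choice is remembered
```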

14

Consensus for Commit: The Obvious Approach

• Get consensus on TM’s decision.

• TM just learns consensus value.

• TM is “stateless”

[Diagram: a client, TMs, RMs, and one consensus box; messages shown: Request Commit, Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.]

15

Consensus for Commit: The Paxos Commit Approach

• Get consensus on each RM’s choice.

• TM just combines consensus values.

• TM is “stateless”

[Diagram: a client, TMs, RMs, and one consensus box per RM; messages shown: Request Commit, Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit.]
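The "TM just combines consensus values" step can be sketched as a pure function over the per-RM outcomes. This is a sketch under assumptions: each RM's consensus instance yields "Prepared" or "Aborted" (or `None` while undecided), and the function name and dictionary layout are illustrative only.

```python
def combine(outcomes):
    """Combine per-RM consensus outcomes into a transaction decision.

    outcomes maps an RM name to "Prepared", "Aborted", or None (not yet chosen).
    Returns "Commit", "Abort", or None while the decision is still open.
    """
    if not outcomes:
        raise ValueError("at least one RM is required")
    values = list(outcomes.values())
    if any(v == "Aborted" for v in values):
        return "Abort"                # one Aborted outcome settles it
    if all(v == "Prepared" for v in values):
        return "Commit"               # every RM's instance chose Prepared
    return None                       # some instance has not chosen yet


if __name__ == "__main__":
    print(combine({"RM1": "Prepared", "RM2": "Prepared"}))   # Commit
    print(combine({"RM1": "Prepared", "RM2": None}))         # None
    print(combine({"RM1": "Aborted", "RM2": None}))          # Abort
```

Because the decision is a function of the chosen values alone, any TM that reads the same outcomes reaches the same result, which is one way to read the slide's "stateless" remark.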

16

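The Obvious Approach vs. Paxos Commit

[Diagram: two message sequences side by side. The Obvious Approach: Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit. Paxos Commit: Prepare, Propose RM1 Prepared and Propose RM2 Prepared, RM1 Prepared Chosen and RM2 Prepared Chosen, Commit.]

One fewer message delay with Paxos Commit.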

17

Consensus in Action

• The normal (failure-free) case

• Two message delays

• Can optimize

[Diagram: an RM, TMs, and three acceptors forming the consensus box; messages shown: three "Propose RM Prepared" messages, three "Vote RM Prepared" messages, and the result "RM Prepared Chosen".]
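In the failure-free case the learner only has to count votes. A minimal sketch, assuming 2F+1 acceptors, a single ballot, and that a value counts as chosen once F+1 acceptors have voted for it; the function name and argument layout are hypothetical.

```python
from collections import Counter


def chosen_value(votes, f):
    """Return the value voted for by at least f+1 acceptors, else None.

    votes holds one entry per acceptor heard from so far (at most 2f+1).
    """
    if len(votes) > 2 * f + 1:
        raise ValueError("more votes than acceptors")
    for value, count in Counter(votes).items():
        if count >= f + 1:            # a majority of the 2f+1 acceptors
            return value
    return None


if __name__ == "__main__":
    print(chosen_value(["Prepared"], f=1))               # None: one vote is not enough
    print(chosen_value(["Prepared", "Prepared"], f=1))   # Prepared
```

With 2F+1 acceptors two different values cannot both reach F+1 votes, so the loop can return the first value that does.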

18

Consensus in Action

[Diagram: an RM, TMs, and three acceptors forming the consensus box.]

A TM can always learn what was chosen, or get Aborted chosen if nothing has been chosen yet, provided a majority of acceptors are working.

19

The Complete Algorithm

• Subtle.

• More weird cases than most people imagine.

• Proved correct.

20

Paxos Commit

• N RMs

• 2F+1 acceptors (~2F+1 TMs)

• If F+1 acceptors see all RMs prepared, then transaction committed.

• 2F(N+1) + 3N + 1 messages, 5 message delays, 2 stable-write delays.

[Message diagram: columns Client, TM, RM 1…N, Acceptors 0…2F; messages in order: request commit, prepare, prepared, all prepared, commit.]

21

Two-Phase Commit vs. Paxos Commit

Two-Phase Commit

• 3N+1 messages

• N+1 stable writes

• 4 message delays

• 2 stable-write delays

Paxos Commit (tolerates F faults)

• 3N + 2F(N+1) + 1 messages

• N+2F+1 stable writes

• 5 message delays

• 2 stable-write delays

Same algorithm when F=0 and TM = Acceptor
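The cost figures in this comparison are closed-form, so they are easy to tabulate for concrete N and F. The sketch below transcribes the slide's formulas; the function names and dictionary keys are illustrative.

```python
def two_pc_cost(n):
    """Costs of Two-Phase Commit with n resource managers, per the slide."""
    return {
        "messages": 3 * n + 1,
        "stable_writes": n + 1,
        "message_delays": 4,
        "stable_write_delays": 2,
    }


def paxos_commit_cost(n, f):
    """Costs of Paxos Commit with n RMs tolerating f faults, per the slide."""
    return {
        "messages": 3 * n + 2 * f * (n + 1) + 1,
        "stable_writes": n + 2 * f + 1,
        "message_delays": 5,
        "stable_write_delays": 2,
    }


if __name__ == "__main__":
    print(two_pc_cost(5))            # 16 messages, 6 stable writes
    print(paxos_commit_cost(5, 1))   # 28 messages, 8 stable writes
    print(paxos_commit_cost(5, 0))   # 16 messages, 6 stable writes
```

With F=0 the message and stable-write counts match Two-Phase Commit exactly, which agrees with the slide's remark that the two are the same algorithm in that case.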

22

Summary

• Commit is common

• Two Phase Commit is good, but… it is the un-availability protocol

• Paxos commit is non-blocking if there are at most F faults.

• When F=0 (no fault-tolerance), Paxos Commit == 2PC

23

24

Paxos Consensus

• Group has a leader known to all

– leader election is a subroutine

• Process proposes a value v to leader.

• Leader sends proposal (phase 2) (ballot, value) to all acceptors

• Acceptors respond with: max (ballot, value) they have seen

• If leader gets no higher ballot, and gets at least F+1 responses, then leader can announce (ballot, value)

• Full protocol is 3-phase

• Phase 1:

– Leader starts new ballot

• Phase 2:

– Leader proposes value

• Phase 3:

– If value accepted by F+1, then value is accepted.

– If not, leader tries to get majority value accepted.

6F+4 messages, 2F+1 stable writes, 4 message delays, and 2 stable-write delays
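The ballot rules above can be illustrated with a compact single-value Paxos sketch. It follows the textbook two-phase formulation (promise, then accept) rather than the slide's exact phase numbering, runs in one process with no real messaging or stable storage, and assumes every leader uses a distinct ballot number. The names `Acceptor` and `run_ballot` are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1       # highest ballot this acceptor has promised
        self.accepted = None     # last (ballot, value) accepted, if any

    def prepare(self, ballot):
        """Promise to ignore lower ballots; report the last accepted pair."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        """Accept the proposal unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def run_ballot(reachable, ballot, value, f):
    """Leader runs one ballot; returns the chosen value or None.

    reachable lists the acceptors the leader can talk to (out of 2f+1).
    """
    quorum = f + 1
    promised = []
    for acceptor in reachable:
        ok, last = acceptor.prepare(ballot)
        if ok:
            promised.append((acceptor, last))
    if len(promised) < quorum:
        return None                              # no majority: stall, stay safe
    prior = [last for _, last in promised if last is not None]
    if prior:
        # Must re-propose the value accepted in the highest earlier ballot.
        value = max(prior, key=lambda bv: bv[0])[1]
    votes = sum(acceptor.accept(ballot, value) for acceptor, _ in promised)
    return value if votes >= quorum else None


if __name__ == "__main__":
    acceptors = [Acceptor() for _ in range(3)]                # F = 1
    print(run_ballot(acceptors, 1, "Prepared", f=1))          # Prepared
    print(run_ballot(acceptors[1:], 2, "Aborted", f=1))       # Prepared
```

The second call models a new leader that tries to get Aborted chosen after one acceptor has become unreachable: because Prepared was already accepted by a majority, the new ballot can only confirm it. On fresh acceptors the same call would get Aborted chosen instead.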

25

Using Consensus: Have a consensus for each RM

[Diagram: a client, TMs, RMs, and one consensus box per RM; messages shown: RequestCommit, Prepare, Prepared, Commit.]

26

[Diagram: an RM sends "Propose X" and a TM sends "Propose W" to the consensus box; the box reports "X Chosen".]

27

Paxos Commit (success case)

[State diagrams for the Resource Managers, the Acceptors, and the Commit Leader, using the states working, prepared, AllPrepared, committed, and aborted. Message sequence: Request Commit, Prepare, Prepared, All Prepared, Commit.]

28

Consensus

• The distributed systems theory community has thought about this a lot.

• They call it Consensus: N processes want to agree on a value

• Want to tolerate F faults

– Tolerate F processes stopping

– Tolerate F messages delayed or lost

• If there are at most F faults in a window, then consensus is achieved.

• Byzantine faults need 3F+1 “acceptors”

• Benign faults need 2F+1 “acceptors”

Stalls but safe if more than F faults