Timeliness, Failure Detectors, and Consensus Performance

Idit Keidar and Alexander Shraer

Technion – Israel Institute of Technology

PODC 2006

Keidar & Shraer, Technion, Israel

Basic Model

• Message passing

• Links between every pair of processes

– do not create, duplicate or alter messages (integrity)

• Process and link failures

Eventually Stable (Indulgent) Models

• Initially asynchronous

– for unbounded period of time

• Eventually reach stabilization

– GST (Global Stabilization Time)

– following GST certain assumptions hold

• Examples

– ES (Eventual Synchrony) – starting from GST all links have a bound on message delay [Dwork, Lynch, Stockmeyer 88]

– failure detectors [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]

Indulgent Models: Research Trend

• Weaken post-GST assumptions as much as possible [Guerraoui, Schiper 96], [Aguilera et al. 03, 04], [Malkhi et al. 05]

Weaker = better?

You only need ONE machine with eventually ONE timely link. Buy the hardware to ensure it, set the timeout accordingly, and EVERYTHING WILL WORK.

Consensus with Weak Assumptions

[Figure: network diagram]

Why isn’t anything happening???

Don’t worry! It will eventually happen!

What’s Going On?

• In practice, bounds just need to hold “long enough” for the algorithm (TA) to finish

• But TA depends on our synchrony assumptions – with weak assumptions, TA might be unbounded

• For practical systems, eventual completion of the job is not enough!

Our Goal

• Understand the relationship between:

– assumptions (1 timely link, failure detectors, etc.) that eventually hold

– performance of algorithms that exploit these assumptions, and only them

• Challenge: How do we understand the performance of asynchronous algorithms that make very different assumptions?

Typical Metric: Count “Rounds”

• Algorithms normally progress in rounds, though rounds are not synchronized among processes

at process p_i:
    forever do
        send messages
        receive messages while (!some conditions)
        compute…

• Previous work:

– look at synchronous runs (every message takes exactly δ time)

– count rounds or “δs” [Keidar, Rajsbaum 01], [Dutta, Guerraoui 02], [Guerraoui, Raynal 04], [Dutta et al. 03], etc.

Are All “Rounds” the Same?

• Algorithm 1 waits for messages from a majority that includes a pre-defined leader in each round

– takes 3 rounds

• Algorithm 2 waits for messages from all (unsuspected) processes in each round

– E.g., group membership

– takes 2 rounds

GIRAF: General Round-based Algorithm Framework

• Inspired by Gafni’s RRFD, generalizes it

• Organize algorithms into rounds

• Separate algorithm logic from waiting condition

• Waiting condition defines model

• Allows reasoning about lower and upper bounds for rounds of different types
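The separation between algorithm logic and waiting condition can be illustrated with a toy round simulator. This is only a sketch under stated assumptions, not GIRAF's formal definition: the names `run_rounds`, `all_timely`, `only_self`, and `take_min` are hypothetical, and the "waiting condition" is modeled as a function that says which senders each process hears from in a round.

```python
from typing import Callable, Dict, List, Set

# Hypothetical types for this sketch (not taken from the slides).
Message = int
# For a given round, maps each process to the set of senders it hears from.
DeliveryRule = Callable[[int, List[int]], Dict[int, Set[int]]]


def run_rounds(initial: Dict[int, Message],
               compute: Callable[[Message, List[Message]], Message],
               deliver: DeliveryRule,
               rounds: int) -> Dict[int, Message]:
    """Run a round-based algorithm; `deliver` plays the role of the model."""
    state = dict(initial)
    procs = sorted(state)
    for k in range(1, rounds + 1):
        outgoing = dict(state)          # every process sends its current state
        heard_from = deliver(k, procs)  # the model decides who is heard in round k
        for p in procs:
            received = [outgoing[q] for q in sorted(heard_from[p])]
            state[p] = compute(state[p], received)
    return state


def all_timely(k: int, procs: List[int]) -> Dict[int, Set[int]]:
    """Model 1 (assumed): every link is timely in every round."""
    return {p: set(procs) for p in procs}


def only_self(k: int, procs: List[int]) -> Dict[int, Set[int]]:
    """Model 2 (assumed): no link is timely; each process hears only itself."""
    return {p: {p} for p in procs}


def take_min(own: Message, received: List[Message]) -> Message:
    """Algorithm logic: adopt the smallest value seen so far."""
    return min([own] + received)


start = {0: 7, 1: 3, 2: 9}
print(run_rounds(start, take_min, all_timely, rounds=1))  # {0: 3, 1: 3, 2: 3}
print(run_rounds(start, take_min, only_self, rounds=1))   # {0: 7, 1: 3, 2: 9}
```

The same `take_min` logic runs unchanged under both delivery rules; only the model differs, which is what makes round counts comparable across models.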

Defining Properties in GIRAF

• Environment can have

– perpetual properties

– eventual properties

• In every run r, there exists a round GSR(r)

• GSR(r) – the first round from which:

– no process fails

– all eventual properties hold in each round
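As a rough illustration of the GSR definition above, the sketch below scans a finite, recorded prefix of a run. The function name `gsr` and the two boolean lists are assumptions made for this example; a real run is infinite, so a finite trace can only approximate GSR(r).

```python
from typing import List, Optional


def gsr(fails_in_round: List[bool], props_hold: List[bool]) -> Optional[int]:
    """Return the first round (1-based) from which, until the end of the
    recorded trace, no process fails and all eventual properties hold.
    Returns None if the last recorded round is already bad."""
    assert len(fails_in_round) == len(props_hold)
    candidate = None
    # Walk backwards from the last round while rounds stay "good".
    for k in range(len(props_hold), 0, -1):
        good = props_hold[k - 1] and not fails_in_round[k - 1]
        if not good:
            break
        candidate = k
    return candidate


# Round 2 has a failure and violated properties; rounds 3 and 4 are good.
print(gsr([False, True, False, False], [True, False, True, True]))  # 3
```

Note that round 1 being good does not make it the GSR here, because the properties must hold in every round from the GSR onward.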

Defining Timeliness

• Timely link in round k: p_d receives the round k message of p_s in round k

– if p_d is correct, and p_s executes round k (end-of-round_s occurs in round k)

Time – free!

Some Results: Context

• Consensus problem

• Global decision time metric

– Time until all correct processes decide

• Message passing

• Crash failures

– t < n/2 potential failures out of n > 1 processes
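The bound t < n/2 means the correct processes always form a majority, so waiting for a majority cannot block forever. A small arithmetic check, with the helper names `max_failures` and `majority` chosen for this sketch only:

```python
def max_failures(n: int) -> int:
    """Largest integer t with t < n/2."""
    return (n - 1) // 2


def majority(n: int) -> int:
    """Smallest number of processes that is more than half of n."""
    return n // 2 + 1


for n in (2, 3, 4, 5, 7):
    t = max_failures(n)
    # Even if t processes crash, the n - t survivors are still a majority.
    print(n, t, majority(n), n - t >= majority(n))
```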

◊LM Model: Leader and Majority

• Nothing required before GSR

• In every round k ≥ GSR

– Every correct process receives a round k message from a majority of processes, one of which is the Ω-leader.

• Practically requires much shorter timeouts than Eventual Synchrony [Bakr, Keidar]
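The ◊LM round condition stated above can be written as a predicate over one round's deliveries. This is a sketch: the name `lm_holds` and the data layout are assumptions, and a process's own message is assumed to count toward its majority.

```python
from typing import Dict, Set


def lm_holds(received: Dict[int, Set[int]], correct: Set[int],
             leader: int, n: int) -> bool:
    """True if, in this round, every correct process heard from a majority
    of the n processes and that majority includes the Ω-leader."""
    needed = n // 2 + 1
    return all(len(received[p]) >= needed and leader in received[p]
               for p in correct)


n, correct, leader = 5, {0, 1, 2}, 0
good_round = {0: {0, 1, 2}, 1: {0, 1, 3}, 2: {0, 2, 4}}
bad_round = {0: {0, 1, 2}, 1: {0, 1, 3}, 2: {1, 2, 4}}  # process 2 misses the leader

print(lm_holds(good_round, correct, leader, n))  # True
print(lm_holds(bad_round, correct, leader, n))   # False
```

Different processes may hear from different majorities; the condition only requires that each of those majorities contains the leader.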

◊LM: Previous Work

• Most Ω-based algorithms wait for majority in each round (not ◊LM)

• Paxos [Lamport 98] works for ◊LM

– Takes constant number of rounds in Eventual Synchrony (ES)

– But how many rounds without ES?

Paxos Run in ES

[Figure: message diagram of a Paxos run in ES. Labels: Ω Leader; BallotNum (number of attempts to decide initiated by leaders); messages (“prepare”, 2), yes, (“prepare”, 21), (Commit, 21, v1), decide v1.]

Paxos in ◊LM (w/out ES)

[Figure: message diagram of Paxos in ◊LM over rounds GSR, GSR+1, GSR+2, GSR+3. Labels: Ω Leader; BallotNum; messages (“prepare”, 2), (“prepare”, 9), (“prepare”, 14); replies no (5), no (8), no (13).]

Commit may take Ω(n) rounds!

What Can We Hope For?

• Tight lower bound for ES: 3 rounds from GSR [DGK05]

• ◊LM weaker than ES

• One might expect it to take a longer time in ◊LM than in ES

Result 1: Don't Need ES

• Leader and majority can give you the same performance!

• Algorithm that matches lower bound for ES!

Our ◊LM Algorithm in a Nutshell

• Commit with increasing ballot numbers, decide on value committed by majority

– like Paxos, etc.

• Challenge: Don’t know all ballots, how to choose the new one to be highest one?

• Solution: Choose it to be the round number

• Challenge: rounds are wasted if a prepare/commit fails.

• Solution: pipeline prepares and commits: try in each round

• Challenge: do they really need to say no?

• Solution: support leader’s prepare even if have a higher ballot number

– challenge: higher number may reflect later decision! Won’t agreement be compromised?

– solution: new field “trustMe” ensures supported leader doesn't miss real decisions

Example Run: GSR=100

[Figure: example run showing rounds GSR+1 and GSR+2. Labels: Ω Leader; <PREPARE, …, trustMe>; All PREPARE with !trustMe; All COMMIT; All DECIDE; Did not lead to decision.]

Question 2: ◊S and Ω Equivalent?

• ◊S and Ω equivalent in the “classical” sense [Chandra, Hadzilacos, Toueg 96]

– Weakest for consensus

• ◊S: eventually (from GSR onward),

– all faulty processes are suspected by every correct process

– there exists one correct process that is not suspected by any correct process.

• Can we substitute Ω with ◊S in ◊LM?
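The two ◊S properties listed above can be checked on a snapshot of suspicion lists. The function name `diamond_s_holds` and the dictionary layout are assumptions for this sketch; it tests a single point in time, whereas ◊S requires the properties to hold from GSR onward.

```python
from typing import Dict, Set


def diamond_s_holds(suspects: Dict[int, Set[int]],
                    correct: Set[int], faulty: Set[int]) -> bool:
    """Check both ◊S properties on one snapshot of suspicion lists."""
    # Every correct process suspects every faulty process.
    completeness = all(faulty <= suspects[p] for p in correct)
    # Some correct process is suspected by no correct process.
    accuracy = any(all(c not in suspects[p] for p in correct)
                   for c in correct)
    return completeness and accuracy


correct, faulty = {0, 1}, {2}
# Process 1 wrongly suspects process 0, but nobody suspects process 1.
print(diamond_s_holds({0: {2}, 1: {0, 2}}, correct, faulty))     # True
# Both correct processes suspect each other: no unsuspected correct process.
print(diamond_s_holds({0: {1, 2}, 1: {0, 2}}, correct, faulty))  # False
```

Unlike Ω, nothing here makes the correct processes agree on which process is the unsuspected one.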

Result 2: ◊S and Ω not that Equivalent

• Consensus takes linear time from GSR

• By reduction to mobile failure model [Santoro, Widmayer 89]

Result 3: Do We Need Oracles?

• Timely communication with majority suffices!

• ◊AFM (All-From-Majority) simplified:

– In every round k ≥ GSR, every correct process p receives round k message from a majority of processes, and p’s message reaches a majority of processes.

• Decision in 5 rounds from GSR

– 1st constant time algorithm w/out oracle or ES

– idea: information passes to all nodes in 2 rounds
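The simplified ◊AFM condition has two directions, incoming and outgoing, and a round can satisfy one without the other. The sketch below checks both; the name `afm_holds` and the data layout are assumptions, and a process is assumed to receive its own message.

```python
from typing import Dict, Set


def afm_holds(received: Dict[int, Set[int]], correct: Set[int], n: int) -> bool:
    """True if every correct process p hears from a majority of the n
    processes and p's own message is heard by a majority of processes."""
    needed = n // 2 + 1
    for p in correct:
        if len(received[p]) < needed:
            return False
        reached = sum(1 for senders in received.values() if p in senders)
        if reached < needed:
            return False
    return True


everyone = {0, 1, 2}
# Each process hears itself and one neighbour, and is heard by two processes.
ring = {0: {0, 1}, 1: {1, 2}, 2: {2, 0}}
# Process 2 hears a majority, but only process 2 hears process 2.
lopsided = {0: {0, 1}, 1: {0, 1}, 2: {1, 2}}

print(afm_holds(ring, everyone, n=3))      # True
print(afm_holds(lopsided, everyone, n=3))  # False
```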

Result 4: Can We Assume Less?

• ◊MFM: Majority from Majority

– The rest receive a message from a minority

• Only a little missing for ◊AFM

• Stronger than models in literature [Aguilera et al. 03, 04], [Malkhi et al. 05]

• Bounded time from GSR impossible!

Conclusions

• Which guarantees should one implement?

– weaker ≠ better

• some previously suggested assumptions are too weak

– sometimes a little stronger = much better

• worth longer timeouts / better hardware

– ES is not essential

• not worth longer timeouts / better hardware

– future: more models, bounds to explore

• GIRAF
