Logical Clocks
Ken Birman
Time: A major issue in distributed systems
We tend to use temporal concepts casually. Example: “p suspects that q has failed”
This implies a notion of time: first q was believed correct; later q is suspected faulty
Challenge: relating the local notion of time in a single process to a global notion of time
We discuss this issue before developing practical tools for dealing with other aspects, such as system state
Time in Distributed Systems
Three notions of time:
Time seen by an external observer: a global clock of perfect accuracy
Time seen on the clocks of individual processes: each has its own clock, and clocks may drift out of sync
Logical notion of time: event a occurs before event b, and this is detectable because information about a may have reached b
External Time
The “gold standard” against which many protocols are defined
Not implementable: no system can avoid uncertain details that limit temporal precision!
Use of external time is also risky: many protocols that seek to provide properties defined by external observers are extremely costly and, sometimes, are unable to cope with failures
Time seen on internal clocks
Most workstations have reasonable clocks
Clock synchronization is the big problem (we will revisit the topic later in the course): clocks can drift apart, and resynchronization, in software, is inaccurate
Unpredictable speeds are a feature of all computing systems; hence we can’t predict how long events will take (e.g., how long it will take to send a message and be sure it was delivered to the destination)
Logical notion of time
Has no clock in the sense of “real time”
Focus is on the definition of the “happens before” relationship: “a happens before b” if:
both occur at the same place and a finished before b started, or
a is the send of message m and b is the delivery of m, or
a and b are linked by a chain of such events
Logical time as a time-space picture
[Time-space diagram: events a, b, c, d across processes p0–p3]
a, b are concurrent
c happens after a, b
d happens after a, b, c
Notation
Use an arrow, “→”, to represent the happens-before relation
For the previous slide: a → c, b → c, c → d; hence a → d, b → d; a, b are concurrent
Also called the “potential causality” relation
Logical clocks
Proposed by Lamport to represent causal order
Write: LT(e) to denote logical timestamp of an event e, LT(m) for a timestamp on a message, LT(p) for the timestamp associated with process p
The algorithm ensures that if a → b, then LT(a) < LT(b)
Algorithm
Each process maintains a counter, LT(p)
For each event other than message delivery: set LT(p) = LT(p)+1
When sending message m, set LT(m) = LT(p)
When delivering message m to process q, set LT(q) = max(LT(m), LT(q))+1
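The three rules can be sketched in a few lines. This is an illustrative helper, not code from the course; the class and method names are my own.

```python
class LamportClock:
    """One process's logical clock LT(p), following the three rules above."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Any event other than a delivery: LT(p) = LT(p) + 1
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the message then carries LT(m) = LT(p)
        self.time += 1
        return self.time

    def deliver(self, lt_m):
        # On delivery at q: LT(q) = max(LT(m), LT(q)) + 1
        self.time = max(lt_m, self.time) + 1
        return self.time
```

If p’s send is its first event, the message carries LT(m) = 1 and a fresh receiver delivers it at LT(q) = 2, so the send precedes the delivery in logical time, as required.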
Illustration of logical timestamps
[Time-space diagram: each event on p0–p3 labeled with its logical timestamp; the per-process sequences are 0 1 2 7, 0 1 6, 0 2 3 4 5 6, and 0 1]
Concurrent events
If a, b are concurrent, LT(a) and LT(b) may have arbitrary values!
Thus, logical time lets us determine that a potentially happened before b, but not that a definitely did so!
Example: processes p and q never communicate. Both will have events 1, 2, ..., but even if LT(e) < LT(e’), e may not have happened before e’
Vector timestamps
Extend logical timestamps into a list of counters, one per process in the system
Again, each process keeps its own copy
Event e occurs at process p: p increments VT(p)[p] (the p’th entry in its own vector clock)
q receives a message from p: q sets VT(q) = max(VT(q), VT(p)) (element by element)
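The two rules can be sketched as follows (illustrative names). Note the code follows the slide’s rules literally: local events increment the process’s own entry, and a receive is a plain element-wise max; many formulations also count the receipt itself as an event at the receiver.

```python
class VectorClock:
    """Vector timestamp VT for process index `me` among n processes (a sketch)."""

    def __init__(self, me, n):
        self.me = me
        self.vt = [0] * n

    def event(self):
        # Event at p: increment VT(p)[p], the p'th entry of p's own vector
        self.vt[self.me] += 1
        return list(self.vt)

    def send(self):
        # Sending is an event; the message carries a copy of the vector
        return self.event()

    def receive(self, vt_m):
        # VT(q) = max(VT(q), VT(m)), element by element
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        return list(self.vt)
```

For example, p0’s first send is stamped [1,0,0,0]; a receiver p1 that has had one local event merges it into [1,1,0,0].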
Illustration of vector timestamps
[Time-space diagram: events on p0–p3 carry vector timestamps such as [1,0,0,0], [2,0,0,0], [0,0,1,0], [2,1,1,0], [2,2,1,0], [0,0,0,1]]
Vector timestamps accurately represent happens-before relation
Define VT(e) < VT(e’) if, for all i, VT(e)[i] ≤ VT(e’)[i], and for some j, VT(e)[j] < VT(e’)[j]
Example: if VT(e)=[2,1,1,0] and VT(e’)=[2,3,1,0] then VT(e)<VT(e’)
Notice that not all VT’s are “comparable” under this rule: consider [4,0,0,0] and [0,0,0,4]
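The comparison rule is easy to state in code (hypothetical function names):

```python
def vt_less(a, b):
    """VT(a) < VT(b): every entry of a is <= the matching entry of b,
    and at least one entry is strictly smaller."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def vt_concurrent(a, b):
    """Incomparable vectors correspond to (potentially) concurrent events."""
    return not vt_less(a, b) and not vt_less(b, a)
```

On the examples above, vt_less([2,1,1,0], [2,3,1,0]) holds, while [4,0,0,0] and [0,0,0,4] are incomparable in both directions.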
Vector timestamps accurately represent happens-before relation
Now we can show that VT(e) < VT(e’) if and only if e → e’:
If e → e’, then there exists a chain e0 → e1 → ... → en on which vector timestamps increase “hop by hop”
If VT(e) < VT(e’), it suffices to look at VT(e’)[proc(e)], where proc(e) is the place where e occurred. By definition, we know that VT(e’)[proc(e)] is at least as large as VT(e)[proc(e)], and by construction, this implies a chain of events from e to e’
Examples of VT’s and happens-before
Example: suppose that VT(e)=[2,1,0,1] and VT(e’)=[2,3,0,1], so VT(e)<VT(e’)
How did e’ “learn” about the 3 and the 1? Either these events occurred at the same place as e’, or some chain of send/receive events carried the values!
If VT’s are not comparable, the corresponding events are concurrent
Notice that vector timestamps require a static notion of system membership
For the vector to make sense, processes must agree on the number of entries
Later we will see that vector timestamps are useful within groups of processes
We will also find ways to compress them and to deal with dynamic group membership changes
What about “real-time” clocks?
Accuracy of clock synchronization is ultimately limited by uncertainty in communication latencies
These latencies are “large” compared with the speed of modern processors (a typical latency may be 35µs to 500µs: time enough for thousands of instructions)
Limits use of real-time clocks to “coarse-grained” applications
Interpretations of temporal terms
Understand now that “a happens before b” means that information can flow from a to b
Understand that “a is concurrent with b” means that there is no information flow between a and b
What about the notion of an “instant in time”, over a set of processes?
Neither clock is appropriate
The problem is that with both kinds of clocks, there can be many events that are concurrent with a given event
This leads to a philosophical question: event e has happened at process p; which events are “really” simultaneous with e?
Perspectives on logical time
One view is based on intuition from physics
Imagine a time-space diagram: cones of causality define past and future
“Now” is any consistent cut across the system, including no future events and no past events
Next Tuesday we will see algorithms based on this
Causal notions of past, future
[Time-space diagram: events a–g on processes p0–p3, with the causal FUTURE of one event shaded]
Causal notions of past, future
[The same diagram, with the causal PAST of the event shaded]
Issues raised by time
Time is a tool
Typical uses of time?
To put events into some sort of order
Example: the order of updates on a replicated data item
With one item, logical time may make sense
With multiple items, consider a VT with one element per item
Ways to extend time to a total order
Often we extend a logical timestamp or vector timestamp with the actual clock time when the event occurred and the process id where it occurred
The combination breaks any possible ties
Or we can use event “names”
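One way to see the tie-breaking: represent each event’s extended timestamp as a tuple (logical time, clock time, process id). A small sketch, assuming lexicographic comparison is the intended order; the values are invented for illustration.

```python
# Extended timestamps: (logical time, wall-clock time, process id).
# Python tuples compare lexicographically, so equal logical times fall
# back to the clock, and identical clocks fall back to the unique pid.
e1 = (3, 1044.25, "p")
e2 = (3, 1044.25, "q")   # same LT and clock; the pid breaks the tie
e3 = (4, 1044.10, "p")   # a larger LT wins regardless of the clock

assert e1 < e2 < e3
total_order = sorted([e3, e2, e1])
assert total_order == [e1, e2, e3]
```

Because process ids are unique, no two events from different processes ever compare equal, which is exactly what makes the order total.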
An example
Suppose we are broadcasting messages
An atomic broadcast is:
Fault-tolerant: unless every process with a copy fails, the message is delivered everywhere (often expressed as all-or-nothing delivery)
Ordered: if p and q both receive m and n, either both receive m before n, or both receive n before m
How should we implement this policy?
Easy case
In many systems there is really just one source of broadcasts
Typically we see this pattern when there is really one reference copy of a replicated object and the replicas are viewed as cached copies
Accordingly, we can use a FIFO-ordered broadcast and reduce the problem to fault-tolerance
FIFO ordering simply requires a counter at the sender
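A minimal sketch of FIFO ordering from a single sender, with a receiver that buffers out-of-order arrivals until the gap is filled (class names are illustrative):

```python
class FifoSender:
    """Stamps each outgoing message with a per-sender sequence number."""

    def __init__(self):
        self.seq = 0

    def stamp(self, payload):
        self.seq += 1
        return (self.seq, payload)

class FifoReceiver:
    """Delivers messages in sequence order, holding back any gaps."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}       # seq -> payload, not yet deliverable
        self.delivered = []

    def receive(self, msg):
        seq, payload = msg
        self.pending[seq] = payload
        # Deliver every consecutive message starting at next_seq
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

If message 2 arrives before message 1, it is buffered; once message 1 arrives, both are delivered, preserving the sender’s order.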
A more complex example
Sender-ordered multicast:
The sender places a timestamp in the broadcast
A receiver waits until it has the full set of messages
It orders them by logical timestamp, breaking ties with sender id, then delivers in this order
How can it tell when it has the “full set”?
A more complex example
[Diagram: two concurrent multicasts m and n arrive at different receivers in different orders. Deliver m, n or n, m?]
A more complex example
The solution implicitly depends upon membership
In fact, most distributed systems depend upon membership; membership is “the most fundamental” idea in many systems for this reason
A receiver can simply wait until all members have sent one message
The system ends up running in rounds, where each member contributes zero or one messages per round
Use a “null” message if you have nothing to send
A more complex example
[Diagram: with one message per member per round, every receiver collects the full set {m, n} before delivering]
Optimizations
We could agree in advance on “permission to send”
Now perhaps only p and q have permission
We treat their messages in rounds, but others must get permission before sending
This avoids all the null messages and ensures fairness if p and q send at the same rate
Dolev explored extensions for varied rates; it gets quite elaborate…
Optimizations
In the limit, we end up with a token scheme
While holding the token, p has permission to send
If q requests the token, p must release it (perhaps after a small delay)
The token carries the sequence number to use
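A sketch of the token scheme: only the token holder may send, and the token itself carries the next sequence number. The classes are illustrative, and the request/release messaging and failure handling are ignored here.

```python
class Token:
    """The token carries the next global sequence number to use."""

    def __init__(self):
        self.next_seq = 1

class Member:
    def __init__(self, name):
        self.name = name
        self.token = None

    def acquire(self, token):
        self.token = token

    def release(self):
        token, self.token = self.token, None
        return token

    def send(self, payload):
        # Holding the token *is* the permission to send
        assert self.token is not None, "must hold the token to send"
        seq = self.token.next_seq
        self.token.next_seq += 1
        return (seq, payload)
```

Here p sends m as (1, m), passes the token to q, and q sends n as (2, n); every receiver can then deliver purely by sequence number.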
A more complex example
[Diagram sequence: the token holder sends m with sequence number 1; the token passes, and the next holder sends n with sequence number 2]
An example
Such solutions are expressed in many ways
With a ring: Chang and Maxemchuk; messages are like a “train”, with new messages tacked onto the end and old ones delivered from the front
Direct all-to-all broadcast: like a token moving around the ring, but it carries the messages with it (inspired by FDDI)
Tree-structured in various ways
More examples
The old Isis system uses logical clocks
The sender says “here is a message”
Receivers maintain logical clocks; each proposes a delivery time
The sender gathers votes, picks the maximum, and says “commit delivery at time t”
Receivers deliver committed messages in timestamp order from the front of a queue
More examples
[Diagram sequence: p, q, and r propose delivery times for m and n (m:[1,p], n:[2,p]; n:[1,q], m:[2,q]; m:[1,r], n:[2,r]); the sender commits the maximum proposal for each, and all three deliver m, then n]
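The Isis protocol described above can be sketched as follows. This is a simplified model with illustrative names: proposals come from each receiver’s logical clock, the sender commits the maximum, and a receiver delivers from the front of its timestamp-ordered queue only when the head is committed. Tie-breaking by sender id is omitted for brevity.

```python
class Receiver:
    def __init__(self):
        self.lt = 0
        self.queue = {}        # msg -> (timestamp, committed?)
        self.delivered = []

    def propose(self, msg):
        # Each receiver proposes a delivery time from its logical clock
        self.lt += 1
        self.queue[msg] = (self.lt, False)
        return self.lt

    def commit(self, msg, final_time):
        # Adopt the sender's chosen (maximum) time and mark committed
        self.lt = max(self.lt, final_time)
        self.queue[msg] = (final_time, True)
        self._try_deliver()

    def _try_deliver(self):
        while self.queue:
            msg = min(self.queue, key=lambda k: self.queue[k][0])
            t, committed = self.queue[msg]
            if not committed:
                break   # the head's final time is still unknown; wait
            self.delivered.append(msg)
            del self.queue[msg]

def broadcast(msg, receivers):
    # Sender role: gather the votes, pick the maximum, commit everywhere
    final = max(r.propose(msg) for r in receivers)
    for r in receivers:
        r.commit(msg, final)
```

Because an uncommitted message’s final time can only grow, holding back delivery at an uncommitted queue head is exactly what keeps the delivery order identical at all receivers.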
More examples
Later versions of Isis used vector times
Membership is handled separately
Each message is assigned a vector time
Messages are delivered in vector-time order, with ties broken using the process id of the sender
Totem and Transis
These systems represent time using partial-order information
Message m arrives and includes ordering fields: deliver m after n and o
By transitivity, if n is after p, then m is after p
Break ties using the process id number
Totem and Transis
[Diagram: a delivery partial order in which m comes after n and o, and n comes after p]
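The ordering rule above amounts to a topological sort of the “after” constraints, with sender ids breaking ties among concurrent messages. A small sketch with hypothetical data shapes; this is not the real Totem/Transis wire format.

```python
def deliver_order(after, sender_id):
    """after: msg -> set of msgs it must follow;
    sender_id: msg -> numeric id used to break ties."""
    delivered = []
    remaining = set(after)
    while remaining:
        # Messages whose every predecessor has already been delivered
        ready = [m for m in remaining if after[m] <= set(delivered)]
        # Among concurrent ready messages, the smallest sender id goes first
        nxt = min(ready, key=lambda m: sender_id[m])
        delivered.append(nxt)
        remaining.remove(nxt)
    return delivered
```

On the diagram’s constraints (m after n and o, n after p), every returned order puts m last and p before n, whatever the tie-breaking ids.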
Things to notice
Time is just a programming tool
But membership and message atomicity are very fundamental:
Waiting for m won’t work if m never arrives
And a VT is only meaningful if we can agree on the meaning of the indices
With failures, these algorithms get surprisingly complicated: suppose p fails while sending m?
Major uses of time
To order updates on replicated data
To define versions of objects
To deal with processes that come and go in dynamic networked applications
Processes that joined earlier often have more complete knowledge of system state
A process that leaves and rejoins often needs some form of incrementing “incarnation number”
To prove correctness of complex protocols