
Lightweight Logging For Lazy Release Consistent DSM

Costa, et al.

CS 717 - 11/01/01

Definition of an SDSM

In a software distributed shared memory (SDSM) system, each node runs its own operating system and has its own local physical memory

Each node runs a local process; together, these processes form the parallel application

The union of the local memories of these processes forms the global memory of the application

The global memory appears as a single virtual address space: a process accesses all memory locations in the same manner, using standard loads and stores

Basic Implementation of an SDSM

The virtual address space is divided into memory pages, which are distributed among the local memories of the different processes

Each node has a copy of the page-to-node assignments

We use the hardware's virtual memory support (page tables and page faults) to provide the appearance of shared memory

The SDSM system is implemented as a set of fault handler routines, as sketched below

Such a system is also called a shared virtual memory (SVM) system
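As a concrete illustration, here is a minimal sketch of such a fault handler in C, assuming POSIX signals and a 4 KB page; owner_of() and fetch_page() are hypothetical stand-ins for the page-node map lookup and the page-transfer message, not part of any real SDSM library.

    /* Sketch: service a read fault on a shared page via SIGSEGV. */
    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    extern int  owner_of(uintptr_t page);          /* hypothetical: page-node map lookup */
    extern void fetch_page(int node, void *page);  /* hypothetical: copy page from node  */

    static void sdsm_fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1);

        /* Make the frame writable long enough to install the fresh copy... */
        mprotect((void *)page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        fetch_page(owner_of(page), (void *)page);

        /* ...then keep only read access; a later write faults again and
           runs the write path (invalidate other copies, upgrade to R/W). */
        mprotect((void *)page, PAGE_SIZE, PROT_READ);
    }

    void install_sdsm_handler(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = sdsm_fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }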

Illustration

[Figure: nodes N1, N2, N3, each holding some of the pages P1-P5 in its local memory; the same virtual page may appear in multiple physical pages, on multiple nodes]

SDSM Operation

If N2 attempts to write x on P2:

P2 is marked as invalid in N2's page table, so the access will cause a fault

The fault handler checks the page-node map, and then requests that N3 send it P2

N3 sends the page and notifies all nodes of the change; N3 sets its page access to "invalid", N2 sets its page access to "read/write", and the handler returns (a sketch of this write path follows)

Multiple nodes can have the same page in their physical address spaces if that page is "read-only" for all of them, but only one node can hold a copy of a page that is "read/write"

Page Size Granularity

Memory access is managed at the granularity of an OS page

This is easy to implement, but can be very inefficient

If a node exhibits poor spatial locality, there is a lot of unnecessary data transfer

If both x and y are on the same page P, and N1 is repeatedly writing to x while N2 is writing to y, P will be continually sent back and forth between N1 and N2. This is false sharing

Sequential Consistency

Defined by Lamport as:

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program

Is this SDSM Sequentially Consistent?

Assume a and b are on P1 and P2 respectively, and both are initially 0:

    N1:        N2:
    a = 1      print b
    b = 1      print a

If N2 does not invalidate its copy of P1, but does invalidate P2, the output will be <1,0>, which is invalid under SC: once N2 has seen b = 1, the earlier write a = 1 must also be visible

Ensuring Sequential Consistency

For the system to be SC, N1 must ensure that N2 has invalidated its copy of a page before N1 can write to that page

Before a write, N1 must tell N2 to invalidate its copy of the page, and then wait for N2 to acknowledge that it has done so

Of course, if we know that N2's copy is already invalidated, we don't need to do this: N2 could not have re-obtained access without N1's copy being invalidated

Ping-Pong Effect

SC, combined with the large sharing granularity (OS page), can lead to the ping-pong effect

Substantial, expensive communication due to false sharing

A Problem With SC

N1 is continually writing to x while N2 is continually reading from y, both on the same page P

N2 has P in "read-only", and N1 has P in "read-only"

N1 attempts to write to x, faults, and tells N2 to go to "invalid"

N1 waits for N2 to go to "invalid", N1 goes to "read/write", and N1 does the write

N2 tries to read, faults, tells N1 to go to "read-only" and send the current copy of P, and N2 goes to "read-only"

N2 gets P and does the read

Ping-Pong Effect

[Figure: message diagram between N1 and N2, both starting in R/O. Each write by N1 triggers an inval/ack exchange (N2 goes to invalid, N1 to R/W); each read by N2 triggers a req/reply exchange (both back to R/O); the cycle repeats indefinitely]

Relaxing the Consistency Model

The memory consistency model specifies constraints on the order in which memory operations appear to execute with respect to each other

Can we relax the consistency model to improve performance?

Release Consistency

Certain operations are specified as ‘acquire’ and ‘release’ operations

Code below an acquire can never be moved above the acquire

Code above the release can never be moved below the release

As long as there are no race conditions, the behavior of a program is the same under RC as under SC
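In code, only the synchronization operations constrain reordering. A minimal sketch, where lock_acquire()/lock_release() are hypothetical names for the RC acquire and release primitives:

    extern void lock_acquire(void *lock);  /* hypothetical acquire operation */
    extern void lock_release(void *lock);  /* hypothetical release operation */

    void update(void *lock, int *x)
    {
        /* region I: may drift downward, below the acquire */
        lock_acquire(lock);
        *x = *x + 1;       /* region II: fenced on both sides */
        lock_release(lock);
        /* region III: may drift upward, above the release */
    }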

RC Illustration

[Figure: code regions I, II, III around an acquire and a release. Code from I may move below the acquire and code from III may move above the release, but region II cannot cross either boundary]

Lazy Release Consistency (LRC)

In order for a system to be RC, it must ensure that all memory writes above a release become visible before the release itself becomes visible

i.e., before issuing a release, it must invalidate all other copies of the pages it has written

Can we relax this further?

LRC

LRC is a further relaxation: let's not invalidate pages until absolutely necessary

N1: I, acquire, II, release
N2: III, acquire, IV, release

Only when N2 is about to perform its acquire does N1 ensure that all the changes it made before its release are visible

N1 invalidates N2's copies of the pages before N2's acquire completes

Illustration

[Figure: timelines for N1 and N2. Under RC, N1 exchanges inval/ack messages with N2 at each release, after intervals I and II. Under LRC, N1 runs A I R and A II R without communicating, and a single inval/ack exchange is deferred until N2's next acquire]

TreadMarks

A high-performance SDSM that implements LRC

Keleher, Cox, Dwarkadas, and Zwaenepoel, 1994

Intervals

The execution of each process is divided into intervals, each beginning at a synchronization access (acquire or release)

These intervals form a partial order: intervals on the same process are totally ordered, and interval x precedes interval y if the release that ended x corresponds to the acquire that began y

When a process begins a new interval, it creates a new IntervalRecord

Vector Clocks

Each process also keeps a current vector clock, VC = <..., L, M, N, O, ...>

If VC_N is process N's vector clock, then VC_N(M) is the most recent interval of process M that process N knows about

VC_N(N) is therefore the current interval of process N
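A minimal vector-clock sketch in C; NPROCS is a name invented here, assuming a fixed number of processes.

    #define NPROCS 3  /* assumed fixed process count for this sketch */

    typedef struct { int t[NPROCS]; } VC;

    /* vc_covers(a, b): true iff a already knows every interval b records */
    static int vc_covers(const VC *a, const VC *b)
    {
        for (int i = 0; i < NPROCS; i++)
            if (a->t[i] < b->t[i]) return 0;
        return 1;
    }

    /* vc_merge(dst, src): pointwise maximum, applied after receiving
       another process's IntervalRecords */
    static void vc_merge(VC *dst, const VC *src)
    {
        for (int i = 0; i < NPROCS; i++)
            if (src->t[i] > dst->t[i]) dst->t[i] = src->t[i];
    }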

Interval Records

An IntervalRecord is a structure containing:

The pid of the process that created this record

The vector-clock timestamp of when this interval was created

A list of WriteNotices

Write Notices

A WriteNotice is a record containing:

The page number of the page written to

A diff showing the changes made to this page

A pointer to the corresponding IntervalRecord
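A hedged sketch of these two records in C, reusing the VC type from the vector-clock sketch above; the field names are invented here, and the diff is held as an opaque byte buffer.

    #include <stddef.h>

    typedef struct WriteNotice    WriteNotice;
    typedef struct IntervalRecord IntervalRecord;

    struct WriteNotice {
        int             page;      /* page number written during the interval */
        unsigned char  *diff;      /* encoded changes made to that page       */
        size_t          diff_len;
        IntervalRecord *interval;  /* back-pointer to the owning record       */
        WriteNotice    *next;
    };

    struct IntervalRecord {
        int          pid;          /* process that created this record        */
        VC           timestamp;    /* vector clock when the interval began    */
        WriteNotice *notices;      /* writes performed during the interval    */
    };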

Acquiring A Lock

When N1 wants to acquire a lock, it sends its current vector clock to the Lock Manager

The Lock Manager forwards this message to the last process that acquired this lock (assume N2)

N2 replies (to N1) with all the IntervalRecords that have a timestamp between the VC sent by N1 and the VC of the IR that ended with the most recent release of that lock

N1 receives the IntervalRecords from N2

N1 stores these IntervalRecords in volatile memory

N1 invalidates all pages for which it received a WriteNotice (in the IRs)

On a page fault, N1 obtains a copy of the page, and then applies all the diffs for that page in interval order

If N1 is about to write to that page, it first makes a copy of it (a "twin"), so that it can later compute the diff of its changes
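A sketch of the twin-and-diff mechanics, with one possible (word offset, new value) encoding of a diff; none of these names are TreadMarks's own.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_BYTES 4096
    #define PAGE_WORDS (PAGE_BYTES / sizeof(uint32_t))

    /* Before the first write in an interval, save a pristine copy (the twin). */
    void *make_twin(const void *page)
    {
        void *twin = malloc(PAGE_BYTES);
        memcpy(twin, page, PAGE_BYTES);
        return twin;
    }

    /* At diff time, compare the page to its twin word by word; out must
       hold up to 2 * PAGE_WORDS entries, written as (offset, value) pairs. */
    size_t make_diff(const uint32_t *page, const uint32_t *twin, uint32_t *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i]) { out[n++] = (uint32_t)i; out[n++] = page[i]; }
        return n;
    }

    /* Applying a diff replays the recorded words onto a copy of the page. */
    void apply_diff(uint32_t *page, const uint32_t *diff, size_t n)
    {
        for (size_t i = 0; i + 1 < n; i += 2)
            page[diff[i]] = diff[i + 1];
    }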

Example

[Figure: N1, N2, and N3 all start at <0,0,0>. N1 acquires, writes P, and releases, ending interval <1,0,0>. N2 requests the lock with VC <0,0,0>, receives the IR/diff for <1,0,0>, applies the diff, writes P, and releases at <1,1,0>. N3 then requests with VC <0,0,0>, receives the IRs/diffs for <1,0,0> and <1,1,0>, applies the diffs, writes P, and releases at <1,1,1>]

Example (cont.)

If N1 were to issue another acquire, it would only have to apply the diffs in the IRs of times <1,1,0> and <1,1,1>, because its current VC is <1,0,0>

Improvement: Garbage Collection

Each node keeps a log of all the shared-memory writes it made, along with all the writes it needed to know about

At a barrier, the nodes can synchronize so that each node has the most up-to-date copy of its pages; the logs can then be discarded

Improvement: Sending Diffs

You might notice that if N1 writes to pages P1, P2, and P3 during an interval, and N2 acquires the lock next, N1 needs to send all three diffs to N2, regardless of whether N2 will actually need those pages

In truth, N1 does not send the diffs; it sends a pointer to the location in its local memory where each diff resides

If N2 needs to apply a diff, it requests that diff from N1, using that pointer

Adding Fault Tolerance

Assume we would like the ability to survive single node failure (only one fails at a time, but multiple failures may occur during the running of the application)

What information would we need to log, and where? Remember, we already log IntervalRecords and WriteNotices as part of the usual operation of TreadMarks

Suppose Ni fails and then restarts. If it acquires a lock, it must see the same version of the page that it saw during the original run

Therefore Nj must send it the same WriteNotices (diffs) as before, even though Nj's current version of the page might be very different, and Nj's vector clock has also changed

Example

If N3 is restarted, when it reissues the acquire it must receive the same set of WriteNotices as it did during its original run. If we run the algorithm unmodified, N3 would receive <1,0,0>, <1,1,0>, <1,1,1>, and <1,2,1>, and the application would be incorrect

[Figure: N1 acquires, writes, and releases, ending <1,0,0>. N2 does the same, ending <1,1,0>, after receiving IR <1,0,0>. N3 does the same, ending <1,1,1>, after receiving IRs <1,0,0> and <1,1,0>. N2 acquires again from <1,1,0>, receiving IR <1,1,1>, and ends <1,2,1>. N3 then crashes]

Therefore, N2 needs some way of logging which IntervalRecords it had sent to N3

It does this by storing the VC of N3 when it issued the acquire (this was sent along with the request) and the VC of N2 when it received the request; this pair is stored in N2's send-log

From these two VCs, N2 can determine which IntervalRecords it had sent to N3
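A hedged sketch of a send-log entry and the recovery-time test, reusing VC, vc_covers(), and IntervalRecord from the sketches above; the names are invented here.

    typedef struct SendLogEntry {
        int pid;           /* requester, e.g. N3                        */
        VC  requester_vc;  /* requester's VC when it issued the acquire */
        VC  local_vc;      /* our own VC when the request arrived       */
        struct SendLogEntry *next;
    } SendLogEntry;

    /* On recovery of e->pid, resend exactly the IntervalRecords whose
       timestamps the requester had not yet seen but we had: those lying
       between requester_vc (exclusive) and local_vc (inclusive). */
    int should_resend(const SendLogEntry *e, const IntervalRecord *r)
    {
        return !vc_covers(&e->requester_vc, &r->timestamp)
            &&  vc_covers(&e->local_vc,     &r->timestamp);
    }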

Example

[Figure: same run as before, with N3 crashed. N1's send-log holds {N2, <0,0,0> <1,0,0>}: N2 requested at VC <0,0,0> while N1 was at <1,0,0>. N2's send-log holds {N3, <0,0,0> <1,1,0>}]

Restart

When N3 restarts, it will request the acquire at time <0,0,0>

N2 will look in its send-log and see that when it received an acquire request from N3 at <0,0,0>, it was at time <1,1,0>, so it will send the IRs of all the intervening intervals

Therefore, N3 receives the same diffs as it did before

Logging, cont.

Is the send-log sufficient to provide the level of fault-tolerance that we wanted?

Imagine N2 had failed and then restarted; could we then survive the failure of N3?

Logging

No, we could not survive the subsequent failure of N3, because N2 would no longer have its send-log

We also need a way to recreate N2's send-log

Receive-Log

On every acquire, process N logs in its receive-log its vector time before the acquire and its new vector time after seeing the IntervalRecords sent to it by M

If M fails, M’s send-log can be recreated from N’s receive-log
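A sketch of the matching receive-log entry (again with invented names); it holds exactly what the sender's send-log entry held, so the sender's log can be rebuilt from it after a failure.

    typedef struct RecvLogEntry {
        int pid;        /* sender, e.g. N2                           */
        VC  vc_before;  /* our VC when we issued the acquire         */
        VC  vc_after;   /* our VC after merging the received records */
        struct RecvLogEntry *next;
    } RecvLogEntry;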

Example

[Figure: same run. N1's send-log: {N2, <0,0,0> <1,0,0>}; N2's send-log: {N3, <0,0,0> <1,1,0>}. N2's receive-log: {N1, <0,0,0> <1,0,0>}; N3's receive-log: {N2, <0,0,0> <1,1,0>}]

If N2 were to fail, it would be restarted, and N1's send-log would ensure that N2 sees the same pages as it did originally

When, in the future, N3 sees a VC time later than the one in its receive-log (with respect to N2), it will forward the information in its receive-log to N2

N2 will then recreate its send-log, and we could once again survive future failures

Checkpointing

When we arrive at a garbage-collection point, we could checkpoint all processes

This would minimize rollback, survive concurrent failures, and empty the logs

Results

[Figure: performance results, not reproduced in this transcript]

Results 2

    Appl.   Log Size (MB)   Avg. Ckpt. Size (MB)
    Water   3.10            3.05
    SOR     0.33            7.84
    TSP     0.05            2.49

Results 3

[Figure: further results, not reproduced in this transcript]

