Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Checkpointing in Shared-Memory MPs

[Figure: execution timeline with periodic "save chkpt" points; on a fault, processors roll back to the last saved checkpoint]

• HW-based schemes for small CMPs use Global checkpointing
  – All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…

R. Agarwal, P. Garg, J. Torrellas — Rebound: Scalable Checkpointing
Alternative: Coordinated Local Checkpointing

• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant

[Figure: a global checkpoint involves all of P1–P5; coordinated local checkpoints involve only the threads that communicated, as separate groups]

+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically
Contributions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory machines

• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying the write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing
Background: In-Memory Checkpt with ReVive [Prvulovic-02]

[Figure: during execution, displacement writebacks of dirty cache lines are logged in a memory log; at a checkpoint (CHK), the application stalls while processors dump their registers and write back all dirty cache lines, logging the old memory values]
Background: In-Memory Checkpt with ReVive [Prvulovic-02]

[Figure: on a fault, caches are invalidated, modified memory lines are reverted using the log, and the old registers are restored]

• Global checkpointing: broadcast protocol
• Local coordinated checkpointing: scalable protocol
Coordinated Local Checkpointing Rules

[Figure: P1 writes x, P2 reads x. If the producer P1 rolls back, the consumer P2 must also roll back; if the consumer P2 checkpoints, the producer P1 must also checkpoint]

• P checkpoints ⇒ P's producers checkpoint
• P rolls back ⇒ P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
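The two rules above are transitive: a producer's producers must also checkpoint, and a consumer's consumers must also roll back. A minimal sketch (hypothetical helper, not the paper's code) of these rules as closures over the dependence relations:

```python
# Minimal sketch (hypothetical helper, not the paper's code) of the two
# coordination rules as transitive closures over the dependence relations.

def closure(start, neighbors):
    """All processors reachable from `start` through the given relation."""
    result, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in neighbors[p]:
            if q not in result:
                result.add(q)
                frontier.append(q)
    return result

# Hypothetical 4-processor interval: P1 produced for P2, P2 for P3.
producers = {1: set(), 2: {1}, 3: {2}, 4: set()}  # who produced data p consumed
consumers = {1: {2}, 2: {3}, 3: set(), 4: set()}  # who consumed data p produced

# Rule 1: when P3 checkpoints, its transitive producers checkpoint too.
assert closure(3, producers) == {1, 2, 3}
# Rule 2: when P1 rolls back, its transitive consumers roll back too.
assert closure(1, consumers) == {1, 2, 3}
# P4 never communicated, so it is unaffected either way.
assert closure(4, producers) == {4}
```

In Rebound these relations come from the MyProducers and MyConsumers bitmaps described later, rather than from software sets.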
Rebound Fault Model

[Figure: chip multiprocessor connected to main memory; the log is kept in SW in main memory]

• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope:
  • Fault-detection latency has an upper bound of L cycles
Rebound Architecture

[Figure: chip multiprocessor with per-node P+L1, L2 cache, directory, and Dep registers; main memory off-chip]

• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of procs that produced data consumed by the local proc
  • MyConsumers: bitmap of procs that consumed data produced by the local proc
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval
Recording Inter-Thread Dependences

[Figure, assuming a MESI protocol: P1 writes a line; it becomes Dirty in P1's cache and the directory records LW-ID = P1]
Recording Inter-Thread Dependences

[Figure: P2 reads the line; P1 writes it back (logged to the memory log) and the line becomes Shared. Since LW-ID = P1, the directory adds P1 to P2's MyProducers and P2 to P1's MyConsumers]
Recording Inter-Thread Dependences

[Figure: P1 writes the line again; it returns to Dirty in P1's cache and LW-ID remains P1]
Recording Inter-Thread Dependences

[Figure: P1 checkpoints; its dirty lines are written back and logged, and its Dep registers are cleared. LW-ID is cleared as well, but it must remain set until the line has been checkpointed]
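The steps above can be sketched as a toy directory model (my own Python abstraction of the hardware, not the paper's design; for simplicity this model clears a processor's LW-ID entries eagerly at its checkpoint, whereas the real hardware keeps LW-ID set until the line itself has been checkpointed):

```python
# Toy model (my own abstraction) of directory-based dependence recording,
# assuming a MESI-like protocol in which the directory observes every
# write and every read of a line written this interval.

class Node:
    def __init__(self):
        self.my_producers = set()  # Dep register: procs that produced data we consumed
        self.my_consumers = set()  # Dep register: procs that consumed data we produced

class Directory:
    def __init__(self, nprocs):
        self.nodes = [Node() for _ in range(nprocs)]
        self.lw_id = {}            # line address -> last writer this interval

    def write(self, p, addr):
        # The writer becomes the line's last writer (LW-ID) for this interval.
        self.lw_id[addr] = p

    def read(self, p, addr):
        # A read of a line written this interval by another processor
        # records the dependence in both processors' Dep registers.
        w = self.lw_id.get(addr)
        if w is not None and w != p:
            self.nodes[p].my_producers.add(w)
            self.nodes[w].my_consumers.add(p)

    def checkpoint(self, p):
        # At p's checkpoint, its Dep registers are cleared; here we also
        # clear its LW-ID entries, assuming its dirty lines are written back.
        self.nodes[p].my_producers.clear()
        self.nodes[p].my_consumers.clear()
        self.lw_id = {a: w for a, w in self.lw_id.items() if w != p}

d = Directory(4)
d.write(1, 0x100)  # P1 writes x: LW-ID[x] = P1
d.read(2, 0x100)   # P2 reads x: P1 -> P2.MyProducers, P2 -> P1.MyConsumers
```

After the read, P2's MyProducers holds P1 and P1's MyConsumers holds P2, which is exactly the state the checkpoint and rollback rules consume.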
Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers

[Figure: P1 initiates a checkpoint and sends "Ck?" requests to its producers, which propagate them transitively to their own producers; the interaction set grows to {P1, P2, P3}, and those processors checkpoint together while the rest continue]

• Rollback handled similarly using MyConsumers
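One way to picture the SW protocol is as message passing over the MyProducers bitmaps (a hypothetical sketch; the function and message names are my own, not the paper's):

```python
# Hypothetical message-level sketch of the SW checkpoint protocol:
# the initiator sends "Ck?" to its producers, which forward the request
# transitively, so the interaction set is the closure of MyProducers.

def run_checkpoint(initiator, my_producers):
    """Return (interaction_set, messages) for one checkpoint episode."""
    iset, frontier, msgs = {initiator}, [initiator], []
    while frontier:
        p = frontier.pop()
        for q in sorted(my_producers[p]):
            msgs.append((p, "Ck?", q))  # p asks q to join the checkpoint
            if q not in iset:
                iset.add(q)
                frontier.append(q)
    return iset, msgs

# Shape of the slide's example: P2 and P3 produced data consumed by P1.
my_producers = {1: {2, 3}, 2: set(), 3: set(), 4: set()}
iset, msgs = run_checkpoint(1, my_producers)
# P4 stays outside the interaction set and keeps running.
```

Rollback would use the same traversal over MyConsumers instead of MyProducers.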
Optimization 1: Delayed Writebacks

[Figure: timelines of intervals I1 and I2. Without the optimization, processors sync and stall at the checkpoint while writing back dirty lines. With delayed writebacks, processors sync and resume I2 immediately while I1's dirty lines are written back in the background]

• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in the background
  • Checkpoint is only completed when all delayed data has been written back
• Still need to record inter-thread dependences on delayed data
Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead
– Additional support:
  • Each processor has two sets of Dep registers
  • Each cache line has a delayed bit
– Increased vulnerability:
  • A rollback event forces both intervals to roll back
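A toy model of the mechanism (my own abstraction; the class and method names are illustrative, not the paper's):

```python
# Toy model (my own abstraction) of the delayed-writeback optimization:
# at the checkpoint, dirty lines are only *marked* delayed and execution
# resumes; the checkpoint completes once the background engine has
# written every delayed line back to safe memory.

class Cache:
    def __init__(self):
        self.dirty = set()    # addresses of dirty lines
        self.delayed = set()  # lines awaiting background writeback

    def begin_checkpoint(self):
        # Instead of stalling to write lines back, mark them and resume.
        self.delayed |= self.dirty
        self.dirty.clear()

    def background_writeback(self, safe_memory):
        # One background step: drain one delayed line into safe memory.
        if self.delayed:
            safe_memory.add(self.delayed.pop())

    def checkpoint_complete(self):
        return not self.delayed
```

The delayed set here plays the role of the per-line delayed bit: while it is non-empty, the checkpoint is not yet durable, which is why a rollback event in this window forces both intervals back.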
Optimization 2: Multiple Checkpoints

• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the max fault-detection latency (L)

[Figure: timeline with Ckpt 1 (Dep registers 1) and Ckpt 2 (Dep registers 2); a fault at tf is detected only within latency L, forcing a rollback past Ckpt 2 to a safe checkpoint]

• Solution: keep multiple checkpoints
  – On a fault, roll back interacting processors to safe checkpoints
• No domino effect
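The safety condition follows directly from the fault model's upper bound L on detection latency; a small sketch (the helper is hypothetical):

```python
# Sketch (hypothetical helper) of when a checkpoint becomes safe:
# a checkpoint taken at time t is safe only once L cycles have elapsed,
# since any fault before t would have been detected by then. Until that
# point, its Dep registers cannot be recycled.

L = 1000  # cycles: assumed upper bound on fault-detection latency

def safe_checkpoints(checkpoint_times, now):
    """Checkpoints old enough that any earlier fault is already detected."""
    return [t for t in checkpoint_times if now - t >= L]

ckpts = [0, 600, 1200]
assert safe_checkpoints(ckpts, 1300) == [0]  # only the oldest is safe yet
assert safe_checkpoints(ckpts, 2300) == [0, 600, 1200]
```

This is why a single checkpoint is not enough: the most recent one may sit inside the detection window, so rollback must be able to target an older, safe one.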
Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection
– Additional support:
  • Each checkpoint has Dep registers
  • Dep registers can be recycled only after the fault-detection latency
– Need to track communication across checkpoints
– Combination with Delayed Writebacks: one more Dep register set
Optimization 3: Hiding Chkpt behind Global Barrier

• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide the checkpoint overhead behind barrier-imbalance spins
Evaluation Setup

• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks
Avg. Interaction Set: Set of Producer Processors

[Chart: average interaction-set size per application for 64 processors]

• Most apps: the interaction set is small
  – Justifies coordinated local checkpointing
  – Averages are brought up by global barriers
Checkpoint Execution Overhead

[Chart: % checkpoint execution overhead for Global, Rebound_NoDWB, and Rebound across Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and the SPLASH-2 average; Global averages 15%, Rebound 2%]

• Rebound's avg checkpoint execution overhead is 2%
  – Compared to 15% for Global
• Delayed Writebacks complement local checkpointing
Rebound Scalability

[Chart: checkpoint overhead vs. processor count, constant problem size]

• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability
Also in the Paper

• Delayed writebacks are also useful in Global
• The barrier optimization is effective but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic
Conclusions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory machines

• Leverages the directory protocol
• Boosts checkpointing efficiency:
  • Delayed write-backs
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories