Rebound: Scalable Checkpointing for Coherent Shared...

Date post: 10-Jun-2020
Rebound: Scalable Checkpointing for Coherent Shared Memory. Rishi Agarwal, Pranav Garg, and Josep Torrellas. Department of Computer Science, University of Illinois at Urbana-Champaign. http://iacoma.cs.uiuc.edu
Transcript
Page 1: Rebound: Scalable Checkpointing for Coherent Shared ...iacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca11_2.pdf · • Processors synchronize and resume execution • Hardware automatically

Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas

Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Page 2: Checkpointing in Shared-Memory MPs

[Figure: execution timeline. Labels: save chkpt, save chkpt, fault, rollback.]

• HW-based schemes for small CMPs use global checkpointing
  – All processors participate in system-wide checkpoints

[Figure: processors P1 to P4 taking a checkpoint together.]

• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…

R. Agarwal, P. Garg, J. Torrellas · Rebound: Scalable Checkpointing

Page 3: Alternative: Coordinated Local Checkpointing

• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant

[Figure: processors P1 to P5 shown twice, once with a single global checkpoint and once with separate local checkpoints.]

+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically

Page 4: Contributions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing

Page 5: Background: In-Memory Checkpt with ReVive [Prvulovic-02]

[Figure: checkpointing with P1, P2, P3, their caches, memory, and the log. Labels during execution: dirty cache displacement writebacks (W), logging. Labels at the checkpoint (CHK): register dump, writeback (WB) of dirty cache lines, logging, application stalls.]

Page 6: Background: In-Memory Checkpt with ReVive [Prvulovic-02]

[Figure: rollback after a fault with P1, P2, P3, their caches, memory, and the log. Labels: fault, CHK, old registers restored, caches invalidated, memory lines reverted from the log.]

• Global: broadcast protocol
• Local coordinated: scalable protocol
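The checkpoint and rollback flow labeled in the two ReVive figures can be sketched as a memory-side undo log. This is a simplified illustration rather than ReVive's hardware design: the class `LoggedMemory` and its method names are hypothetical, memory is modeled as a dictionary from line address to value, and logging a line only on its first overwrite in an interval is an assumption of this sketch.

```python
class LoggedMemory:
    """Toy model of in-memory checkpointing with an undo log (hypothetical names)."""

    def __init__(self):
        self.memory = {}  # line address -> value held in main memory
        self.log = {}     # line address -> value at the last checkpoint (None = absent)

    def writeback(self, addr, value):
        # Before memory is overwritten, save the old value once per interval.
        if addr not in self.log:
            self.log[addr] = self.memory.get(addr)
        self.memory[addr] = value

    def checkpoint(self, dirty_lines):
        # Write back every dirty cache line, then start a new interval
        # by discarding the log of the previous one.
        for addr, value in dirty_lines.items():
            self.writeback(addr, value)
        self.log.clear()

    def rollback(self):
        # Revert every logged line to its checkpointed value.
        # (Real hardware would also invalidate caches and restore registers.)
        for addr, old in self.log.items():
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.log.clear()


mem = LoggedMemory()
mem.checkpoint({0x10: "a0", 0x20: "b0"})  # checkpoint (CHK)
mem.writeback(0x10, "a1")                 # displacement writeback, old value logged
mem.writeback(0x10, "a2")                 # same line again, not logged a second time
mem.rollback()                            # fault: memory lines reverted
print(mem.memory)
```

After the rollback, memory holds the values it had at the checkpoint, which is the property the "memory lines reverted" label describes.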

Page 7: Coordinated Local Checkpointing Rules

[Figure: P1 writes x (wr x) and P2 reads x (rd x), making P1 the producer and P2 the consumer. Panels: producer rollback, consumer rollback, producer chkpoint, consumer chkpoint.]

• P checkpoints → P's producers checkpoint
• P rolls back → P's consumers roll back

• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
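The two rules can be illustrated with the wr x / rd x example from the figure. The sketch below is an assumption-laden illustration, not the authors' mechanism: the function `dependences` and the event format `(op, proc, var)` are hypothetical, and only one hop of each rule is applied. The direction of each rule follows from the data flow: a consumer's checkpoint contains a value that exists only if the producer's write survives, and a producer's rollback erases a value the consumer has already used.

```python
def dependences(accesses):
    """Build producer/consumer sets from (op, proc, var) events in program order."""
    last_writer = {}              # variable -> processor that wrote it last
    producers, consumers = {}, {}
    for op, proc, var in accesses:
        if op == "wr":
            last_writer[var] = proc
        elif op == "rd" and var in last_writer and last_writer[var] != proc:
            writer = last_writer[var]
            producers.setdefault(proc, set()).add(writer)
            consumers.setdefault(writer, set()).add(proc)
    return producers, consumers


producers, consumers = dependences([("wr", "P1", "x"), ("rd", "P2", "x")])

# Rule 1: if P2 checkpoints, its producers must checkpoint too.
checkpoint_with_p2 = producers.get("P2", set())
# Rule 2: if P1 rolls back, its consumers must roll back too.
rollback_with_p1 = consumers.get("P1", set())

print(checkpoint_with_p2, rollback_with_p1)
```

Note the asymmetry: P1 can checkpoint alone and P2 can roll back alone, because neither action invalidates state held by the other processor.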

Page 8: Rebound Fault Model

[Figure: chip multiprocessor connected to main memory, with the log (in SW) alongside main memory.]

• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no fault on their own (e.g., NVM)
• Fault detection is outside our scope:
  • Fault detection latency has an upper bound of L cycles

Pages 9-11: Rebound Architecture

[Figure, built up over three slides: chip multiprocessor and main memory. Labels: P+L1, L2 cache, directory, Dep registers (MyProducers, MyConsumers), LW-ID.]

• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval

Pages 12-15: Recording Inter-Thread Dependences (assume MESI protocol)

Page 12, P1 writes:
[Figure: P1 and P2 with their MyProducers and MyConsumers registers, memory, and log. Labels: write, line dirty (D) in P1, LW-ID = P1.]

Page 13, P2 reads:
[Figure: labels: write back, logging, line shared (S), LW-ID = P1. P1's MyConsumers now holds P2 and P2's MyProducers now holds P1.]

Page 14, P1 writes:
[Figure: labels: line dirty (D) in P1, LW-ID = P1. P1's MyConsumers still holds P2 and P2's MyProducers still holds P1.]

Page 15, P1 checkpoints:
[Figure: labels: writebacks, logging, clear Dep registers, clear LW-ID.]

• LW-ID should remain set until the line is checkpointed
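The sequence on these slides (P1 writes, P2 reads, P1 writes, P1 checkpoints) can be traced with a small software model of the Dep registers and the per-line LW-ID. This is a sketch under stated assumptions, not the hardware protocol: the classes `Core` and `Directory` are hypothetical, bitmaps are plain integers with bit i standing for processor i, only a read of another processor's write is recorded as a dependence, and a checkpoint is treated as atomic so LW-ID is cleared in the same step as the writeback.

```python
class Core:
    def __init__(self, pid):
        self.pid = pid
        self.my_producers = 0  # bit i set: processor i produced data this core consumed
        self.my_consumers = 0  # bit i set: processor i consumed data this core produced


class Directory:
    def __init__(self, cores):
        self.cores = cores     # pid -> Core
        self.lw_id = {}        # line address -> last writer in the current interval

    def write(self, pid, addr):
        # A write makes this processor the last writer of the line.
        self.lw_id[addr] = pid

    def read(self, pid, addr):
        # Reading a line last written by another processor records a dependence
        # in both processors' Dep registers.
        writer = self.lw_id.get(addr)
        if writer is not None and writer != pid:
            self.cores[pid].my_producers |= 1 << writer
            self.cores[writer].my_consumers |= 1 << pid

    def checkpoint(self, pid):
        # Clear the checkpointing core's Dep registers and the LW-ID of its lines.
        # This sketch leaves other processors' registers untouched.
        core = self.cores[pid]
        core.my_producers = 0
        core.my_consumers = 0
        for addr in [a for a, w in self.lw_id.items() if w == pid]:
            del self.lw_id[addr]


cores = {1: Core(1), 2: Core(2)}
directory = Directory(cores)
directory.write(1, 0xA)   # P1 writes: LW-ID = P1
directory.read(2, 0xA)    # P2 reads: P1 -> P2 dependence recorded
directory.write(1, 0xA)   # P1 writes again: LW-ID stays P1
print(bin(cores[1].my_consumers), bin(cores[2].my_producers))
directory.checkpoint(1)   # P1 checkpoints: its Dep registers and LW-IDs are cleared
```

Using integer bitmaps mirrors the slides' description of MyProducers and MyConsumers as bitmaps with one bit per processor.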

Pages 16-20: Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers

[Figure, built up over five slides: P1 initiates a checkpoint with InteractionSet: P1. P1 sends "Ck?" to P2 and P3, and the set becomes InteractionSet: P1, P2, P3. A further "Ck?" reaches P4 while the set shown remains P1, P2, P3. The processors then checkpoint.]

• Rollback handled similarly using MyConsumers
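Building the interaction set amounts to a transitive closure over MyProducers, which can be sketched as a breadth-first traversal. The function name `interaction_set`, the dictionary representation of the registers, and the example dependences are all hypothetical, chosen so that the result matches the set P1, P2, P3 shown in the figure. The sketch does not model a processor answering a "Ck?" request in any way other than joining the set.

```python
from collections import deque


def interaction_set(initiator, my_producers):
    """Transitive closure of MyProducers, starting from the initiating processor."""
    members = {initiator}
    pending = deque([initiator])
    while pending:
        proc = pending.popleft()
        for producer in sorted(my_producers.get(proc, ())):
            if producer not in members:  # ask ("Ck?") only processors not yet in the set
                members.add(producer)
                pending.append(producer)
    return members


# Hypothetical register contents: P1 consumed from P2 and P3, P4 consumed from P1.
my_producers = {"P1": {"P2", "P3"}, "P2": {"P1"}, "P3": set(), "P4": {"P1"}}
print(sorted(interaction_set("P1", my_producers)))  # ['P1', 'P2', 'P3']
```

P4 stays out of the set in this example because it only consumed data from P1. Passing a MyConsumers mapping to the same function would give the set of processors that must roll back together.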

Page 21: Optimization 1: Delayed Writebacks

[Figure: two timelines covering interval I1, a checkpoint, and interval I2. Labels on the first: sync, stall, WB dirty lines, sync. Labels on the second: stall, sync, WB dirty lines alongside interval I2.]

• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in the background
  • Checkpoint only completed when all delayed data is written back
  • Still need to record inter-thread dependences on delayed data

Page 22: Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead
- Additional support:
    Each processor has two sets of Dep registers
    Each cache line has a delayed bit
- Increased vulnerability:
    A rollback event forces both intervals to roll back
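The role of the per-line delayed bit can be sketched with a toy cache model. Everything here is illustrative: the class `DelayedWritebackCache` and its methods are hypothetical, the second set of Dep registers is not modeled, and draining a delayed line before it is overwritten is an assumption of this sketch, made so that the checkpointed value reaches memory before the new interval changes it.

```python
class DelayedWritebackCache:
    """Toy cache whose lines carry a dirty bit and a delayed bit (hypothetical names)."""

    def __init__(self):
        self.lines = {}        # addr -> [value, dirty, delayed]
        self.safe_memory = {}  # addr -> value already written back

    def write(self, addr, value):
        line = self.lines.get(addr)
        if line and line[2]:
            self._drain(addr)  # checkpointed value must reach memory first
        self.lines[addr] = [value, True, False]

    def start_checkpoint(self):
        # Mark dirty lines as delayed and let the processor resume at once.
        for line in self.lines.values():
            if line[1]:
                line[2] = True

    def _drain(self, addr):
        value = self.lines[addr][0]
        self.safe_memory[addr] = value
        self.lines[addr] = [value, False, False]

    def background_step(self):
        # Write back one delayed line; return False when none is left.
        for addr, line in self.lines.items():
            if line[2]:
                self._drain(addr)
                return True
        return False

    def checkpoint_complete(self):
        return not any(line[2] for line in self.lines.values())


cache = DelayedWritebackCache()
cache.write(1, "a")
cache.write(2, "b")
cache.start_checkpoint()             # processors synchronize and resume
done_early = cache.checkpoint_complete()
cache.write(1, "a2")                 # new interval touches a delayed line
while cache.background_step():       # hardware drains the rest in the background
    pass
print(done_early, cache.checkpoint_complete(), cache.safe_memory)
```

The checkpoint is reported complete only once no delayed line remains, matching the bullet that a checkpoint completes when all delayed data is written back.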

Page 23: Optimization 2: Multiple Checkpoints

• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the max fault-detection latency (L)

[Figure: timeline with Ckpt 1 (Dep registers 1) and Ckpt 2 (Dep registers 2). Labels: fault at time tf, fault detection latency, rollback.]

• Solution: keep multiple checkpoints
  – On a fault, roll back interacting processors to safe checkpoints
• No domino effect

Page 24: Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection
- Additional support:
    Each checkpoint has Dep registers
    Dep registers can be recycled only after the fault detection latency
- Need to track communication across checkpoints
- Combination with Delayed Writebacks: one more Dep register set
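Why several checkpoints are needed follows from a short calculation: a fault detected at cycle D may have occurred as early as cycle D - L, so only checkpoints taken before D - L are known to be clean. The sketch below picks the rollback target under that reasoning. The function name `safe_checkpoint` and the example cycle counts are hypothetical, and the strict comparison is a conservative choice of this sketch.

```python
def safe_checkpoint(checkpoint_times, detect_time, max_latency):
    """Return the latest checkpoint taken strictly before detect_time - max_latency.

    All arguments are in cycles. Returns None if no checkpoint is old enough.
    """
    earliest_fault = detect_time - max_latency
    safe = [t for t in checkpoint_times if t < earliest_fault]
    return max(safe) if safe else None


# Hypothetical numbers: checkpoints at cycles 100, 200, 300; L = 50 cycles.
target = safe_checkpoint([100, 200, 300], detect_time=320, max_latency=50)
print(target)  # 200
```

In this example the checkpoint at cycle 300 cannot be trusted because the fault may have struck at cycle 270, so the rollback target is the checkpoint at cycle 200. With only the most recent checkpoint kept, no safe target would exist.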

Page 25: Optimization 3: Hiding Chkpt behind Global Barrier

• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide checkpoint overhead behind barrier imbalance spins

Page 26: Evaluation Setup

• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks

Page 27: Avg. Interaction Set: Set of Producer Processors

[Figure: chart of average interaction set size; the values 64 and 38 are marked.]

• Most apps: interaction set is a small set
  – Justifies coordinated local checkpointing
  – Averages brought up by global barriers

Pages 28-29: Checkpoint Execution Overhead

[Figure: bar chart of % checkpoint overhead (axis ticks 0 to 40) for Global, Rebound_NoDWB, and Rebound. Applications: Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and SP2-AVG. The values 15 and 2 are marked.]

• Rebound's avg checkpoint execution overhead is 2%
  – Compared to 15% for Global
• Delayed Writebacks complement local checkpointing

Page 30: Rebound Scalability

[Figure: scalability chart. Label: constant problem size.]

• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability

Page 31: Also in the Paper

• Delayed writebacks are also useful in Global
• Barrier optimization is effective but not universally applicable
• Power increase due to hardware additions < 2%
• Rebound leads to only a 4% increase in coherence traffic

Page 32: Conclusions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages directory protocol
• Boosts checkpointing efficiency:
  • Delayed write-backs
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware coherent machines
  • Scalability to hierarchical directories
