Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Checkpointing in Shared-Memory MPs

[Figure: execution timeline with periodic "save chkpt" points; on a fault, processors roll back to the last saved checkpoint]

• HW-based schemes for small CMPs use Global checkpointing
  – All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…

R. Agarwal, P. Garg, J. Torrellas — Rebound: Scalable Checkpointing
Alternative: Coordinated Local Checkpointing

• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant

[Figure: a global checkpoint involves all of P1–P5; coordinated local checkpoints involve only the threads that communicated, as separate groups]

+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically
Contributions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory machines

• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying the write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing
Background: In-Memory Checkpt with ReVive [Prvulovic-02]

[Figure: during execution, displacement writebacks of dirty cache lines are logged in a memory log; at a checkpoint (CHK), the application stalls while processors dump their registers and write back all dirty cache lines, logging the old memory values]
Background: In-Memory Checkpt with ReVive [Prvulovic-02]

[Figure: on a fault, caches are invalidated, modified memory lines are reverted using the log, and the old registers are restored]

• Global checkpointing: broadcast protocol
• Local coordinated checkpointing: scalable protocol
Coordinated Local Checkpointing Rules

[Figure: P1 writes x, P2 reads x. If the producer P1 rolls back, the consumer P2 must also roll back; if the consumer P2 checkpoints, the producer P1 must also checkpoint]

• P checkpoints ⇒ P's producers checkpoint
• P rolls back ⇒ P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
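The two rules above are transitive: a producer's producers must also checkpoint, and a consumer's consumers must also roll back. A minimal sketch (hypothetical helper, not the paper's code) of these rules as closures over the dependence relations:

```python
# Minimal sketch (hypothetical helper, not the paper's code) of the two
# coordination rules as transitive closures over the dependence relations.

def closure(start, neighbors):
    """All processors reachable from `start` through the given relation."""
    result, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in neighbors[p]:
            if q not in result:
                result.add(q)
                frontier.append(q)
    return result

# Hypothetical 4-processor interval: P1 produced for P2, P2 for P3.
producers = {1: set(), 2: {1}, 3: {2}, 4: set()}  # who produced data p consumed
consumers = {1: {2}, 2: {3}, 3: set(), 4: set()}  # who consumed data p produced

# Rule 1: when P3 checkpoints, its transitive producers checkpoint too.
assert closure(3, producers) == {1, 2, 3}
# Rule 2: when P1 rolls back, its transitive consumers roll back too.
assert closure(1, consumers) == {1, 2, 3}
# P4 never communicated, so it is unaffected either way.
assert closure(4, producers) == {4}
```

In Rebound these relations come from the MyProducers and MyConsumers bitmaps described later, rather than from software sets.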
Rebound Fault Model

[Figure: chip multiprocessor connected to main memory; the log is kept in SW in main memory]

• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope:
  • Fault-detection latency has an upper bound of L cycles
Rebound Architecture

[Figure: chip multiprocessor with per-node P+L1, L2 cache, directory, and Dep registers; main memory off-chip]

• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of procs that produced data consumed by the local proc
  • MyConsumers: bitmap of procs that consumed data produced by the local proc
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval
Recording Inter-Thread Dependences

[Figure, assuming a MESI protocol: P1 writes a line; it becomes Dirty in P1's cache and the directory records LW-ID = P1]
Recording Inter-Thread Dependences

[Figure: P2 reads the line; P1 writes it back (logged to the memory log) and the line becomes Shared. Since LW-ID = P1, the directory adds P1 to P2's MyProducers and P2 to P1's MyConsumers]
Recording Inter-Thread Dependences

[Figure: P1 writes the line again; it returns to Dirty in P1's cache and LW-ID remains P1]
Recording Inter-Thread Dependences

[Figure: P1 checkpoints; its dirty lines are written back and logged, and its Dep registers are cleared. LW-ID is cleared as well, but it must remain set until the line has been checkpointed]
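The steps above can be sketched as a toy directory model (my own Python abstraction of the hardware, not the paper's design; for simplicity this model clears a processor's LW-ID entries eagerly at its checkpoint, whereas the real hardware keeps LW-ID set until the line itself has been checkpointed):

```python
# Toy model (my own abstraction) of directory-based dependence recording,
# assuming a MESI-like protocol in which the directory observes every
# write and every read of a line written this interval.

class Node:
    def __init__(self):
        self.my_producers = set()  # Dep register: procs that produced data we consumed
        self.my_consumers = set()  # Dep register: procs that consumed data we produced

class Directory:
    def __init__(self, nprocs):
        self.nodes = [Node() for _ in range(nprocs)]
        self.lw_id = {}            # line address -> last writer this interval

    def write(self, p, addr):
        # The writer becomes the line's last writer (LW-ID) for this interval.
        self.lw_id[addr] = p

    def read(self, p, addr):
        # A read of a line written this interval by another processor
        # records the dependence in both processors' Dep registers.
        w = self.lw_id.get(addr)
        if w is not None and w != p:
            self.nodes[p].my_producers.add(w)
            self.nodes[w].my_consumers.add(p)

    def checkpoint(self, p):
        # At p's checkpoint, its Dep registers are cleared; here we also
        # clear its LW-ID entries, assuming its dirty lines are written back.
        self.nodes[p].my_producers.clear()
        self.nodes[p].my_consumers.clear()
        self.lw_id = {a: w for a, w in self.lw_id.items() if w != p}

d = Directory(4)
d.write(1, 0x100)  # P1 writes x: LW-ID[x] = P1
d.read(2, 0x100)   # P2 reads x: P1 -> P2.MyProducers, P2 -> P1.MyConsumers
```

After the read, P2's MyProducers holds P1 and P1's MyConsumers holds P2, which is exactly the state the checkpoint and rollback rules consume.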
Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers

[Figure: P1 initiates a checkpoint and sends "Ck?" requests to its producers, which propagate them transitively to their own producers; the interaction set grows to {P1, P2, P3}, and those processors checkpoint together while the rest continue]

• Rollback handled similarly using MyConsumers
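One way to picture the SW protocol is as message passing over the MyProducers bitmaps (a hypothetical sketch; the function and message names are my own, not the paper's):

```python
# Hypothetical message-level sketch of the SW checkpoint protocol:
# the initiator sends "Ck?" to its producers, which forward the request
# transitively, so the interaction set is the closure of MyProducers.

def run_checkpoint(initiator, my_producers):
    """Return (interaction_set, messages) for one checkpoint episode."""
    iset, frontier, msgs = {initiator}, [initiator], []
    while frontier:
        p = frontier.pop()
        for q in sorted(my_producers[p]):
            msgs.append((p, "Ck?", q))  # p asks q to join the checkpoint
            if q not in iset:
                iset.add(q)
                frontier.append(q)
    return iset, msgs

# Shape of the slide's example: P2 and P3 produced data consumed by P1.
my_producers = {1: {2, 3}, 2: set(), 3: set(), 4: set()}
iset, msgs = run_checkpoint(1, my_producers)
# P4 stays outside the interaction set and keeps running.
```

Rollback would use the same traversal over MyConsumers instead of MyProducers.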
Optimization 1: Delayed Writebacks

[Figure: timelines of intervals I1 and I2. Without the optimization, processors sync and stall at the checkpoint while writing back dirty lines. With delayed writebacks, processors sync and resume I2 immediately while I1's dirty lines are written back in the background]

• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in the background
  • Checkpoint is only completed when all delayed data has been written back
• Still need to record inter-thread dependences on delayed data
Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead
– Additional support:
  • Each processor has two sets of Dep registers
  • Each cache line has a delayed bit
– Increased vulnerability:
  • A rollback event forces both intervals to roll back
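A toy model of the mechanism (my own abstraction; the class and method names are illustrative, not the paper's):

```python
# Toy model (my own abstraction) of the delayed-writeback optimization:
# at the checkpoint, dirty lines are only *marked* delayed and execution
# resumes; the checkpoint completes once the background engine has
# written every delayed line back to safe memory.

class Cache:
    def __init__(self):
        self.dirty = set()    # addresses of dirty lines
        self.delayed = set()  # lines awaiting background writeback

    def begin_checkpoint(self):
        # Instead of stalling to write lines back, mark them and resume.
        self.delayed |= self.dirty
        self.dirty.clear()

    def background_writeback(self, safe_memory):
        # One background step: drain one delayed line into safe memory.
        if self.delayed:
            safe_memory.add(self.delayed.pop())

    def checkpoint_complete(self):
        return not self.delayed
```

The delayed set here plays the role of the per-line delayed bit: while it is non-empty, the checkpoint is not yet durable, which is why a rollback event in this window forces both intervals back.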
Optimization 2: Multiple Checkpoints

• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the max fault-detection latency (L)

[Figure: timeline with Ckpt 1 (Dep registers 1) and Ckpt 2 (Dep registers 2); a fault at tf is detected only within latency L, forcing a rollback past Ckpt 2 to a safe checkpoint]

• Solution: keep multiple checkpoints
  – On a fault, roll back interacting processors to safe checkpoints
• No domino effect
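The safety condition follows directly from the fault model's upper bound L on detection latency; a small sketch (the helper is hypothetical):

```python
# Sketch (hypothetical helper) of when a checkpoint becomes safe:
# a checkpoint taken at time t is safe only once L cycles have elapsed,
# since any fault before t would have been detected by then. Until that
# point, its Dep registers cannot be recycled.

L = 1000  # cycles: assumed upper bound on fault-detection latency

def safe_checkpoints(checkpoint_times, now):
    """Checkpoints old enough that any earlier fault is already detected."""
    return [t for t in checkpoint_times if now - t >= L]

ckpts = [0, 600, 1200]
assert safe_checkpoints(ckpts, 1300) == [0]  # only the oldest is safe yet
assert safe_checkpoints(ckpts, 2300) == [0, 600, 1200]
```

This is why a single checkpoint is not enough: the most recent one may sit inside the detection window, so rollback must be able to target an older, safe one.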
Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection
– Additional support:
  • Each checkpoint has Dep registers
  • Dep registers can be recycled only after the fault-detection latency
– Need to track communication across checkpoints
– Combination with Delayed Writebacks: one more Dep register set
Optimization 3: Hiding Chkpt behind Global Barrier

• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide the checkpoint overhead behind barrier-imbalance spins
Evaluation Setup

• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks
Avg. Interaction Set: Set of Producer Processors

[Chart: average interaction-set size per application for 64 processors]

• Most apps: the interaction set is small
  – Justifies coordinated local checkpointing
  – Averages are brought up by global barriers
Checkpoint Execution Overhead

[Chart: % checkpoint execution overhead for Global, Rebound_NoDWB, and Rebound across Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and the SPLASH-2 average; Global averages 15%, Rebound 2%]

• Rebound's avg checkpoint execution overhead is 2%
  – Compared to 15% for Global
• Delayed Writebacks complement local checkpointing
Rebound Scalability

[Chart: checkpoint overhead vs. processor count, constant problem size]

• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability
Also in the Paper

• Delayed writebacks are also useful in Global
• The barrier optimization is effective but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic
Conclusions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory machines

• Leverages the directory protocol
• Boosts checkpointing efficiency:
  • Delayed write-backs
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories