Towards a Hardware-Software
Co-Designed Resilient System
Man-Lap (Alex) Li, Pradeep Ramachandran,
Sarita Adve, Vikram Adve, Yuanyuan Zhou
University of Illinois at Urbana-Champaign
In collaboration with
Pradip Bose (IBM) and Subhasish Mitra (Stanford)
Motivation
• Failures will happen in the field
– Design defects
– Aging
– Soft errors
– Inadequate burn-in
– Aggressive design for power/performance/reliability
– …
• Low-cost method to detect/recover from all sources of failure?
– Reliability problem pervasive across many markets
– Traditional solutions (e.g., nMR) too expensive
– Must incur low performance and power overheads
A Low-Cost, Unified Reliability Solution
• Need to handle only faults that propagate to software
– Hardware faults appear as software bugs
– Leverage software reliability solutions for hardware?
• One-size-fits-all near-100% coverage often unnecessary
– Solution must be customizable to application needs
Outline
• Motivation for the Framework
• Unified Framework for H/W + S/W Reliability
• Understanding the Impact of H/W Failures on S/W
• Future Work
Unified Framework for H/W + S/W Reliability
• Unified hardware/software co-designed framework
– Tackles hardware and software faults
– Software-centric solutions with near-zero H/W overhead
– Customizable to app needs, flexible for new error sources
[Figure: execution timeline with periodic checkpoints, illustrating three cases]
– No error: a fault occurs but the resulting error is masked; execution continues
– Ideal case (symptom-based detection): a fault becomes an error, a software symptom is detected, then diagnosis, repair, and recovery roll back to a checkpoint
– Error undetected by symptoms: caught later by online testing, with more detection overhead
Framework Components
• Detection: Software symptoms, online testing (see the sketch at the end of this slide)
• Recovery: Software/hardware checkpoint and rollback
• Diagnosis: Firmware layer for rollback/replay, online testing
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Need to understand how hardware faults propagate to S/W
– How do hardware faults become visible to software?
– What is the latency?
– Do H/W faults affect application and/or system state?
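A minimal user-level sketch of the symptom-based detection idea (our own illustration, assuming a POSIX environment; the framework's actual detectors live in hardware, firmware, and the OS). Fatal signals stand in for hardware fatal traps, and a heartbeat watchdog stands in for the hardware hang detector:

    #include <atomic>
    #include <chrono>
    #include <csignal>
    #include <thread>
    #include <unistd.h>

    std::atomic<long> heartbeat{0};   // advanced by the monitored work loop

    // Fatal traps surface to user code as fatal signals; treat them as symptoms.
    void on_fatal_signal(int) {
        const char msg[] = "symptom: fatal trap detected\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);   // async-signal-safe reporting
        _exit(1);                                     // placeholder for rollback/diagnosis
    }

    // Hang symptom: the heartbeat stops advancing for a full watchdog period.
    void hang_watchdog() {
        long last = heartbeat.load();
        while (true) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            long now = heartbeat.load();
            if (now == last) {
                const char msg[] = "symptom: hang detected\n";
                write(STDERR_FILENO, msg, sizeof(msg) - 1);
                _exit(2);                             // placeholder for rollback/diagnosis
            }
            last = now;
        }
    }

    int main() {
        std::signal(SIGSEGV, on_fatal_signal);
        std::signal(SIGBUS,  on_fatal_signal);
        std::signal(SIGILL,  on_fatal_signal);
        std::signal(SIGFPE,  on_fatal_signal);
        std::thread(hang_watchdog).detach();
        while (true) heartbeat.fetch_add(1);          // stand-in for the monitored application
    }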
Methodology
• Microarchitecture-level fault injection
– Trade-off between accuracy and simulation time
– GEMS timing models for out-of-order processor, memory
– Simics full-system simulation of Solaris + UltraSPARC III
– SPEC workloads for ten million instructions
• Fault model
– Stuck-at, bridging faults in many micro-arch structures (see the sketch at the end of this slide)
• Fault detection
– Crashes detected through hardware-generated fatal traps
  (misaligned memory access, RED state, watchdog reset, etc.)
– Hangs detected using simple hardware hang detector
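To make the fault model concrete, below is a small illustrative sketch (our own, not the actual GEMS/Simics injection code) of how a permanent stuck-at or bridging fault might be applied to a latch value on every access to the injected structure:

    #include <cstdint>

    enum class FaultType { StuckAt0, StuckAt1, BridgeToGnd, BridgeToVcc };

    struct Fault {
        FaultType type;
        unsigned  bit;   // faulty bit position within the injected structure's field
    };

    // Corrupt 'value' as the fault model dictates; for a permanent fault this is
    // called on every access to the faulty structure during the injected run.
    uint64_t apply_fault(uint64_t value, const Fault& f) {
        uint64_t mask = 1ULL << f.bit;
        switch (f.type) {
            case FaultType::StuckAt0:    return value & ~mask;   // bit forced to 0
            case FaultType::StuckAt1:    return value |  mask;   // bit forced to 1
            // Bridging faults: here the aggressor is modeled simply as gnd/Vcc,
            // so they behave like stuck-at; a fuller model would condition on the
            // aggressor line's current value.
            case FaultType::BridgeToGnd: return value & ~mask;
            case FaultType::BridgeToVcc: return value |  mask;
        }
        return value;
    }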
How do Hardware Faults Propagate to Software?
• 97% of faults (excluding FPU) detectable with simple H/W & S/W
– Need H/W support or S/W monitoring for the FPU
[Figure: distribution of outcomes (Masked, Crash, Hang, Other) for stuck-at-0, stuck-at-1, bridging-to-gnd, and bridging-to-Vcc faults injected into the decoder, INT ALU, FP ALU, register data bus, integer register file, ROB, and RAT.]
How do Hardware Faults Propagate to Software?
• 97% of faults (excluding FPU) detectable with simple H/W & S/W
– Need H/W support or S/W monitoring for the FPU
• > 50% of crashes/hangs occur in the OS
[Figure: same injections as above, with crashes and hangs split by where they occur (Mask, Crash-App, Crash-OS, Hang-App, Hang-OS, Other).]
S/W Components Corrupted
• 62% of faults corrupt system state
– Need to recover system state
[Figure: for crashes and hangs from faults in the decoder, INT ALU, register data bus, integer register file, ROB, and RAT, the fraction of cases corrupting application state, system (OS) state, or neither.]
Latency to Detection from Application Corruption
• 80% have detection latency < 100K instructions, amenable to H/W recovery
– Buffering for 50 µs on a 2 GHz processor
• May need to use software checkpoint/recovery for others
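A rough back-of-the-envelope check of that buffering window (our arithmetic, assuming an IPC near 1): a 2 GHz processor executes about 2×10^9 × 50×10^-6 = 100,000 instructions in 50 µs, which matches the 100K-instruction latency bound above.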
Total instructions executed between app state corruption and detection
[Figure: detection latency distribution (<100, <1K, <100K, <10M instructions) for crashes and hangs from faults in each injected structure.]
Latency to Detection from OS Corruption
• 92% of injections result in latency of < 100K OS instructions
– Amenable to hardware recovery
OS-only instructions executed between OS state corruption and detection
[Figure: detection latency distribution (<100, <1K, <100K, <10M OS instructions) for crashes and hangs from faults in each injected structure.]
Summary so far
• Hardware faults highly visible
– Over 97% of faults in 6 structures result in crashes/hangs
– Simple H/W and S/W sufficient
• Recovery through checkpointing
– S/W and/or H/W checkpoints for application recovery
– H/W checkpoints and buffering for OS recovery
Next Steps (1 of 3)
• Improving understanding of fault propagation
– Accurate fault models, effect of transients, intermittents
– Lower-level simulations
– Better workloads
• Detection
– More software-level monitoring: software signals, invariants, perturbations, …
– H/W support to aid detection in some structures (e.g., FPU)
– Selective backup testing
• Recovery
– Enhanced detection may reduce latency
– Explore software vs. hardware, application customizability
Next Steps (2 of 3)
• Diagnosis
– Assume rollback/restart mechanism, multicore system
• Bug detected → rollback to the previous checkpoint, restart on the original core
– No symptom on replay: transient h/w fault or non-deterministic s/w bug → continue execution
– Original symptom recurs: deterministic s/w bug or permanent h/w fault → rollback, restart on a different core
  · Original symptom recurs on the new core: deterministic s/w bug
  · No symptom on the new core: permanent defect in the original core
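A compact sketch of this decision procedure (our own rendering; restore_checkpoint, replay_on_core, and symptom_observed are hypothetical placeholders for the firmware's rollback/replay support, stubbed out so the sketch compiles):

    #include <cstdio>

    enum class Diagnosis { TransientHwOrNondetSwBug, DeterministicSwBug, PermanentHwFault };

    // Hypothetical firmware hooks (placeholders only).
    bool restore_checkpoint() { return true; }   // roll state back to the last checkpoint
    bool replay_on_core(int)  { return true; }   // re-execute from the checkpoint on a given core
    bool symptom_observed()   { return false; }  // did the replay raise the symptom again?

    Diagnosis diagnose(int faulty_core, int spare_core) {
        // Step 1: rollback and replay on the core that raised the symptom.
        restore_checkpoint();
        replay_on_core(faulty_core);
        if (!symptom_observed())
            return Diagnosis::TransientHwOrNondetSwBug;   // continue execution

        // Step 2: symptom recurred; replay on a different core to separate
        // deterministic software bugs from permanent hardware faults.
        restore_checkpoint();
        replay_on_core(spare_core);
        if (symptom_observed())
            return Diagnosis::DeterministicSwBug;         // symptom follows the software
        return Diagnosis::PermanentHwFault;               // defect stays with the original core
    }

    int main() {
        Diagnosis d = diagnose(/*faulty_core=*/0, /*spare_core=*/1);
        std::printf("diagnosis: %d\n", static_cast<int>(d));
    }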
Next Steps (3 of 3)
• Repair/reconfigure
– What should be the right field configurable unit?
– Core, FU, array entries?
• Avoidance
– Dynamic reliability management
• Implementation architecture
– Hardware + firmware + OS
– Itanium machine check architecture has hooks
Thank You
Questions?
Backup Slides
Types of fatal traps
• Faults cause different fatal traps to be thrown before crashes
– Junk data access leads to memory misalignment (illustrated below)
– Repeated trapping leads to RED state
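Our own small illustration (hypothetical addresses) of why "junk" data often shows up as a misaligned-access fatal trap: flipping a single low-order address bit is enough to violate the alignment required for a word access on SPARC.

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t good_addr = 0x2000;              // properly aligned for an 8-byte load
        uint64_t corrupted = good_addr ^ 0x1;     // a single flipped/stuck-at bit
        bool misaligned = (corrupted % 8) != 0;
        std::printf("addr 0x%llx misaligned: %s\n",
                    (unsigned long long)corrupted, misaligned ? "yes" : "no");
        // On UltraSPARC, issuing the load to 'corrupted' would raise a
        // mem_address_not_aligned trap, which surfaces as a fatal symptom.
        return 0;
    }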
[Figure: breakdown of fatal trap types (data access exception, division by zero, illegal instruction, watchdog reset, memory misaligned, RED state), in the application and the OS, for faults injected into each structure.]