Page 1

SWAT: Designing Resilient Hardware by Treating Software Anomalies

Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Rob Smolinski, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou

Department of Computer Science

University of Illinois at Urbana-Champaign

[email protected]

Page 2

Motivation

• Hardware will fail in the field for several reasons

– Transient errors (high-energy particles)

– Wear-out (devices are weaker)

– Design bugs

– … and so on

⇒ Need in-field detection, diagnosis, recovery, repair

• Reliability problem pervasive across many markets

– Traditional redundancy solutions (e.g., nMR) too expensive

⇒ Need low-cost solutions for multiple failure sources

* Must incur low area, performance, power overhead

Page 3

Observations

• Need to handle only hardware faults that propagate to software

• Fault-free case remains common, must be optimized

Watch for software anomalies (symptoms)

– Zero to low overhead “always-on” monitors

Diagnose cause after symptom detected

– May incur high overhead, but rarely invoked

SWAT: SoftWare Anomaly Treatment

Page 4

SWAT Framework Components

• Detection: Symptoms of software misbehavior

• Recovery: Checkpoint and rollback

• Diagnosis: Rollback/replay on multicore

• Repair/reconfiguration: Redundant, reconfigurable hardware

• Flexible control through firmware

[Figure: timeline from fault to error to detected symptom, with periodic checkpoints, followed by recovery, diagnosis, and repair]

Page 5

Advantages of SWAT

• Handles all faults that matter

– Oblivious to low-level failure modes and masked faults

• Low, amortized overheads

– Optimize for common case, exploit SW reliability solutions

• Customizable and flexible

– Firmware control adapts to specific reliability needs

• Holistic systems view enables novel solutions

– Synergistic detection, diagnosis, recovery solutions

• Beyond hardware reliability

– Long term goal: unified system (HW+SW) reliability

– Potential application to post-silicon test and debug

Page 6

SWAT Contributions

• Very low-cost detectors [ASPLOS’08, DSN’08]

– Low SDC rate, latency

• In-situ diagnosis [DSN’08]

• Accurate fault modeling [HPCA’09]

• Multithreaded workloads [MICRO’09]

• Application-Aware SWAT

– Even lower SDC, latency

[Figure: timeline from fault to error to detected symptom, with checkpoints, recovery, diagnosis, and repair, annotated with the contributions above]

Page 7

Outline

• Motivation

• Detection

• Recovery analysis

• Diagnosis

• Conclusions and future work

Page 8

Simple Fault Detectors [ASPLOS ’08]

• Simple detectors that observe anomalous SW behavior

• Very low hardware area, performance overhead

[Figure: five symptom detectors reporting to the SWAT firmware]

– Fatal Traps: division by zero, RED state, etc.

– Kernel Panic: OS enters panic state due to fault

– High OS: high contiguous OS activity

– Hangs: simple HW hang detector

– App Abort: application abort due to fault
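
A minimal sketch of how always-on symptom monitoring of this kind might be modelled in software, written in Python for illustration. The `Event` record, the thresholds `HANG_WINDOW` and `HIGH_OS_LIMIT`, and the repeated-PC hang heuristic are all assumptions made for this example; the slides do not specify them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical thresholds chosen only to keep the example small.
HANG_WINDOW = 3      # identical consecutive PCs treated as a hang
HIGH_OS_LIMIT = 5    # contiguous OS-mode events tolerated before "High OS"


@dataclass
class Event:
    """One retired-instruction record (hypothetical format)."""
    pc: int
    in_os: bool = False
    fatal_trap: bool = False
    kernel_panic: bool = False
    app_abort: bool = False


def first_symptom(events: Iterable[Event]) -> Optional[Tuple[int, str]]:
    """Return (index, symptom) for the first symptom observed, else None."""
    os_run = 0
    same_pc_run = 0
    last_pc = None
    for i, ev in enumerate(events):
        if ev.fatal_trap:
            return i, "fatal-trap"
        if ev.kernel_panic:
            return i, "kernel-panic"
        if ev.app_abort:
            return i, "app-abort"
        # High OS: count contiguous events spent in the OS.
        os_run = os_run + 1 if ev.in_os else 0
        if os_run > HIGH_OS_LIMIT:
            return i, "high-os"
        # Hang: toy heuristic, the same PC repeating back to back.
        same_pc_run = same_pc_run + 1 if ev.pc == last_pc else 1
        last_pc = ev.pc
        if same_pc_run >= HANG_WINDOW:
            return i, "hang"
    return None
```

Each check is a counter or a flag test, which is the point of the slide: the monitors are cheap enough to leave on, and anything costly is deferred until a symptom fires.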

Page 9

Evaluating Fault Detectors

• Simulate OpenSolaris on out-of-order processor

– GEMS timing models + Simics full-system simulator

• I/O- and compute-intensive applications

– Client-server – apache, mysql, squid, sshd

– All SPEC 2K C/C++ – 12 Integer, 4 FP

• µarchitecture-level fault injections (single fault model)

– Stuck-at, transient faults in 8 µarch units

– ~18,000 total faults ⇒ statistically significant

[Figure: after fault injection, timing simulation runs for 10M instr; if no symptom in 10M instr, functional simulation runs to completion. A fault with no symptom is either masked or a potential Silent Data Corruption (SDC)]

Page 10

Metrics for Fault Detection

• Potential SDC rate

– Undetected fault that changes app output

– Output change may or may not be important

• Detection Latency

– Latency from architecture state corruption to detection

* Architecture state = registers + memory

* Will improve later

– High detection latency impedes recovery
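
The outcome classes used in the evaluation (masked, detected, potential SDC) and the potential SDC rate can be sketched as below. This is an illustrative reading of the slides rather than the authors' tooling; the function names and the idea of comparing against a `golden_output` value are assumptions.

```python
from collections import Counter
from typing import Iterable


def classify(symptom_seen: bool, output, golden_output) -> str:
    """Classify one injected fault.

    detected      : some symptom detector fired
    masked        : no symptom and the output matches the fault-free run
    potential_sdc : no symptom but the output differs
    """
    if symptom_seen:
        return "detected"
    return "masked" if output == golden_output else "potential_sdc"


def potential_sdc_rate(outcomes: Iterable[str]) -> float:
    """Fraction of all injected faults that end up as potential SDCs."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts["potential_sdc"] / total if total else 0.0
```

The rate is taken over all injections, masked ones included, which matches how the later slides report counts such as 34 (0.38%).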

Page 11

SDC Rate of Simple Detectors: SPEC, permanents

• 0.6% potential SDC rate for permanents in SPEC, without FPU

• Faults in FPU need different detectors

– Mostly corrupt only data

[Figure: Permanent Faults. Stacked bars of total injections (0% to 100%), split into Masked, Detected, and SDC, for Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, and Total No FP. Data labels: 0.4, 0.7, 0.6, 0.6]

Page 12

Potential SDC Rate

• SWAT detectors highly effective for hardware faults

– Low potential SDC rates across workloads

[Figure: injected faults (0% to 100%) split into Masked, Detected, and Potential SDC, for Server and SPEC workloads under Permanents and Transients. Data labels: 0.4, 0.7, 0.6, 0.6]

Page 13

Detection Latency

• 90% of the faults detected in under 10M instructions

• Existing work claims these are recoverable w/ HW chkpting

– More recovery analysis follows later

[Figure: two panels, Server workloads and SPEC workloads. Detected faults (40% to 100%) versus detection latency in instructions (<10K, <100K, <1M, <10M, >10M), for Permanent and Transient faults]

Page 14

Exploiting Application Support for Detection

• Techniques inspired by software bug detection

– Likely program invariants: iSWAT

* Instrumented binary, no hardware changes

* <5% performance overhead on x86 processors

– Detecting out-of-bounds addresses

* Low hardware overhead, near-zero performance impact

• Exploiting application-level resiliency

Page 16

Low-Cost Out-of-Bounds Detector

• Sophisticated detector for security, software bugs

– Track object accessed, validate pointer accesses

– Require full-program analysis, changes to binary

• Bad addresses from HW faults more obvious

– Invalid pages, unallocated memory, etc.

• Low-cost out-of-bounds detector

– Monitor boundaries of heap, stack, globals

– Address beyond these bounds ⇒ HW fault

- SW communicates boundaries to HW

- HW enforces checks on ld/st address

[Figure: App Address Space, with regions App Code, Globals, Heap, Stack, Libraries, Empty, and Reserved]
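
The low-cost check described on this slide amounts to a range comparison on every load/store address. A minimal Python sketch follows; the class and method names and the example addresses are hypothetical, and a real design would do this in hardware with bounds registers.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Region:
    start: int  # inclusive lower bound
    end: int    # exclusive upper bound


class OutOfBoundsDetector:
    """Toy model: SW communicates region bounds, HW checks ld/st addresses."""

    def __init__(self) -> None:
        self.regions: Dict[str, Region] = {}

    def set_bounds(self, name: str, start: int, end: int) -> None:
        # Called by software whenever a boundary moves (heap grows, etc.).
        self.regions[name] = Region(start, end)

    def in_bounds(self, addr: int) -> bool:
        # An address outside every monitored region is flagged as a symptom.
        return any(r.start <= addr < r.end for r in self.regions.values())


# Hypothetical layout used only for illustration.
detector = OutOfBoundsDetector()
detector.set_bounds("globals", 0x1000, 0x2000)
detector.set_bounds("heap", 0x2000, 0x8000)
detector.set_bounds("stack", 0xF000, 0x10000)
```

Unlike the per-object tracking used for security or software-bug detection, this only needs a handful of bounds, which is why no full-program analysis or binary change is required.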

Page 17

Impact of Out-of-Bounds Detector

Lower potential SDC rate in server workloads

– 39% lower for permanents, 52% for transients

For SPEC workloads, impact is on detection latency

[Figure: injected faults (0% to 100%) split into Masked, Detect-OoB, Detect-Other, and Potential SDC, for SWAT and OoB under Permanents and Transients. Potential SDC labels, Server Workloads: permanents 0.38 (SWAT), 0.23 (OoB); transients 0.58 (SWAT), 0.28 (OoB). SPEC Workloads: permanents 0.67 (SWAT), 0.63 (OoB); transients 0.65 (SWAT), 0.65 (OoB)]
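
As a quick arithmetic check, the 39% and 52% figures follow from the server-workload bar labels if 0.38/0.23 and 0.58/0.28 are read as the potential SDC percentages before and after adding the out-of-bounds detector. That pairing is an interpretation of the chart labels, not something the slide states in words.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Potential SDC rate (%) for server workloads: (SWAT, SWAT + OoB).
server = {"permanents": (0.38, 0.23), "transients": (0.58, 0.28)}

reductions = {
    kind: round(relative_reduction(before, after))
    for kind, (before, after) in server.items()
}
print(reductions)
```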

Page 18

Application-Aware SDC Analysis

• Potential SDC = undetected faults that corrupt app output

• But many applications can tolerate faults

– Client may detect fault and retry request

– Application may perform fault-tolerant computations

* E.g., Same cost place & route, acceptable PSNR, etc.

⇒ Not all potential SDCs are true SDCs

- For each application, define notion of fault tolerance

• SWAT detectors cannot (should not?) detect such acceptable changes

Page 19

Application-Aware SDCs for Server

• 46% of potential SDCs are tolerated by simple retry

• Only 21 remaining SDCs out of 17,880 injected faults

– Most detectable through application-level validity checks

[Figure: number of faults (0 to 60) that are potential SDCs. Permanent Faults: SWAT 34 (0.38%), SWAT + OoB 21 (0.23%), w/ app tolerance 12 (0.13%). Transient Faults: SWAT 52 (0.58%), SWAT + OoB 25 (0.28%), w/ app tolerance 9 (0.10%)]
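
The counts and percentages on this slide are mutually consistent if the 17,880 injections are split evenly between permanent and transient faults. The even split is an assumption made here to reproduce the numbers; the slide gives only the total.

```python
INJECTED_TOTAL = 17_880
PER_TYPE = INJECTED_TOTAL // 2  # assumed even split: 8,940 per fault type

# Potential SDC counts read from the chart labels.
counts = {
    "permanent": {"SWAT": 34, "SWAT+OoB": 21, "w/ app tolerance": 12},
    "transient": {"SWAT": 52, "SWAT+OoB": 25, "w/ app tolerance": 9},
}


def pct(count: int, total: int) -> float:
    """Count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


rates = {
    kind: {cfg: pct(n, PER_TYPE) for cfg, n in row.items()}
    for kind, row in counts.items()
}
remaining = sum(row["w/ app tolerance"] for row in counts.values())
```

Under that assumption `rates` reproduces every percentage shown in the chart, and `remaining` is the 21 SDCs quoted in the second bullet (12 permanent plus 9 transient).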

Page 20

Application-Aware SDCs for SPEC

• Only 62 faults show >0% degradation from golden output

• Only 41 injected faults are SDCs at >1% degradation

– 38 from apps we conservatively classify as fault intolerant

* Chess playing apps, compilers, parsers, etc.

[Figure: number of faults (0 to 70) by output degradation threshold. Permanent Faults: SWAT+OoB 56 (0.6%), >0% 16 (0.2%), >0.01% 11 (0.1%), >1% 8 (0.1%). Transient Faults: SWAT+OoB 58 (0.6%), >0% 46 (0.5%), >0.01% 37 (0.4%), >1% 33 (0.4%)]

Page 21

Reducing Potential SDCs further (future work)

• Explore application-specific detectors

– Compiler-assisted invariants like iSWAT

– Application-level checks

• Need to fundamentally understand why, where SWAT works

– SWAT evaluation largely empirical

– Build models to predict effectiveness of SWAT

* Develop new low-cost symptom detectors

* Extract minimal set of detectors for given sets of faults

* Reliability vs overhead trade-offs analysis

Page 22

Reducing Detection Latency: New Definition

• SWAT relies on checkpoint/rollback for recovery

• Detection latency dictates fault recovery

– Checkpoint fault-free ⇒ fault recoverable

• Traditional defn. = arch state corruption to detection

• But software may mask some corruptions!

• New defn. = Unmasked arch state corruption to detection

[Figure: timeline from fault to detection. Old latency runs from bad arch state to detection; new latency runs from bad SW state to detection. A recoverable chkpt is marked for each definition]

Page 23

Measuring Detection Latency

• New detection latency = SW state corruption to detection

• But identifying SW state corruption is hard!

– Need to know how faulty value used by application

– If faulty value affects output, then SW state corrupted

• Measure latency by rolling back to older checkpoints

– Only for analysis, not required in real system

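[Figure: timeline from fault through bad arch state and bad SW state to detection, with the new latency measured from bad SW state. Two rollback-and-replay attempts are shown: from one chkpt the symptom recurs, from another the fault effect is masked]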
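
One way to read the measurement procedure on this slide is as a search over older and older checkpoints for the most recent one from which replay is clean. The sketch below is written under that reading; `replay_is_clean` is a hypothetical callback standing in for "roll back to this checkpoint, replay, and see whether the output is correct and no symptom appears", and the result is only as fine-grained as the checkpoint spacing.

```python
from typing import Callable, Optional, Sequence


def new_detection_latency(
    checkpoints: Sequence[int],
    detection_time: int,
    replay_is_clean: Callable[[int], bool],
) -> Optional[int]:
    """Estimate the SW-state-corruption-to-detection latency.

    checkpoints     : instruction counts at which checkpoints were taken
    detection_time  : instruction count at which the symptom fired
    replay_is_clean : hypothetical oracle, True if replay from that
                      checkpoint shows no corruption
    Returns detection_time minus the latest clean checkpoint, or None
    if no checkpoint is clean.
    """
    candidates = sorted(
        (c for c in checkpoints if c <= detection_time), reverse=True
    )
    for chkpt in candidates:  # newest first, walking back in time
        if replay_is_clean(chkpt):
            return detection_time - chkpt
    return None
```

For example, with checkpoints at 0, 100, 200, and 300 instructions, detection at 350, and software state first corrupted at 250, the newest clean checkpoint is 200 and the function reports 150. As the slide notes, this search is an analysis aid and would not run in a deployed system.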

Page 24

Detection Latency - SPEC

[Figure: two panels titled Permanent Faults in Server and Transient Faults in Server. Detected faults (40% to 100%) versus detection latency in instructions (<10k, <100k, <1m, <10m, >10m). Series: Old Latency SWAT]

Page 25

Detection Latency - SPEC

[Figure: two panels titled Permanent Faults in Server and Transient Faults in Server. Detected faults (40% to 100%) versus detection latency in instructions (<10k, <100k, <1m, <10m, >10m). Series: New Latency SWAT, Old Latency SWAT]

Page 26

Detection Latency - SPEC

• Measuring new latency important to study recovery

• New techniques significantly reduce detection latency

- >90% of faults detected in <100K instructions

• Reduced detection latency impacts recoverability

[Figure: two panels titled Permanent Faults in Server and Transient Faults in Server. Detected faults (40% to 100%) versus detection latency in instructions (<10k, <100k, <1m, <10m, >10m). Series: New Latency out-of-bounds, New Latency SWAT, Old Latency SWAT]

Page 27

Detection Latency - Server

• Measuring new latency important to study recovery

• New techniques significantly reduce detection latency

- >90% of faults detected in <100K instructions

• Reduced detection latency impacts recoverability

[Figure: two panels titled Permanent Faults in Server and Transient Faults in Server. Detected faults (40% to 100%) versus detection latency in instructions (<10k, <100k, <1m, <10m, >10m). Series: New Latency out-of-bounds, New Latency SWAT, Old Latency SWAT]

Page 28

Implications for Fault Recovery

• Checkpointing

– Record pristine arch state for recovery

– Periodic registers snapshot, log memory writes

• I/O buffering

– Buffer external events until known to be fault-free

– HW buffer records device reads, buffers device writes

“Always-on” ⇒ must incur minimal overhead

[Figure: checkpointing and I/O buffering together provide recovery]
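
The two mechanisms on this slide can be sketched together as a toy machine that snapshots registers, keeps an undo log of memory writes, and holds device writes back until the next checkpoint. All names are hypothetical, and releasing output at the very next checkpoint is a simplification chosen for brevity; a real scheme would hold output until the interval is known to be fault-free.

```python
_ABSENT = object()  # marks an address that had no value before the write


class CheckpointedMachine:
    """Toy model of register snapshots, a memory undo log, and output buffering."""

    def __init__(self) -> None:
        self.regs = {}
        self.mem = {}
        self.released_out = []   # device writes that have left the system
        self._reg_snapshot = {}
        self._undo_log = {}      # addr -> value before first write this interval
        self._pending_out = []   # device writes held back

    def checkpoint(self) -> None:
        # Simplification: treat the finished interval as fault-free.
        self.released_out.extend(self._pending_out)
        self._pending_out.clear()
        self._reg_snapshot = dict(self.regs)
        self._undo_log.clear()

    def store(self, addr: int, value) -> None:
        # Log only the first overwrite of each address per interval.
        if addr not in self._undo_log:
            self._undo_log[addr] = self.mem.get(addr, _ABSENT)
        self.mem[addr] = value

    def device_write(self, data) -> None:
        self._pending_out.append(data)

    def rollback(self) -> None:
        # Restore memory from the undo log, then registers; drop held output.
        for addr, old in self._undo_log.items():
            if old is _ABSENT:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self._undo_log.clear()
        self.regs = dict(self._reg_snapshot)
        self._pending_out.clear()
```

The undo log grows with the number of distinct addresses written per interval, which is why the next slides tie memory-log size to the checkpoint interval.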

Page 29

Overheads from Memory Logging

• New techniques reduce chkpt overheads by over 60%

– Chkpt interval reduced to 100K from millions of instrs.

[Figure: memory log size in KB (0 to 2500) versus checkpoint interval in instructions (10K, 100K, 1M, 10M) for apache, sshd, squid, and mysql]

Page 30

Overheads from Output Buffering

• New techniques reduce output buffer size to near-zero

– <5KB buffer for 100K chkpt interval (buffer for 2 chkpts)

– Near-zero overheads at 10K interval

[Figure: output buffer size in KB (0 to 20) versus checkpoint interval in instructions (10K, 100K, 1M, 10M) for apache, sshd, squid, and mysql]

Page 31

Low Cost Fault Recovery (future work)

• New techniques significantly reduce recovery overheads

– 60% in memory logs, near-zero output buffer

• But still do not enable ultra-low cost fault recovery

– ~400KB HW overheads for memory logs in HW (SafetyNet)

– High performance impact for in-memory logs (ReVive)

• Need ultra low-cost recovery scheme at short intervals

– Even shorter latencies

– Checkpoint only state that matters

– Application-aware insights – transactional apps, recovery domains for OS, …

Page 32

Fault Diagnosis

• Symptom-based detection is cheap but

– May incur long latency from activation to detection

– Difficult to diagnose root cause of fault

• Goal: Diagnose the fault with minimal hardware overhead

– Rarely invoked ⇒ higher perf overhead acceptable

[Figure: a symptom whose cause is unknown: SW bug, transient fault, or permanent fault?]

Page 33

SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]

• First, diagnosis for single threaded workload on one core

– Multithreaded w/ multicore later – several new challenges

Key ideas

• Single core fault model, multicore ⇒ fault-free core available

• Chkpt/replay for recovery ⇒ replay on good core, compare

• Synthesizing DMR, but only for diagnosis

[Figure: Traditional DMR compares P1 and P2 continuously: always on ⇒ expensive. Synthesized DMR runs P1 alone when fault-free and compares P1 with P2 only on a fault]

Page 34

SW Bug vs. Transient vs. Permanent

• Rollback/replay on same/different core

• Watch if symptom reappears

[Figure: diagnosis flow]

• Symptom detected ⇒ rollback on faulty core

– No symptom ⇒ transient or non-deterministic s/w bug ⇒ continue execution

– Symptom ⇒ deterministic s/w or permanent h/w bug ⇒ rollback/replay on good core

* No symptom ⇒ permanent h/w fault, needs repair!

* Symptom ⇒ deterministic s/w bug (send to s/w layer)
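
The flow on this slide is a two-step decision: replay on the suspect core first, and only if the symptom recurs replay on a known-good core. A minimal sketch, in which the two callables are hypothetical stand-ins for the rollback/replay machinery and return True when the symptom reappears:

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    DETERMINISTIC_SW_BUG = "send to s/w layer"
    PERMANENT_HW_FAULT = "needs repair"


def diagnose(
    replay_on_faulty_core: Callable[[], bool],
    replay_on_good_core: Callable[[], bool],
) -> Diagnosis:
    """Classify the cause of a detected symptom by rollback and replay."""
    if not replay_on_faulty_core():
        # Symptom did not recur on the same core.
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    if replay_on_good_core():
        # Recurs even on fault-free hardware, so the software is at fault.
        return Diagnosis.DETERMINISTIC_SW_BUG
    # Recurs only on the original core.
    return Diagnosis.PERMANENT_HW_FAULT
```

Taking callables rather than booleans keeps the second, more expensive replay from running unless the first one reproduces the symptom.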

Page 35

µarch-level Fault Diagnosis

[Figure: symptom detected ⇒ diagnosis ⇒ software bug, transient fault, or permanent fault. A permanent fault ⇒ microarchitecture-level diagnosis ⇒ “Unit X is faulty”]

Page 36

Trace Based Fault Diagnosis (TBFD)

• µarch-level fault diagnosis using rollback/replay

• Key: Execution caused symptom ⇒ trace activates fault

– Deterministically replay trace on faulty, fault-free cores

– Divergence ⇒ faulty hardware used ⇒ diagnosis clues

• Diagnose faults to µarch units of processor

– Check µarch-level invariants in several parts of processor

– Diagnosis in out-of-order logic (meta-datapath) complex
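
The core comparison step, replaying the same trace on both cores and looking for the first point of divergence, can be sketched as follows. The trace record format `(pc, destination, value)` is an assumption for illustration, and this sketch covers only the comparison, not the µarch-level invariant checks that localize the faulty unit.

```python
from itertools import zip_longest
from typing import Optional, Sequence, Tuple

Record = Tuple[int, str, int]  # hypothetical: (pc, destination register, value)


def first_divergence(
    faulty_trace: Sequence[Record],
    good_trace: Sequence[Record],
) -> Optional[Tuple[int, Optional[Record], Optional[Record]]]:
    """Return (index, faulty record, good record) at the first mismatch.

    A trace that ends early counts as diverging at that point, with None
    in place of the missing record. Returns None if the traces agree.
    """
    pairs = zip_longest(faulty_trace, good_trace)
    for i, (faulty, good) in enumerate(pairs):
        if faulty != good:
            return i, faulty, good
    return None
```

The instruction at the divergence point identifies which hardware resources were in use when the results first differed, which is the "diagnosis clue" the slide refers to.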

Page 37

Trace-Based Fault Diagnosis: Evaluation

• Goal: Diagnose faults at reasonable latency

• Faults diagnosed in 10 SPEC workloads

– ~8500 detected faults (98% of unmasked)

• Results

– 98% of the detections successfully diagnosed

– 91% diagnosed within 1M instr (~0.5ms on 2GHz proc)

Page 38

SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]

• Challenge 1: Deterministic replay involves high overhead

• Challenge 2: Multithreaded apps share data among threads

• Symptom causing core may not be faulty

• No known fault-free core in system

[Figure: a fault on one core reaches another core through a store to memory and a later load, so the symptom is detected on a fault-free core (Core 1 and Core 2 shown)]

Page 39

mSWAT Diagnosis - Key Ideas

Challenges

• Multithreaded applications

• Full-system deterministic replay

• No known good core

Key Ideas

• Isolated deterministic replay

• Emulated TMR

[Figure: threads TA, TB, TC, TD run on cores A, B, C, D; thread TA is singled out for replay]

Page 40

mSWAT Diagnosis - Key Ideas

Challenges

• Multithreaded applications

• Full-system deterministic replay

• No known good core

Key Ideas

• Isolated deterministic replay

• Emulated TMR

[Figure: threads TA, TB, TC, TD on cores A, B, C, D, then rotated across the cores for replay: TD TA TB TC, then TC TD TA TB]
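
The rotation in the figure gives every thread three executions on three different cores, so a majority vote per thread can point at the odd core out without knowing a good core in advance. A minimal sketch of that idea; the function names and the `run(core, thread)` callback, standing in for isolated deterministic replay of one thread on one core, are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, List, Set


def rotation_schedule(n: int, rounds: int = 3) -> List[List[int]]:
    """schedule[r][core] is the thread index run on `core` in round r.

    Round 0 is the original placement; each later round shifts every
    thread one core to the right, matching the figure's rotation.
    """
    return [[(core - r) % n for core in range(n)] for r in range(rounds)]


def find_faulty_cores(
    schedule: List[List[int]],
    run: Callable[[int, int], Hashable],
) -> Set[int]:
    """Return cores whose result for some thread disagrees with the majority."""
    results = {}  # thread -> list of (core, result)
    for placement in schedule:
        for core, thread in enumerate(placement):
            results.setdefault(thread, []).append((core, run(core, thread)))
    suspects: Set[int] = set()
    for outs in results.values():
        majority, votes = Counter(r for _, r in outs).most_common(1)[0]
        if votes * 2 > len(outs):  # only trust a strict majority
            suspects.update(core for core, r in outs if r != majority)
    return suspects
```

With four threads, `rotation_schedule(4)` yields the placements TA TB TC TD, TD TA TB TC, and TC TD TA TB on cores A to D. Under the single-faulty-core assumption, each thread sees at most one bad execution out of three, so the vote is always decisive.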

Page 41

mSWAT Diagnosis: Evaluation

• Diagnose detected perm faults in multithreaded apps

– Goal: Identify faulty core, TBFD for µarch-level diagnosis

– Challenges: Non-determinism, no fault-free core known

– ~4% of faults detected from fault-free core

• Results

– 95% of detected faults diagnosed

* All detections from fault-free core diagnosed

– 96% of diagnosed faults require <200KB buffers

* Can be stored in lower level cache ⇒ low HW overhead

• SWAT diagnosis can work with other symptom detectors

Page 42

Summary: SWAT works!

• Very low-cost detectors [ASPLOS’08, DSN’08]

– Low SDC rate, latency

• In-situ diagnosis [DSN’08]

• Accurate fault modeling [HPCA’09]

• Multithreaded workloads [MICRO’09]

• Application-Aware SWAT

– Even lower SDC, latency

[Figure: timeline from fault to error to detected symptom, with checkpoints, recovery, diagnosis, and repair, annotated with the contributions above]

Page 43

Future Work

• Formalization of when/why SWAT works

• Near zero cost recovery

• More server/distributed applications

• App-level, customizable resilience

• Other core and off-core parts in multicore

• Other fault models

• Prototyping SWAT on FPGA w/ Michigan

• Interaction with safe programming

• Unifying with s/w resilience

