Fault-Tolerant Computing – It’s Time to Cross the Layer for Cost-Effectiveness

Qiang XU

CUhk REliable computing laboratory (CURE)
Department of Computer Science & Engineering
The Chinese University of Hong Kong

Technology Scaling Continues…

Feature size shrinks to tens of atoms across!

Effects
– Manufacturing defects
– Process variation
– Transient errors from radiation
– Noise fluctuations
– Fragile devices with shortened lifetimes

Ever-Increasing Defect Density

• IBM’s 8-core Cell processor chips: 10-20% yield

• Testing is responsible for ensuring the quality of shipped products

Defective Chip Identification

In the Past …

[Figure: occurrence frequency of the GOOD population and the BAD population, separated by a decision threshold. Redrawn from [O’Neill-itc07]]

Where is the Decision Threshold?

Nowadays …

[Figure: occurrence frequency of the GOOD population and the BAD population, with a decision threshold that produces TEST ESCAPE and FALSE REJECT regions. Redrawn from [O’Neill-itc07]]

Manufacturing Test is NOT Reliable Any More!

Process variation

Func./test mode discrepancy

Current Solution for Yield Improvement

• Yield-driven redundancy
– Cisco’s 192-core Metro network processor contains 4 spares
– nVidia’s 128-core GeForce 8800 GPU can be degraded to 96-core version if some cores are faulty

• Simple solution but …
• More and more redundant circuitries are necessary
• Require precise offline testing
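The benefit of core-level redundancy can be estimated with a simple binomial model. The sketch below assumes that cores fail independently and with the same probability, and it ignores defects outside the cores; the per-core yield of 0.99 is a hypothetical figure, and only the 128 and 96 core counts come from the slide.

```python
from math import comb

def yield_with_spares(total_cores, required_cores, core_yield):
    """Probability that at least `required_cores` of `total_cores` are fault-free,
    assuming independent core failures with per-core yield `core_yield`."""
    return sum(
        comb(total_cores, k) * core_yield**k * (1.0 - core_yield) ** (total_cores - k)
        for k in range(required_cores, total_cores + 1)
    )

CORE_YIELD = 0.99  # hypothetical per-core yield, not taken from the slides

no_spares = yield_with_spares(128, 128, CORE_YIELD)
degradable = yield_with_spares(128, 96, CORE_YIELD)
print(f"all 128 cores must work: {no_spares:.3f}")
print(f"at least 96 of 128 work: {degradable:.3f}")
```

The same function shows the cost side of the argument: as the per-core yield drops, more spare cores are needed to hold the chip yield constant.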

Other Reliability Threats

• Hard errors (permanent)
– Time dependent dielectric breakdown (TDDB)
– Electromigration (EM)
– Negative bias temperature instability (NBTI)
– Stress migration (SM)

• Soft errors (transient)
– Alpha particles; Neutron

• Intermittent faults (burst for a period of time)

Hardware solution, again, more redundant circuitries!

The Impact of Reliability Threats with Scaling

[Figure: failure rate versus time, with the useful life region marked. Annotations: difficult burn-in, higher failure rate, faster aging.]

To Keep Scaling …

[Figure: cost per transistor versus year, with curves for transistor cost, reliability cost, and total cost.]

To Achieve Cost-Effective Scaling

Unlike in the old days, defective/vulnerable ICs will be shipped to customers!

Cross-layer solution as a remedy for resilient system design!

Cross-Layer Reliability

• Tolerate critical defects and soft/hard errors with high failure rates at hardware level

• Mask non-critical defects and soft/hard errors with low failure rates at Hw.-dependent software level

• Take advantage of error-tolerance at application level

[Figure: three layers: Applications, Hw.-dependent Sw., and Defective/Vulnerable ICs]

Key Questions in Cross-Layer Reliability

• @ Circuit level
– Which defects and soft/hard errors are critical enough to require hardware redundancy?
– Protect at which granularity?
– Traditional pass/fail testing methodology no longer stands; what would be the new metrics for testing?
– Ever-increasingly important online test and diagnosis

Differentiate the impact of various reliability threats and tackle them at different layers!

• @ Hardware-dependent software level
– How to model various hardware faults accurately at this level?
– How to allocate workloads intelligently to mitigate such errors?

• @ Application level
– How to take application reliability requirements into account?
– Is it possible to generalize such solutions?

• @ System level: low-cost resilient designs under performance, power, and reliability constraints
– How to monitor the system’s reliability changes?
– How do we evaluate the cross-layer reliability for the entire system?
– Can we separate the layers clearly with only FIT or BER information?


High-Level Lifetime Reliability Modeling and Simulation Framework

SPECIFICATION
– Functionality
– Expected service life
– Power consumption
– Area constraint
– Thermal issue
– …

IC DESIGN
• DPM / DTM: DVFS, timeout, thermal throttling, power gating, …
• Redundancy: level, quantity, …
• Task allocation: round-robin, energy-driven, …

The Challenge

• Wear-out effects of hard errors
• Reliability at a specific time point depends on
– current reliability-related factors (e.g., temperature)
– aging effects due to past usage

• Significant temperature variation
• Temperature simulation is time-consuming

[Figure: temperature variation example]

Only short simulation time is affordable!

The Challenge – Simulation Framework

• Apparently, it is not possible to trace temperature and aging-related execution parameters in a fine-grained manner throughout the entire lifetime

• What if we conduct coarse-grained tracing and compute lifetime reliability with average operational temperature?
– Ignoring temperature variation results in a lack of accuracy

• How to achieve efficient yet accurate lifetime reliability simulation with limited fine-grained trace information, when failure mechanisms follow arbitrary failure distributions?
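Why the average operational temperature is not enough can be seen in a small sketch. It assumes an Arrhenius-style dependence of the aging rate on temperature, which is a common model for wear-out mechanisms but is an assumption here rather than something the slides state; the activation energy and the temperature trace are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K
ACTIVATION_EV = 0.7       # hypothetical activation energy in eV

def arrhenius_rate(temp_kelvin):
    """Relative aging rate at a given temperature (Arrhenius form, unitless)."""
    return math.exp(-ACTIVATION_EV / (BOLTZMANN_EV * temp_kelvin))

# Hypothetical trace: the core alternates between a cool and a hot phase.
trace_kelvin = [320.0, 320.0, 320.0, 370.0, 370.0, 370.0]

mean_temp = sum(trace_kelvin) / len(trace_kelvin)
rate_at_mean_temp = arrhenius_rate(mean_temp)
mean_of_rates = sum(arrhenius_rate(t) for t in trace_kelvin) / len(trace_kelvin)

print(f"aging rate at average temperature: {rate_at_mean_temp:.3e}")
print(f"average of per-sample aging rates: {mean_of_rates:.3e}")
print(f"underestimation factor: {mean_of_rates / rate_at_mean_temp:.2f}")
```

Because the rate grows faster than linearly with temperature in this range, the hot phases dominate the accumulated aging, and evaluating the model at the average temperature underestimates it.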

Aging Rate Calculation

• The key issue is to compute a time-independent aging rate Ω effectively with limited fine-grained trace information
– Given a general failure distribution R(t), e.g., the Weibull distribution, express it as R(t) = R(Θ·Ω·t); we then have

[Equation not recoverable from the extracted slide]

• Two steps
– Derive a closed-form lifetime reliability function with time-varying operational states and temperature
– Extract the time-independent aging rate parameter from this function
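The idea of a time-independent aging rate can be sketched as follows. This is not the derivation used in the talk: it assumes a Weibull failure distribution, assumes that accumulated age adds up linearly across intervals (a cumulative-exposure assumption), assumes the short trace repeats over the whole lifetime, and omits the factor Θ because the slide text does not define it. The trace values and the shape parameter are hypothetical.

```python
import math

def aging_rate(trace):
    """Time-weighted average aging rate over a short trace.

    `trace` is a list of (duration, rate) pairs, where `rate` is the
    instantaneous aging rate (1 / characteristic life) in that interval.
    """
    total_time = sum(duration for duration, _ in trace)
    accumulated_age = sum(duration * rate for duration, rate in trace)
    return accumulated_age / total_time

def weibull_reliability(t, omega, shape):
    """R(t) = exp(-(omega * t) ** shape) for a Weibull failure distribution."""
    return math.exp(-((omega * t) ** shape))

# Hypothetical trace: 60% of the time idle, 40% busy and hot (rates per hour).
trace = [(0.6, 1 / 200_000), (0.4, 1 / 50_000)]
omega = aging_rate(trace)

for years in (1, 5, 10):
    hours = years * 8760
    print(f"{years:2d} years: R = {weibull_reliability(hours, omega, shape=2.0):.4f}")
```

Once Ω has been extracted from the short fine-grained trace, evaluating reliability at any point in the lifetime is a single function call, which is what makes the long time horizon affordable.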

Lifetime Reliability Simulation Framework – AgeSim

• Evaluate lifetime reliability under various usage strategies and workloads
– DPM / DTM
– Trigger mechanism
– Load-sharing strategy
– Redundancy scheme

• Applicable for any failure distribution

• Also outputs performance and energy consumption

Asymmetry-Aware Processor Allocation for Chip Multiprocessor

• Chip multiprocessor with increasing number of processor cores

• However, technology scaling also results in …
– Defective cores on-chip
– Cores with distinct performance

Asymmetric Chip Multiprocessor

• Performance-asymmetry
– Process variation: significant frequency deviation on a chip (up to 40%)
– Dynamic power-performance adaptation

• Topology-asymmetry
– Manufacturing defects
– Wearout effect

Hide Hardware Defects @ OS Level

[Figure: layer stack of Applications, OS, and Chip Multiprocessor; a unified topology (cores numbered 1 to 12) shown together with the underlying hardware. Legend: fault-free core, faulty core, router.]

Asymmetry-Aware Processor Allocation

• We propose two contiguous processor allocation methodologies with different computing power representations considering
– Performance including communication overhead
– Processor allocation time
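To make "contiguous processor allocation" concrete, here is a minimal sketch of one simple way to pick a connected group of usable cores. It is not either of the two methodologies proposed in the talk: it assumes a 2D mesh topology, treats all fault-free cores as equally fast, and uses a plain breadth-first search from the first workable starting core. The function name, the mesh size, and the fault locations are hypothetical.

```python
from collections import deque

def allocate_contiguous(rows, cols, unavailable, size):
    """Return `size` connected fault-free cores of a rows x cols mesh, or None.

    Cores are (row, col) tuples; `unavailable` holds faulty or busy cores.
    """
    for start_row in range(rows):
        for start_col in range(cols):
            start = (start_row, start_col)
            if start in unavailable:
                continue
            # Breadth-first search keeps the region compact around `start`.
            region, queue, seen = [], deque([start]), {start}
            while queue and len(region) < size:
                row, col = queue.popleft()
                region.append((row, col))
                for nxt in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                    in_mesh = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    if in_mesh and nxt not in unavailable and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            if len(region) == size:
                return region
    return None

faulty = {(0, 1), (1, 1)}  # hypothetical faulty cores in a 3 x 4 mesh
allocation = allocate_contiguous(3, 4, faulty, size=4)
print(allocation)
```

A performance-aware variant would weight each core by its achievable frequency and account for communication distance inside the region, which is where the trade-off between allocation quality and allocation time arises.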

System Load = Mean Application Service Rate / Mean Application Arrival Rate

Thank you for your attention!
