
Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs Dan Fisher, Addison Floyd.

Date post: 29-Mar-2015

Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs

Dan Fisher, Addison Floyd


Outline

• Introduction

• Fault Detection - Motivation, Methods, etc.

• Fault Diagnosis - Motivation, Methods, etc.

• Fault Tolerance

o Single FPGA

o Multiple FPGAs

o Single Faults

o Multiple Faults

• Conclusion


Introduction

• FPGA Background

• Importance

• Applications

• Motivation for Fault Tolerance

http://en.wikipedia.org/wiki/Field-programmable_gate_array


Fault Detection - Motivation

Main Causes of Faults

• Degradation

• Manufacturing Defects

• Single Event Upsets (SEUs)


Fault Detection - Judgement Criteria

Detection Methods are judged on:

• Speed of Detection

• Coverage

• Resource Overhead

• Performance Overhead

• Detection Granularity


Fault Detection - Criteria In-Depth

Detection Granularity - how precisely a detected fault can be localized within the FPGA.

An FPGA is made up of tiles containing:

• Logic Blocks

• Connection Blocks - connect tiles

• Switch Blocks - connect tiles, allow for direction change


Fault Detection - Comparison


Fault Detection - SEDC Method

o The Method Explained

• Partition data and encode with SEDC codes

• Calculate and Store check bits

• Generate check bits as circuit operates

• Compare calculated and generated values

o Better than Berger and TMR
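The four steps above can be sketched as follows. This is only an illustration under stated assumptions: one even-parity bit per partition stands in for the SEDC code (the slide does not give the actual encoding), and `partition`, `check_bits`, `circuit`, and the 4-bit partition width are hypothetical.

```python
def partition(word, width, part_bits=4):
    """Split a `width`-bit word into `part_bits`-bit chunks, LSB chunk first."""
    mask = (1 << part_bits) - 1
    return [(word >> i) & mask for i in range(0, width, part_bits)]

def check_bits(word, width, part_bits=4):
    """One even-parity bit per partition (a stand-in, NOT the real SEDC code)."""
    return [bin(p).count("1") & 1 for p in partition(word, width, part_bits)]

def circuit(a, b):
    """Toy circuit under check: bitwise AND of two 8-bit inputs."""
    return a & b & 0xFF

a, b = 0b10110110, 0b11010011

# Steps 1-2: partition the expected output, then calculate and store check bits.
stored = check_bits(circuit(a, b), 8)

# Step 3: generate check bits from the output observed while the circuit runs.
# A single-bit fault is injected here to show the mismatch.
observed = circuit(a, b) ^ 0b00000100
generated = check_bits(observed, 8)

# Step 4: compare the stored and generated check bits.
error_detected = stored != generated
```

With a plain parity stand-in, an even number of bit flips inside one partition would go unnoticed; a real code is chosen to avoid or bound such escapes.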


Fault Detection - Nazar Method

• CED (Concurrent Error Detection) method providing single error detection

• Takes advantage of properties of LUTs

• Major Drawback - LUT insertion

• Area Improvement over DWC (Duplication with Comparison)


Nazar Method - LUT Properties Explained*

1st Advantage: A LUT can be viewed as a combinational circuit independent of the others. Area overhead is avoided because the sub-expressions that form the circuit outputs do not need to be replicated.

2nd Advantage: A K-input LUT can compute any function of up to K inputs. So as long as the selected group of LUTs has no more than K distinct inputs, its parity can be calculated using just one LUT. If the selected group also has no more than K-1 distinct outputs, then the checker can be made of just one LUT (with the parity bit as the last input).

This picture shows LUTs as upside-down triangles, with one parity LUT for each group of K-1 outputs. Also shown is the checker, which would be composed of just one LUT. LUTs in the same checker group can't overlap (otherwise they wouldn't be independent), but to provide coverage, LUTs in different checker groups can overlap.

*Note: This slide wasn't in the original presentation but was added to better explain the method, since some mentioned wanting to know more.
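The one-parity-LUT, one-checker-LUT idea can be sketched in Python. This is a minimal model, not the authors' implementation: K = 4, the three example functions, and the helper names `make_lut`, `read_lut`, and `check` are all assumptions made for illustration.

```python
from itertools import product

K = 4  # assumed LUT input count

def make_lut(fn, k=K):
    """Truth table (tuple of 0/1) for a k-input Boolean function."""
    return tuple(fn(*bits) & 1 for bits in product((0, 1), repeat=k))

def read_lut(lut, bits):
    """Look up a LUT output; the first bit is the most significant index bit."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return lut[index]

# A group of K-1 = 3 LUTs that together use no more than K = 4 distinct inputs.
group = [
    make_lut(lambda a, b, c, d: a & b),
    make_lut(lambda a, b, c, d: c | d),
    make_lut(lambda a, b, c, d: a ^ d),
]

# One parity LUT predicts the XOR of the group's outputs from the same inputs.
parity_lut = make_lut(lambda *x: sum(read_lut(lut, x) for lut in group))

# One checker LUT: K-1 group outputs plus the parity bit as the last input.
checker_lut = make_lut(lambda o0, o1, o2, p: o0 ^ o1 ^ o2 ^ p)

def check(inputs, fault=None):
    """Return 1 if the checker flags an error; `fault` flips one LUT output."""
    outs = [read_lut(lut, inputs) for lut in group]
    if fault is not None:
        outs[fault] ^= 1
    return read_lut(checker_lut, (*outs, read_lut(parity_lut, inputs)))
```

For any input, `check(inputs)` returns 0 and `check(inputs, fault=i)` returns 1, which is the single-error detection property; two simultaneous output errors in the same group would cancel in the parity.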


Fault Detection - Roving STARs

o New method for online detection

o Detected faults do not affect working logic

o STARs and BISTERs

o Better than other methods

*Picture added after presentation to attempt to help clear up any confusion.


Fault Detection - Injection Topic 1

• Which modules are most sensitive to SEUs

• 1.4% sensitive (83% routing / 16% logic)

• Density matrix
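A fault-injection campaign of this kind can be sketched as a loop that flips one configuration bit at a time and compares the outputs against a fault-free run. The sketch below assumes sensitivity means "fraction of bit flips that change the outputs"; the toy design, `run`, and `sensitive_bits` are hypothetical, and its numbers are unrelated to the 1.4% figure above.

```python
from itertools import product

def run(config, a, b):
    """Toy design: a 16-entry LUT whose two upper inputs are tied to 0,
    so only config[0..3] can ever influence the output."""
    return config[(a << 1) | b]

def sensitive_bits(golden_config, vectors):
    """Return indices of configuration bits whose upset changes any output."""
    golden = [run(golden_config, a, b) for a, b in vectors]
    hits = []
    for i in range(len(golden_config)):
        faulty = list(golden_config)
        faulty[i] ^= 1  # emulate a single SEU in configuration memory
        if [run(faulty, a, b) for a, b in vectors] != golden:
            hits.append(i)
    return hits

config = [0, 1, 1, 0] + [0] * 12           # XOR in the used entries, rest unused
vectors = list(product((0, 1), repeat=2))  # exhaustive input vectors
hits = sensitive_bits(config, vectors)
fraction = len(hits) / len(config)
```

Recording which region each sensitive bit belongs to (routing versus logic, or a position on the die) is what would turn the list of hits into a per-module breakdown or a density map.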


Fault Detection - Injection Topic 2

• HW module to test efficiency of SEU mitigation schemes

• How to emulate SEUs - 2 step process

• Example Results

• Scrubbing Rate


Fault Diagnosis - Roving STARs

• Diagnose both interconnect & PLB faults

• Partial Reuse

• Future - Do we allow for retest of fault?


Fault Diagnosis - More Abramovici

• BIST-based method in 2000

• 2004 paper further extending Roving STARs


Fault Diagnosis - Niamat - MATS++

• Diagnose multiple stuck-at faults

• Use of MATS++ algorithm

• Goal of speeding up diagnosis
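MATS++ is a standard march test with three elements: write 0 to every cell in any order, then ascending (read 0, write 1), then descending (read 1, write 0, read 0). A minimal sketch follows; treating the memory under test as a 16-cell array (for example, a LUT's contents) is an assumption, and `FaultyMemory` and `mats_pp` are hypothetical names. The slide does not say how the method maps the test onto the FPGA.

```python
class FaultyMemory:
    """Bit-addressable memory model with optional stuck-at cells."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}  # address -> value the cell is stuck at

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

def mats_pp(mem, size):
    """Run MATS++ and return the sorted addresses whose reads mismatched."""
    failing = set()

    def expect(addr, value):
        if mem.read(addr) != value:
            failing.add(addr)

    for addr in range(size):            # M0: any order, w0
        mem.write(addr, 0)
    for addr in range(size):            # M1: ascending, r0 then w1
        expect(addr, 0)
        mem.write(addr, 1)
    for addr in reversed(range(size)):  # M2: descending, r1, w0, r0
        expect(addr, 1)
        mem.write(addr, 0)
        expect(addr, 0)
    return sorted(failing)

# Two simultaneous stuck-at faults: cell 3 stuck at 1, cell 10 stuck at 0.
faulty_cells = mats_pp(FaultyMemory(16, {3: 1, 10: 0}), 16)
```

Because every cell is read in both states, each stuck-at cell fails at its own address, so several stuck-at faults can be located in a single pass.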


Fault Diagnosis - Tahoori’s Method

• Diagnose a single fault in interconnect or logic

• Application Dependent

• Basic Idea


Fault Tolerance

• Single FPGA platform

• Multi FPGA platform

• Single Fault

• Multiple Faults


Fault Tolerance - Single FPGA

Dynamic Fault Tolerance via Partial Reconfiguration

● online - handles faulty PLBs without system stopping

● uses spare logic cells

Stroud et al


Fault Tolerance - Single FPGA

Online Fault Tolerance for FPGA Logic Blocks

● reuse defective blocks to increase the number of spares and extend mission life

● uses commercial CAD tools to implement

Stroud et al


Fault Tolerance - Single FPGA

Using Relocatable Bitstreams for Fault Tolerance

● combines passive and active techniques

● standardized relocatable modules, which are copied and stored

Montminy et al


Fault Tolerance - Multi FPGA

A Reliable Reconfiguration Controller for Fault-Tolerant Embedded Systems on Multi-FPGA platforms

● multiple FPGAs in a mesh topology

● hardening achieved by TMR

● distributed solution

Bolchini et al


Fault Tolerance - Single Fault

Designing Fault Tolerant Systems into SRAM-based FPGAs

● for use in space

● Duplication with Comparison and Concurrent Error Detection

Lima et al
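Duplication with Comparison on its own can be sketched as two copies of a module plus a comparator. This is a generic illustration, not the design from the paper; the 4-bit adder, the injected stuck-at fault, and the name `dwc` are hypothetical.

```python
def adder(inputs):
    """Fault-free 4-bit adder module."""
    a, b = inputs
    return (a + b) & 0xF

def faulty_adder(inputs):
    """Same adder with output bit 0 stuck at 1 (hypothetical fault)."""
    return adder(inputs) | 0x1

def dwc(copy_a, copy_b, inputs):
    """Run both copies and flag any mismatch between their outputs.
    A mismatch shows that one copy is wrong, but not which one."""
    out_a, out_b = copy_a(inputs), copy_b(inputs)
    return out_a, out_a != out_b

ok = dwc(adder, adder, (3, 4))               # copies agree
detected = dwc(adder, faulty_adder, (3, 5))  # 8 vs 9: mismatch flagged
masked = dwc(adder, faulty_adder, (3, 4))    # 7 vs 7: fault not excited
```

The last case shows why the comparison has to run concurrently with normal operation: a fault is only visible on inputs that excite it.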


Fault Tolerance - Single Fault

TMR and Partial Dynamic Reconfiguration to Mitigate SEU Faults in FPGAs

● passive Triple Modular Redundancy

Bolchini et al
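Passive TMR masks a fault by majority-voting over three copies of a module. A minimal bitwise sketch follows; `vote`, `tmr`, the example module, and the XOR-based fault injection are assumptions made for illustration.

```python
def vote(a, b, c):
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def tmr(module, inputs, faults=(0, 0, 0)):
    """Run three copies of `module`; `faults` XORs an error mask into each copy."""
    outs = [module(inputs) ^ mask for mask in faults]
    return vote(*outs)

def module(x):
    """Example 8-bit module: multiply by 3."""
    return (x * 3) & 0xFF

good = tmr(module, 5)                            # no faults
masked = tmr(module, 5, faults=(0, 0x10, 0))     # one faulty copy is outvoted
broken = tmr(module, 5, faults=(0x10, 0x10, 0))  # two copies agree on a wrong bit
```

A single faulty copy is masked without any reconfiguration, but a second fault on the same bit defeats the voter, which is the motivation for pairing TMR with partial dynamic reconfiguration to repair the faulty copy.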


Fault Tolerance - Single Fault

IPR: In-Place Reconfiguration for FPGA Fault Tolerance

● preserves function and topology of LUT-based logic network

● algorithm applied post-layout

Feng et al


Fault Tolerance - Single Fault

A Novel SRAM-Based FPGA Architecture for Efficient TMR Fault Tolerance Support

● Architectural level

● augments LUTs with TMR

● minimize number of reconfigurations

Kyriakoulakos et al


Fault Tolerance - Multiple Faults

Placement of Repair Circuits for In-Field FPGA Repair

● utilize unused FPGA resources

● repair circuits identified before faults occur

● alternate repair circuits cached locally or remotely

Wirthlin et al


Fault Tolerance - Multiple Faults

Reconfigurable Fault Tolerance: A Comprehensive Framework for Reliable and Adaptive FPGA-Based Space Computing

● dynamic self-adaptation

● high reliability vs. high performance

Jacobs et al


Fault Tolerance - Multiple Faults

Exploiting Partially Defective LUTs: Why You Don’t Need Perfect Fabrication

● because of shrinking feature size, transistor variability and failure rates are going up

● identifies partially defective LUTs for reuse

DeHon et al


Conclusion

• Importance of FPGAs

• FPGA applications

• Future of FPGA fault tolerance


Questions?

