EE1411
System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1
Chapter 3
Fault-Tolerant Design

What is this chapter about?
  Gives overview of fault-tolerant design
  Focus on
    – Basic concepts in fault-tolerant design
    – Metrics used to specify and evaluate dependability
    – Review of coding theory
    – Fault-tolerant design schemes: hardware redundancy, information redundancy, time redundancy
    – Examples of fault-tolerant applications in industry

Fault-Tolerant Design (chapter outline)
  Introduction
  Fundamentals of Fault Tolerance
  Fundamentals of Coding Theory
  Fault Tolerant Schemes
  Industry Practices
  Concluding Remarks

Introduction
  Fault tolerance
    – Ability of a system to continue error-free operation in the presence of an unexpected fault
  Important in mission-critical applications
    – E.g., medical, aviation, banking
    – Errors are very costly
  Becoming important in mainstream applications
    – Technology scaling is making circuit behavior less predictable and more prone to failures
    – Fault tolerance is needed to keep the failure rate within acceptable levels

Faults
  Permanent faults
    – Due to manufacturing defects, early life failures, wearout failures
    – Wearout failures are due to various mechanisms, e.g., electromigration, hot carrier degradation, dielectric breakdown
  Temporary faults
    – Only present for a short period of time
    – Caused by external disturbance or marginal design parameters

Temporary Faults
  Transient errors (non-recurring errors)
    – Caused by external disturbance, e.g., radiation, noise, power disturbance
  Intermittent errors (recurring errors)
    – Caused by marginal design parameters
    – Timing problems, e.g., races, hazards, skew
    – Signal integrity problems, e.g., crosstalk, ground bounce

Redundancy
  Fault tolerance requires some form of redundancy
    – Time redundancy
    – Hardware redundancy
    – Information redundancy

Time Redundancy
  Perform same operation twice
    – Check whether the same result is obtained both times
    – If not, then a fault occurred
    – Can detect temporary faults
    – Cannot detect permanent faults (they would affect both computations)
  Advantage
    – Little to no hardware overhead
  Disadvantage
    – Impacts system or circuit performance

Hardware Redundancy
  Replicate hardware and compare outputs
    – From two or more modules
    – Detects both permanent and temporary faults
  Advantage
    – Little or no performance impact
  Disadvantage
    – Area and power for redundant hardware

Information Redundancy
  Encode outputs with error detecting or correcting code
    – Code selected to minimize redundancy for the targeted class of faults
  Advantage
    – Less hardware to generate redundant information than replicating the module
  Drawback
    – Added complexity in design

Failure Rate
  $\lambda(t)$ = component failure rate
    – Measured in FITs (failures per $10^9$ hours)

System Failure Rate
  System constructed from components
  No fault tolerance
    – If any component fails, the whole system fails
  $\lambda_{sys} = \sum_{i=1}^{k} \lambda_{c,i}$

Reliability
  If component working at time 0
    – R(t) = probability it is still working at time t
  Exponential failure law
    – Applies if failure rate is assumed constant
    – Good approximation if past the infant mortality period
  $R(t) = e^{-\lambda t}$

Reliability for Series System
  Series system
    – All components need to work for the system to work
  [Figure: components A, B, C connected in series]
  $R_{sys} = R_A R_B R_C$

System Reliability with Redundancy
  System reliability with component B in parallel
    – Can tolerate one component B failing
  [Figure: A in series with two parallel copies of B, followed by C]
  $R_{sys} = R_A \left(1 - (1 - R_B)^2\right) R_C = R_A (2R_B - R_B^2) R_C$
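
As a rough illustration of the series and parallel formulas above, the following Python sketch evaluates both configurations. The function names and the numeric reliabilities are assumptions chosen for the example, not values from the chapter.

```python
def series(*reliabilities):
    """Series system: every component must work, so reliabilities multiply."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """Parallel group: fails only if every copy fails."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


# Assumed example values for R_A, R_B, R_C.
r_a, r_b, r_c = 0.95, 0.90, 0.95

r_plain = series(r_a, r_b, r_c)                       # A - B - C
r_redundant = series(r_a, parallel(r_b, r_b), r_c)    # A - (B || B) - C

print(f"series only:      {r_plain:.4f}")
print(f"B duplicated:     {r_redundant:.4f}")
```

Duplicating only the least reliable component is often the cheapest way to raise system reliability, which is what the second call models.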

Mean-Time-to-Failure (MTTF)
  Average time before system fails
    – Equal to area under the reliability curve
  $MTTF = \int_0^{\infty} R(t)\,dt$
  For exponential failure law
  $MTTF = \int_0^{\infty} e^{-\lambda t}\,dt = \frac{1}{\lambda}$

Maintainability
  If system failed at time 0
    – M(t) = probability it is repaired and operational at time t
  System repair time divided into
    – Passive repair time: time for service engineer to travel to site
    – Active repair time: time to locate failing component, repair/replace it, and verify system is operational
    – Active repair time can be improved by designing the system so that it is easy to locate the failed component and verify operation

Repair Rate and MTTR
  $\mu$ = rate at which system is repaired
    – Analogous to failure rate $\lambda$
  Maintainability often modeled as
  $M(t) = 1 - e^{-\mu t}$
  Mean-Time-to-Repair (MTTR) = $1/\mu$

Availability
  System availability
    – Fraction of time system is operational
  [Figure: system state S (1 = normal system operation, 0 = failed) versus time, t0 to t4, with failures marked]
  $\text{system availability} = \frac{MTTF}{MTTF + MTTR}$

Availability
  Telephone systems
    – Required to have system availability of 0.9999 ("four nines")
  High-reliability systems
    – May require 7 or more nines
  Fault-tolerant design
    – Needed to achieve such high availability from less reliable components
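
To make the metrics above concrete, here is a minimal sketch that converts a FIT rate into an MTTF (assuming the exponential failure law) and evaluates availability. The helper names and the sample numbers are illustrative assumptions.

```python
def mttf_from_fit(fit):
    """MTTF in hours for a constant failure rate given in FITs
    (failures per 1e9 hours); MTTF = 1 / lambda."""
    return 1e9 / fit


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Assumed example: a 1000-FIT component and a 1-hour repair time.
mttf = mttf_from_fit(1000)            # 1,000,000 hours
print(availability(mttf, 1.0))

# "Four nines" corresponds to, e.g., MTTF = 9999 h with MTTR = 1 h.
print(availability(9999, 1))
```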

Coding Theory
  Coding
    – Using more bits than necessary to represent data
    – Provides a way to detect errors
    – Errors occur when bits get flipped
  Error detecting codes
    – Many types
    – Detect different classes of errors
    – Use different amounts of redundancy
    – Ease of encoding and decoding data varies

Block Code
  Message = data being encoded
  Block code
    – Encodes m messages with n-bit codewords
  If no redundancy
    – m messages encoded with $\lceil \log_2 m \rceil$ bits, the minimum possible
  $redundancy = 1 - \frac{\log_2 m}{n}$

Block Code
  To detect errors, some redundancy is needed
    – Space of $2^n$ distinct blocks partitioned into codewords and non-codewords
  Can detect errors that cause a codeword to become a non-codeword
  Cannot detect errors that cause a codeword to become another codeword

Separable Block Code
  Separable
    – n-bit blocks partitioned into k information bits directly representing the message and (n-k) check bits
    – Denoted (n,k) block code
  Advantage
    – k-bit message directly extracted without decoding
  Rate of separable block code = k/n

Example of Separable Block Code
  (4,3) parity code
    – Check bit is XOR of the 3 message bits
    – Message 101 gives codeword 1010
  Single-bit parity (k = n-1)
  $redundancy = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 (2^k)}{n} = 1 - \frac{k}{n} = 1 - \frac{n-1}{n} = \frac{1}{n}$
  $rate = \frac{k}{n} = \frac{n-1}{n}$

Example of Non-Separable Block Code
  One-hot code
    – Each codeword has a single 1
    – Example of 8-bit one-hot: 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001
    – Redundancy = 1 - log2(8)/8 = 5/8
  $redundancy = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 n}{n}$

Linear Block Codes
  Special class
    – Modulo-2 sum of any 2 codewords is also a codeword
    – Null space of an (n-k) x n Boolean matrix, called the parity check matrix, H
  For any n-bit codeword c
    – $cH^T = 0$
    – All-0 codeword exists in any linear code

Linear Block Codes
  Generator matrix, G
    – k x n matrix
  Codeword c for message m
    – $c = mG$
  $GH^T = 0$

Systematic Block Code
  First k bits correspond to message
  Last n-k bits correspond to check bits
  For systematic code
    – $G = [\, I_{k \times k} : P_{k \times (n-k)} \,]$
    – $H = [\, P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)} \,]$ (this ordering is what makes $GH^T = P + P = 0$)
  Example
  $G = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad H = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$
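
The relations c = mG and GH^T = 0 can be checked numerically. The sketch below uses the (4,3) parity code matrices from the example; the helper names `mat_mul_gf2` and `transpose` are made up for this illustration.

```python
def mat_mul_gf2(a, b):
    """Matrix product over GF(2): AND for multiply, XOR (sum mod 2) for add."""
    return [[sum(x & y for x, y in zip(row, col)) % 2 for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# (4,3) single-bit parity code in systematic form.
G = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
H = [[1, 1, 1, 1]]

message = [[1, 0, 1]]
codeword = mat_mul_gf2(message, G)[0]
print(codeword)                          # message bits followed by the parity bit

print(mat_mul_gf2(G, transpose(H)))      # all zeros if G and H are consistent
print(mat_mul_gf2([codeword], transpose(H)))
```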

Distance of Code
  Distance between two codewords
    – Number of bits in which they differ
  Distance of code
    – Minimum distance between any two codewords in the code
    – If n = k (no redundancy), distance = 1
    – Single-bit parity, distance = 2
  Code with distance d
    – Detects up to d-1 errors
    – Corrects up to $\lfloor (d-1)/2 \rfloor$ errors

Error Correcting Codes
  Code with distance 3
    – Called single error correcting (SEC) code
  Code with distance 4
    – Called single error correcting and double error detecting (SEC-DED) code
  Procedure for constructing SEC code
    – Described in [Hamming 1950]
    – Any H-matrix with all columns distinct and no all-0 column is SEC

Hamming Code
  For any value of n
    – SEC code constructed by setting each column in H equal to the binary representation of its column number (starting from 1)
    – Number of rows in H equal to $\lceil \log_2 (n+1) \rceil$
  Example of SEC Hamming code for n = 7
  $H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$

Error Correction in Hamming Code
  Syndrome, s
    – $s = Hv^T$ for received vector v
  If v is a codeword
    – Syndrome = 0
  If v is a non-codeword with a single-bit error
    – Syndrome will match one of the columns of H
    – Syndrome will contain the binary value of the bit position in error

Example of Error Correction
  For (7,4) Hamming code (the 3-row H-matrix above, so k = 7 - 3 = 4)
    – Suppose codeword 0110011 has a one-bit error changing it to 1110011
  $s = vH^T = [1\,1\,1\,0\,0\,1\,1] \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} = [0\,0\,1]$
    – Syndrome 001 points to bit position 1 as the bit in error
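
The syndrome calculation in this example can be reproduced with a short script. This is a minimal sketch assuming bit position 1 is the leftmost bit of the received vector; the function names are hypothetical.

```python
def hamming_h(n):
    """H-matrix whose column j (1-based) is the binary representation of j,
    most significant bit in the top row."""
    rows = n.bit_length()                      # equals ceil(log2(n + 1))
    return [[(col >> (rows - 1 - r)) & 1 for col in range(1, n + 1)]
            for r in range(rows)]


def syndrome(h, v):
    """s = H v^T over GF(2)."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in h]


def correct_single_error(v):
    """Return (corrected vector, error position); position 0 means no error."""
    s = syndrome(hamming_h(len(v)), v)
    position = int("".join(str(bit) for bit in s), 2)
    fixed = list(v)
    if position:
        fixed[position - 1] ^= 1               # flip the bit the syndrome points to
    return fixed, position


received = [1, 1, 1, 0, 0, 1, 1]               # codeword 0110011 with bit 1 flipped
print(syndrome(hamming_h(7), received))
print(correct_single_error(received))
```

Note that this decoder silently miscorrects double-bit errors, which is the motivation for the SEC-DED extension.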

SEC-DED Code
  Make SEC Hamming code SEC-DED
    – By adding a parity check over all bits
  Extra parity bit of the syndrome
    – 1 for single-bit error
    – 0 for double-bit error
  Makes it possible to detect a double-bit error
    – Avoids assuming a single-bit error and miscorrecting it
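
Example of Error Correction
  For (7,3) SEC-DED Hamming code (4-row H-matrix: the three Hamming rows plus an all-1 parity row)
    – Suppose codeword 0110011 has a two-bit error changing it to 1010011
  $s = vH^T = [1\,0\,1\,0\,0\,1\,1] \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = [0\,1\,1\,0]$
    – Syndrome is non-zero but its parity bit is 0, so it doesn't match any column in H: double-bit error detected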
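
The SEC-DED decision rule can be sketched as follows, assuming the 7-bit code of this example (three Hamming check rows plus one overall parity row). The function name and the returned strings are illustrative assumptions.

```python
def secded_classify(v):
    """Classify a 7-bit received vector for the (7,3) SEC-DED Hamming code."""
    # Three Hamming syndrome bits: row for bit b covers columns whose
    # column number (1..7) has bit b set.
    hamming_part = [
        sum(((col >> bit) & 1) & v[col - 1] for col in range(1, 8)) % 2
        for bit in (2, 1, 0)
    ]
    overall_parity = sum(v) % 2          # all-1 row of H
    if not any(hamming_part) and overall_parity == 0:
        return "no error"
    if overall_parity == 1:
        return "single-bit error (correctable)"
    return "double-bit error (detected, not correctable)"


print(secded_classify([0, 1, 1, 0, 0, 1, 1]))   # valid codeword
print(secded_classify([1, 1, 1, 0, 0, 1, 1]))   # one bit flipped
print(secded_classify([1, 0, 1, 0, 0, 1, 1]))   # two bits flipped
```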

Hsiao Code
  Weight of column
    – Number of 1's in the column
  Constructing n-bit SEC-DED Hsiao code
    – First use all possible weight-1 columns
    – Then all possible weight-3 columns
    – Then weight-5 columns, etc., until n columns are formed
    – Number of check bits is $\lceil \log_2 n \rceil + 1$ (e.g., 4 check bits for n = 7), since only odd-weight columns are used
  Minimizes number of 1's in H-matrix
    – Less hardware and delay for computing syndrome
    – Disadvantage: correction logic more complex

Example of Hsiao Code
  (7,3) Hsiao code
    – Uses weight-1 and weight-3 columns
  $H = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$

Unidirectional Errors
  Errors in a block of data which only cause 0→1 or 1→0, but not both
    – Any number of bits in error in one direction
  Example
    – Correct codeword 111000
    – Unidirectional errors could cause 001000, 000000, 101000 (only 1→0 errors)
    – Non-unidirectional errors: 101001, 011001, 011011 (both 1→0 and 0→1)

Unidirectional Error Detecting Codes
  All unidirectional error detecting (AUED) codes
    – Detect all unidirectional errors in a codeword
  Single-bit parity is not AUED
    – Cannot detect an even number of errors
  No linear code is AUED
    – All linear codes must contain the all-0 vector, so they cannot detect all 1→0 errors

Two-Rail Code
  Two-rail code
    – One check bit for each information bit
    – Check bit equal to complement of information bit
  Two-rail code is AUED
    – 50% redundancy
  Example of (6,3) two-rail code
    – Message 101 has codeword 101010
    – Set of all codewords: 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000

Berger Codes
  Lowest redundancy of separable AUED codes
    – For k information bits, $\lceil \log_2 (k+1) \rceil$ check bits
  Check bits equal to binary representation of the number of 0's in the information bits
  Example
    – Information bits 1000101
    – $\log_2 (7+1) = 3$ check bits
    – Check bits equal to 100 (4 zeros)

Berger Codes
  Codewords for (5,3) Berger code
    – 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100
  If unidirectional errors
    – Contain 1→0 errors: they increase the 0's in the information bits and can only decrease the binary number in the check bits
    – Contain 0→1 errors: they decrease the 0's in the information bits and can only increase the binary number in the check bits

Berger Codes
  If 8 information bits
    – Berger code requires $\lceil \log_2 (8+1) \rceil = 4$ check bits
  $redundancy = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 (2^8)}{12} = 1 - \frac{8}{12} = \frac{1}{3} \approx 33.3\%$
  (16,8) two-rail code
    – Requires 50% redundancy
  Redundancy advantage of Berger code
    – Increases as k is increased

Constant Weight Codes
  Constant weight codes
    – Non-separable, but lower redundancy than Berger
    – Each codeword has the same number of 1's
  Example 2-out-of-3 constant weight code
    – 110, 011, 101
  AUED code
    – Unidirectional errors always change the number of 1's

Constant Weight Codes
  Number of codewords in m-out-of-n code
  $C^n_m = \binom{n}{m}$
  Codewords maximized when m is as close to n/2 as possible
    – n/2-out-of-n when n even
    – (n/2-0.5 or n/2+0.5)-out-of-n when n odd
    – Minimizes redundancy of code

Example
  6-out-of-12 constant weight code
  $C^{12}_6 = 924$ codewords
  $redundancy = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 (924)}{12} = 17.9\%$
  12-bit Berger code
    – Only $2^8 = 256$ codewords
  $redundancy = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 (2^8)}{12} = 33.3\%$
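
The two redundancy figures can be recomputed directly. A small sketch, with `redundancy` as a hypothetical helper implementing 1 - log2(m)/n:

```python
import math


def redundancy(num_codewords, n):
    """Redundancy of an n-bit block code with the given number of codewords."""
    return 1 - math.log2(num_codewords) / n


constant_weight = math.comb(12, 6)      # codewords in a 6-out-of-12 code
berger = 2 ** 8                         # 12-bit Berger code: 8 information bits

print(constant_weight, f"{redundancy(constant_weight, 12):.1%}")
print(berger, f"{redundancy(berger, 12):.1%}")
```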

Constant Weight Codes
  Advantage
    – Less redundancy than Berger codes
  Disadvantage
    – Non-separable
    – Need decoding logic to convert codeword back to binary message

Burst Error
  Burst error
    – Common: multi-bit errors tend to be clustered
    – E.g., a noise source affects a contiguous set of bus lines
  Length of burst error
    – Number of bits between first and last error, inclusive
    – May wrap around from last to first bit of codeword
  Example: original codeword 00000000
    – 00111100 is a burst error of length 4
    – 00110100 is a burst error of length 4
    – Any number of errors may lie between the first and last error

Cyclic Codes
  Special class of linear code
    – Any codeword shifted cyclically is another codeword
    – Used to detect burst errors
  Less redundancy required to detect burst errors than general multi-bit errors
    – Some distance 2 codes can detect all burst errors of length 4
    – Detecting all possible 4-bit errors requires a distance 5 code

Cyclic Redundancy Check (CRC) Code
  Most widely used cyclic code
    – Uses binary alphabet based on GF(2)
  CRC code is (n,k) block code
    – Formed using generator polynomial, g(x), also called the code generator
    – g(x) is a degree n-k polynomial (same degree as number of check bits)
  $g(x) = g_{n-k} x^{n-k} + \dots + g_2 x^2 + g_1 x + g_0$
  $c(x) = m(x)\,g(x)$

Codewords of the (6,4) CRC code with g(x) = x^2 + 1, c(x) = m(x) g(x)

Message  m(x)               g(x)     c(x)                   Codeword
0000     0                  x^2 + 1  0                      000000
0001     1                  x^2 + 1  x^2 + 1                000101
0010     x                  x^2 + 1  x^3 + x                001010
0011     x + 1              x^2 + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  x^4 + x^2              010100
0101     x^2 + 1            x^2 + 1  x^4 + 1                010001
0110     x^2 + x            x^2 + 1  x^4 + x^3 + x^2 + x    011110
0111     x^2 + x + 1        x^2 + 1  x^4 + x^3 + x + 1      011011
1000     x^3                x^2 + 1  x^5 + x^3              101000
1001     x^3 + 1            x^2 + 1  x^5 + x^3 + x^2 + 1    101101
1010     x^3 + x            x^2 + 1  x^5 + x                100010
1011     x^3 + x + 1        x^2 + 1  x^5 + x^2 + x + 1      100111
1100     x^3 + x^2          x^2 + 1  x^5 + x^4 + x^3 + x^2  111100
1101     x^3 + x^2 + 1      x^2 + 1  x^5 + x^4 + x^3 + 1    111001
1110     x^3 + x^2 + x      x^2 + 1  x^5 + x^4 + x^2 + x    110110
1111     x^3 + x^2 + x + 1  x^2 + 1  x^5 + x^4 + x + 1      110011

CRC Code
  Linear block code
    – Has G-matrix and H-matrix
    – Rows of G-matrix are shifted versions of the generator polynomial
  $G = \begin{bmatrix} g_{n-k} & \cdots & g_1 & g_0 & 0 & \cdots & 0 \\ 0 & g_{n-k} & \cdots & g_1 & g_0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & g_{n-k} & \cdots & g_1 & g_0 \end{bmatrix}$

CRC Code Example
  (6,4) CRC code generated by $g(x) = x^2 + 1$
  $G = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$

Systematic CRC Codes
  To obtain systematic CRC code
    – Codewords formed using Galois division
    – Convenient because an LFSR can be used to perform the division
  $c(x) = m(x)\,x^{n-k} + r(x)$
  $r(x) = \text{remainder of } \frac{m(x)\,x^{n-k}}{g(x)}$

Galois Division Example
  Encode $m(x) = x^2 + x$ with $g(x) = x^2 + 1$
    – Requires dividing $m(x)\,x^{n-k} = x^4 + x^3$ by g(x)
    – In binary: 11000 divided by 101 gives quotient 111 and remainder 11
    – Remainder $r(x) = x + 1$
    – $c(x) = m(x)\,x^{n-k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1$
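
The division above can be sketched in a few lines by storing each polynomial as an integer whose bits are the coefficients (bit i holds the coefficient of x^i). The function names are illustrative assumptions.

```python
def gf2_mod(dividend, divisor):
    """Remainder of polynomial division over GF(2); subtraction is XOR."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        # Align the divisor with the leading 1 of the dividend and cancel it.
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend


def crc_encode_systematic(message, g):
    """c(x) = m(x) x^(n-k) + r(x), with n-k equal to the degree of g(x)."""
    degree = g.bit_length() - 1
    shifted = message << degree
    return shifted | gf2_mod(shifted, g)


g = 0b101                                         # g(x) = x^2 + 1
print(bin(gf2_mod(0b11000, g)))                   # remainder of x^4 + x^3
print(format(crc_encode_systematic(0b0110, g), "06b"))

# Every systematic codeword is divisible by g(x).
print(all(gf2_mod(crc_encode_systematic(m, g), g) == 0 for m in range(16)))
```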

Codewords of the systematic (6,4) CRC code with g(x) = x^2 + 1, c(x) = m(x) x^2 + r(x)

Message  m(x)               g(x)     r(x)   c(x)                   Codeword
0000     0                  x^2 + 1  0      0                      000000
0001     1                  x^2 + 1  1      x^2 + 1                000101
0010     x                  x^2 + 1  x      x^3 + x                001010
0011     x + 1              x^2 + 1  x + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  1      x^4 + 1                010001
0101     x^2 + 1            x^2 + 1  0      x^4 + x^2              010100
0110     x^2 + x            x^2 + 1  x + 1  x^4 + x^3 + x + 1      011011
0111     x^2 + x + 1        x^2 + 1  x      x^4 + x^3 + x^2 + x    011110
1000     x^3                x^2 + 1  x      x^5 + x                100010
1001     x^3 + 1            x^2 + 1  x + 1  x^5 + x^2 + x + 1      100111
1010     x^3 + x            x^2 + 1  0      x^5 + x^3              101000
1011     x^3 + x + 1        x^2 + 1  1      x^5 + x^3 + x^2 + 1    101101
1100     x^3 + x^2          x^2 + 1  x + 1  x^5 + x^4 + x + 1      110011
1101     x^3 + x^2 + 1      x^2 + 1  x      x^5 + x^4 + x^2 + x    110110
1110     x^3 + x^2 + x      x^2 + 1  1      x^5 + x^4 + x^3 + 1    111001
1111     x^3 + x^2 + x + 1  x^2 + 1  0      x^5 + x^4 + x^3 + x^2  111100

Generating Check Bits for CRC Code
  Use LFSR
    – With characteristic polynomial equal to g(x)
    – Append n-k 0's to end of message
  Example: $m(x) = x^2 + x + 1$ and $g(x) = x^3 + x + 1$
  [Figure: 3-stage LFSR with initial state 000; the message 111 followed by the appended 0's (111000) is shifted in; final state is 010]
  Final state after shifting equals the remainder

Checking CRC Codeword
  Checking received codeword for errors
    – Shift codeword into LFSR with the same characteristic polynomial as used to generate it
    – If final state of LFSR is non-zero, then an error occurred
  [Figure: 3-stage LFSR with initial state 000; the codeword to check, 111010, is shifted in]
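
A bit-serial software model of the dividing LFSR shows both uses: generating check bits and checking a received codeword. This is a behavioral sketch only (it models the register contents as the running remainder, not a specific flip-flop and XOR placement); `lfsr_remainder` is a hypothetical name.

```python
def lfsr_remainder(bits, g):
    """Shift bits in, most significant first, reducing by g(x) whenever the
    register overflows its degree. Returns the final register state."""
    degree = g.bit_length() - 1
    state = 0
    for bit in bits:
        state = (state << 1) | bit
        if state >> degree:          # a 1 shifted out of the top stage
            state ^= g               # feedback: subtract g(x) over GF(2)
    return state


g = 0b1011                                        # g(x) = x^3 + x + 1

# Message 111 followed by three appended 0's gives the check bits.
check_bits = lfsr_remainder([1, 1, 1, 0, 0, 0], g)
print(format(check_bits, "03b"))

# Shifting in the full codeword leaves an all-zero state when error-free.
print(lfsr_remainder([1, 1, 1, 0, 1, 0], g))
print(lfsr_remainder([1, 0, 1, 0, 1, 0], g))      # one bit flipped
```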

Selecting Generator Polynomial
  Key issue for CRC codes
  If first and last bit of polynomial are 1
    – Will detect burst errors of length n-k or less
  If generator polynomial is a multiple of (x+1)
    – Will detect any odd number of errors
  If $g(x) = (x+1)\,p(x)$ where p(x) is primitive of degree n-k-1 and $n < 2^{n-k-1}$
    – Will detect single, double, triple, and odd errors

Commonly Used CRC Generators

CRC code                        Generator polynomial
CRC-5 (USB token packets)       x^5 + x^2 + 1
CRC-12 (Telecom systems)        x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth)   x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)               x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO)                    x^64 + x^4 + x^3 + x + 1

Fault Tolerance Schemes
  Adding fault tolerance to design
    – Improves dependability of system
    – Requires redundancy: hardware, time, information

Hardware Redundancy
  Involves replicating hardware units
    – At any level of design: gate-level, module-level, chip-level, board-level
  Three basic forms
    – Static (also called passive): masks faults rather than detecting them
    – Dynamic (also called active): detects faults and reconfigures to spare hardware
    – Hybrid: combines active and passive approaches

Static Redundancy
  Masks faults so no erroneous outputs
    – Provides uninterrupted operation
    – Important for real-time systems, where there is no time to reconfigure or retry operation
  Simple and self-contained
    – No need to update or roll back system state

Triple Module Redundancy (TMR)
  Well-known static redundancy scheme
    – Three copies of module
    – Use majority voter to determine final output
    – Error in one module out-voted by other two
  [Figure: Module 1, Module 2, and Module 3 feeding a majority voter]
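
A bitwise majority voter is a one-line function. The sketch below votes on whole words at once, one bit position at a time; the sample values are made up for the illustration.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority: each output bit is 1 if at least two
    of the three corresponding input bits are 1."""
    return (a & b) | (b & c) | (a & c)


good = 0b1010
faulty = 0b0110          # assumed erroneous output of one module

# The single faulty module is out-voted regardless of its position.
print(bin(majority_vote(good, good, faulty)))
print(bin(majority_vote(faulty, good, good)))
```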

TMR Reliability and MTTF
  TMR works if any 2 modules work
    – $R_m$ = reliability of each module
    – $R_v$ = reliability of voter
  $R_{TMR} = R_v \left[ R_m^3 + C^3_2 R_m^2 (1 - R_m) \right] = R_v (3R_m^2 - 2R_m^3)$
  MTTF for TMR
  $MTTF_{TMR} = \int_0^{\infty} R_{TMR}\,dt = \int_0^{\infty} R_v (3R_m^2 - 2R_m^3)\,dt = \int_0^{\infty} e^{-\lambda_v t} (3e^{-2\lambda_m t} - 2e^{-3\lambda_m t})\,dt = \frac{3}{2\lambda_m + \lambda_v} - \frac{2}{3\lambda_m + \lambda_v}$

Comparison with Simplex
  Neglecting failure rate of voter
  $MTTF_{TMR} = \frac{3}{2\lambda_m} - \frac{2}{3\lambda_m} = \frac{5}{6\lambda_m} = \frac{5}{6}\,MTTF_{simplex}$
  TMR has lower MTTF, but
    – Can tolerate temporary faults
    – Higher reliability for short mission times

Comparison with Simplex
  Crossover point
  $R_{TMR} = R_{simplex}$
  $3e^{-2\lambda_m t} - 2e^{-3\lambda_m t} = e^{-\lambda_m t}$
  Solving: $t = \frac{\ln 2}{\lambda_m} \approx 0.7\,MTTF_{simplex}$
  $R_{TMR} > R_{simplex}$ when
    – Mission time is shorter than 70% of the simplex MTTF
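
The crossover can be checked numerically. This sketch neglects the voter (as the comparison above does) and uses an assumed module failure rate of 1 per unit time, so that time is already normalized to the simplex MTTF.

```python
import math


def r_simplex(lam, t):
    """Exponential failure law for a single module."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """TMR reliability with a perfect voter: 3 Rm^2 - 2 Rm^3."""
    rm = r_simplex(lam, t)
    return 3 * rm ** 2 - 2 * rm ** 3


lam = 1.0                                 # assumed failure rate; MTTF_simplex = 1
crossover = math.log(2) / lam             # about 0.69 of MTTF_simplex

for t in (0.1, crossover, 1.0):
    print(f"t={t:.3f}  simplex={r_simplex(lam, t):.4f}  TMR={r_tmr(lam, t):.4f}")

print("MTTF ratio TMR/simplex:", (5 / (6 * lam)) / (1 / lam))
```

Before the crossover TMR is the more reliable choice; after it, the three modules together are more likely to have accumulated two failures than a single module is to have failed once.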

N-Modular Redundancy (NMR)
  NMR
    – N modules along with majority voter
    – TMR is a special case
    – Number of failed modules masked = $\lfloor (N-1)/2 \rfloor$
  As N increases, MTTF decreases
    – But reliability for short missions increases
  If goal is only to tolerate temporary faults
    – TMR is sufficient

Interwoven Logic
  Replace each gate with 4 gates
    – Using an interconnection pattern that automatically corrects errors
  Traditionally not as attractive as TMR
    – Requires lots of area overhead
  Renewed interest by researchers investigating emerging nanoelectronic technologies

Interwoven Logic with 4 NOR Gates
  [Figure: original network of NOR gates 1 to 4 with inputs X and Y, and its interwoven version in which each gate is replaced by four copies (1a-1d, 2a-2d, 3a-3d, 4a-4d)]

Example of Error on Third Y Input
  [Figure: example signal values in the interwoven NOR network (gates 1a-4d) with an error on the third copy of input Y]

Dynamic Redundancy
  Involves
    – Detecting fault
    – Locating faulty hardware unit
    – Reconfiguring system to use spare fault-free hardware unit

Unpowered (Cold) Spares
  Advantage
    – Extends lifetime of spares
  Equations
    – Assume spare does not fail until powered
    – Assume perfect reconfiguration capability
  $R_{w/cold\_spare}(t) = (1 + \lambda t)\,e^{-\lambda t}$
  $MTTF_{w/cold\_spare} = \frac{2}{\lambda}$

Unpowered (Cold) Spares
  One cold spare doubles MTTF
    – Assuming faults are always detected and reconfiguration circuitry never fails
  Drawbacks of cold spare
    – Extra time to power up and initialize
    – Cannot be used to help in detecting faults
    – Fault detection requires either periodic offline testing, or online testing using time or information redundancy

Powered (Hot) Spares
  Can use spares for online fault detection
  One approach is duplicate-and-compare
    – If outputs mismatch, then a fault occurred
    – Run diagnostic procedure to determine which module is faulty and replace it with spare
  Any number of spares can be used
  [Figure: Module A and Module B outputs feed a comparator that produces the output and an agree/disagree signal; a spare module stands by]

Pair-and-a-Spare
  Avoids halting system to run diagnostic procedure when fault occurs
  [Figure: two duplicate-and-compare pairs (Modules A/B and Modules C/D), each with its own comparator and agree/disagree signal, feeding a switch that selects the output]

TMR/Simplex
  When one module in TMR fails
    – Disconnect one of the remaining modules
    – Improves MTTF while retaining advantages of TMR when there are 3 good modules
  TMR/Simplex
    – Reliability always better than either TMR or simplex alone

Comparison of Reliability vs Time
  [Figure: reliability (0.3 to 1) versus normalized mission time T/MTTF (0 to 1) for simplex, TMR, and TMR/simplex]

Hybrid Redundancy
  Combines both static and dynamic redundancy
    – Masks faults like static
    – Detects and reconfigures like dynamic

TMR with Spares
  If TMR module fails
    – Replace with spare (can be either hot or cold spare)
  While system has three working modules
    – TMR will provide fault masking for uninterrupted operation

Self-Purging Redundancy
  Uses threshold voter instead of majority voter
    – Threshold voter outputs 1 if the number of inputs that are 1 is greater than the threshold
    – Otherwise outputs 0
  Requires hot spares

Self-Purging Redundancy
  [Figure: Modules 1 to 5, each connected through an elementary switch to a threshold voter (threshold 2); inset shows the elementary switch, built from an RS flip-flop set at initialization and AND gates driven by the module and voter outputs]

Self-Purging Redundancy
  Compared with 5MR, self-purging with 5 modules
    – Can tolerate up to 3 failing modules (5MR cannot)
    – Cannot tolerate two modules failing simultaneously (5MR can)
  Compared with TMR with 2 spares, self-purging with 5 modules
    – Has simpler reconfiguration circuitry
    – Requires hot spares (TMR with spares can use either hot or cold spares)

Time Redundancy
  Advantage
    – Less hardware
  Drawback
    – Cannot detect permanent faults
  If error detected
    – System needs to roll back to a known good state before resuming operation

Repeated Execution
  Repeat operation twice
    – Simplest time redundancy approach
    – Detects temporary faults occurring during one execution (but not both), since they cause a mismatch in results
  Can reuse same hardware for both executions
    – Only one copy of functional hardware needed

Repeated Execution
  Requires mechanism for storing and comparing results of both executions
    – In a processor, can store in memory or on disk and use software to compare
  Main cost
    – Additional time for redundant execution and comparison

Multi-threaded Redundant Execution
  Can be used in a processor-based system that can run multiple threads
    – Two copies of thread executed concurrently
    – Results compared when both complete
  Takes advantage of processor's built-in capability to exploit processing resources
    – Reduces execution time
    – Can significantly reduce performance penalty

Multiple Sampling of Outputs
  Done at circuit level
    – Sample once at end of normal clock cycle
    – Sample again after a delay of Δt
    – Two samples compared to detect mismatch, which indicates an error occurred
  Detects faults whose duration is less than Δt
  Performance overhead depends on
    – Size of Δt relative to normal clock period

Multiple Sampling of Outputs
  Simple approach using two latches
  [Figure: output captured by a main latch clocked by Clk and by a shadow latch clocked by Clk+Δt; comparing the two latches produces the error signal]

Multiple Sampling of Outputs
  Approach using stability checker at output
  [Figure: timing diagram showing the normal clock period followed by a stability checking period of length Δt, and a stability checker circuit that gates the output with the checking period signal to produce the error signal]

Diverse Recomputation
  Use same hardware, but perform computation differently the second time
    – Can detect permanent faults that affect only one computation
  For arithmetic or logical operations
    – Shift operands when performing second computation [Patel 1982]
    – Detects permanent fault affecting only one bit-slice

Information Redundancy
  Based on error detecting and correcting codes
  Advantage
    – Detects both permanent and temporary faults
    – Implemented with less hardware overhead than using multiple copies of module
  Disadvantage
    – More complex design

Error Detection
  Error detecting codes used to detect errors
  If error detected
    – Roll back to previous known error-free state
    – Retry operation

Rollback
  Requires adding storage to save previous state
    – Amount of rollback depends on latency of error detection mechanism
  Zero-latency error detection
    – Rollback implemented by preventing system state from updating
  If errors detected after n cycles
    – Need rollback that restores system to state at least n clock cycles earlier

Checkpoint
Execution divided into set of operations
Before each operation executed
– Checkpoint created where system state saved
If any error detected during operation
– Rollback to last checkpoint and retry operation
If multiple retries fail
– Operation halts and system flags that permanent fault has occurred
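
A minimal software sketch of the checkpoint-and-retry flow, assuming the system state fits in a Python dict and that each operation signals a detected error by raising `RuntimeError`. The names `run_with_checkpoints`, `PermanentFaultError`, and `make_flaky_increment` are hypothetical and exist only for this illustration.

```python
import copy


class PermanentFaultError(Exception):
    """Raised when an operation still fails after all allowed retries."""


def run_with_checkpoints(state, operations, max_retries=3):
    """Run each operation on `state` (a dict). A checkpoint is taken before
    every operation; a detected error triggers rollback and retry."""
    for op in operations:
        checkpoint = copy.deepcopy(state)        # save state before operation
        for _attempt in range(max_retries + 1):  # first try plus retries
            try:
                op(state)                        # may raise on detected error
                break                            # success: go to next operation
            except RuntimeError:
                state.clear()                    # rollback to last checkpoint
                state.update(copy.deepcopy(checkpoint))
        else:
            # every attempt failed: treat the fault as permanent
            raise PermanentFaultError("operation failed after retries")
    return state


def make_flaky_increment(failures):
    """Hypothetical operation: corrupts the state and reports an error
    `failures` times (a temporary fault), then increments x correctly."""
    remaining = [failures]

    def op(state):
        if remaining[0] > 0:
            remaining[0] -= 1
            state["x"] = -999                    # partial, corrupted update
            raise RuntimeError("error detected")
        state["x"] += 1

    return op


print(run_with_checkpoints({"x": 0}, [make_flaky_increment(1),
                                      make_flaky_increment(0)]))  # {'x': 2}
```

The retry limit is what separates the two fault classes here: a temporary fault disappears within a few retries, while a fault that survives every retry is flagged as permanent.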
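
Error Detection
Encode outputs of circuit with error detecting code
Non-codeword output indicates error
[Figure: m inputs drive both the functional logic (k outputs) and a check bit generator (c check bits); a checker examines the k outputs together with the c check bits and produces the error indication]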

Self-Checking Checker
Has two outputs
– Normal error-free case: (1,0) or (0,1)
– If equal to each other, then error: (0,0) or (1,1)
Cannot have single error indicator output
– Stuck-at-0 fault on output could never be detected

Totally Self-Checking Checker
Requires three properties
Code Disjoint
– All codeword inputs mapped to codeword outputs, and all non-codeword inputs mapped to non-codeword outputs
Fault Secure
– For all codeword inputs, checker in presence of fault will either produce correct codeword output or non-codeword output (not incorrect codeword)
Self-Testing
– For each fault, at least one codeword input gives error indication

Duplicate-and-Compare
Equality checker indicates error
Undetected error can occur only if common-mode fault affects both copies
Only faults after stems detected
Over 100% overhead (including checker)
[Figure: primary inputs fan out at stems to two copies of the functional logic; an equality checker compares the two sets of outputs and produces the error indication]

Single-Bit Parity Code
Totally self-checking checker formed by removing final gate from XOR tree
[Figure: functional logic and parity prediction logic feed the checker, which has two error indication outputs, EI0 and EI1]

Single-Bit Parity Code
Cannot detect errors affecting an even number of bits
Can ensure no even bit errors by generating each output with independent cone of logic
– Only single bit errors can occur due to single point fault
– Typically requires a lot of overhead

Parity-Check Codes
Each check bit is parity for some set of output bits
Example: 6 outputs and 3 check bits

                 Z1  Z2  Z3  Z4  Z5  Z6  c1  c2  c3
Parity Group 1    1   0   0   1   1   0   1   0   0
Parity Group 2    0   1   1   0   0   0   0   1   0
Parity Group 3    0   0   0   0   0   1   0   0   1
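
The example code above can be exercised with a short Python sketch. The grouping is taken from the table; even parity is assumed, and the function names `check_bits` and `has_error` are hypothetical. The sketch also shows the limitation that a double error inside one parity group goes undetected, while a double error spread across two groups is caught.

```python
# Rows are parity groups 1-3; columns are outputs Z1..Z6 (from the table).
PARITY_GROUPS = [
    [1, 0, 0, 1, 1, 0],  # c1 covers Z1, Z4, Z5
    [0, 1, 1, 0, 0, 0],  # c2 covers Z2, Z3
    [0, 0, 0, 0, 0, 1],  # c3 covers Z6
]


def check_bits(z):
    """Even-parity check bits (c1, c2, c3) for outputs z = [Z1..Z6]."""
    return [sum(g & b for g, b in zip(group, z)) % 2
            for group in PARITY_GROUPS]


def has_error(z, c):
    """True if the outputs z do not agree with the predicted check bits c."""
    return check_bits(z) != list(c)


z = [1, 0, 1, 1, 0, 1]
c = check_bits(z)
print(c)                                   # [0, 1, 1]

flip_z1_z2 = [0, 1, 1, 1, 0, 1]            # errors in two different groups
flip_z1_z4 = [0, 0, 1, 0, 0, 1]            # two errors inside group 1
print(has_error(flip_z1_z2, c))            # True
print(has_error(flip_z1_z4, c))            # False: even error in one group
```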

Parity-Check Codes
For c check bits and k functional outputs
– $2^{ck}$ possible parity check codes
Can choose code based on structure of circuit to minimize undetected error combinations
Fanouts in circuit determine possible error combinations due to single-point fault

Checker for Parity-Check Codes
Constructed from single-bit parity checkers and two-rail checkers
[Figure: three parity checkers, one per parity group (inputs Z1, Z4, Z5, c1; Z2, Z3, c2; Z6, c3), feed two two-rail checkers that produce the outputs E0 and E1]

Two-Rail Checkers
Totally self-checking two-rail checker
[Figure: gate-level two-rail checker with inputs A0, B0, A1, B1; four AND gates feed two OR gates that produce the outputs C0 and C1]
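
The behavior of a two-rail checker can be checked exhaustively in Python. The sketch below assumes the common AND-OR realization, with (A0, A1) and (B0, B1) as the two input rail pairs and C0 = A0·B0 + A1·B1, C1 = A0·B1 + A1·B0; the exact wiring of the figure is not recoverable from the labels alone, so treat the equations as an assumption. The helper names are hypothetical.

```python
from functools import reduce
from itertools import product


def two_rail_checker(a, b):
    """Combine two rail pairs a=(a0, a1), b=(b0, b1) into one pair (c0, c1)."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 & b0) | (a1 & b1)
    c1 = (a0 & b1) | (a1 & b0)
    return c0, c1


def is_two_rail_codeword(pair):
    """A rail pair is valid when its two bits are complementary."""
    return pair[0] != pair[1]


def check_pairs(pairs):
    """Reduce any number of rail pairs to one pair with a chain of checkers."""
    return reduce(two_rail_checker, pairs)


# Output is a codeword exactly when both input pairs are codewords.
for a0, a1, b0, b1 in product((0, 1), repeat=4):
    out = two_rail_checker((a0, a1), (b0, b1))
    both_valid = is_two_rail_codeword((a0, a1)) and is_two_rail_codeword((b0, b1))
    assert is_two_rail_codeword(out) == both_valid

print(check_pairs([(0, 1), (1, 0), (1, 0)]))  # valid pairs in: (0, 1)
print(check_pairs([(0, 1), (1, 1), (1, 0)]))  # one invalid pair in: (1, 1)
```

The exhaustive loop demonstrates only the input-to-output mapping; it does not model internal faults, so it says nothing about the fault-secure or self-testing properties.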

Berger Codes
Inverter-free circuit
– Inverters only at primary inputs
– Can be synthesized using only algebraic factoring [Jha 1993]
Only unidirectional errors possible for single point faults
– Can use unidirectional code
– Berger code gives 100% coverage
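
The coverage claim can be checked by brute force on a small example. The sketch assumes the common Berger construction in which the check symbol is the binary count of 0s in the information bits; the function names and the 4-bit word size are hypothetical choices for the illustration.

```python
from itertools import product


def berger_encode(info):
    """Append the binary count of 0s in `info` as check bits (MSB first)."""
    k = len(info)
    width = k.bit_length()            # enough bits to count from 0 to k
    zeros = list(info).count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(width))]
    return list(info) + check


def is_codeword(word, k):
    """True if the check bits of `word` match its first k information bits."""
    return berger_encode(word[:k]) == list(word)


def apply_unidirectional(word, positions, direction):
    """Force the bits at `positions` to `direction`: 0 models 1->0 errors,
    1 models 0->1 errors. All errors in one word share the same direction."""
    return [direction if i in positions else b for i, b in enumerate(word)]


print(berger_encode([1, 0, 1, 1]))    # [1, 0, 1, 1, 0, 0, 1]

# Every unidirectional error on every 4-bit codeword yields a non-codeword.
K = 4
for info in product((0, 1), repeat=K):
    cw = berger_encode(info)
    for direction in (0, 1):
        for mask in product((0, 1), repeat=len(cw)):
            positions = {i for i, m in enumerate(mask) if m}
            bad = apply_unidirectional(cw, positions, direction)
            if bad != cw:
                assert not is_codeword(bad, K)
```

The check works because 1-to-0 errors can only raise the count of 0s in the information bits while they can only lower the stored count, and 0-to-1 errors do the reverse, so the two can never agree again.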

Constant Weight Codes
Non-separable with lower redundancy
Drawback: need decoding logic to convert codeword back to its original binary value
Can use for encoding states of FSM
– No need for decoding logic

Error Correction
Information redundancy can also be used to mask errors
Not as attractive as TMR because logic for predicting check bits very complex
However, very good for memories
– Check bits stored with data
– Errors do not propagate in memories as in logic circuits, so SEC-DED usually sufficient

Error Correction
Memories very dense and prone to errors
– Especially due to single-event upsets (SEUs) from radiation
SEC-DED check bits stored in memory
32-bit word, SEC-DED requires 7 check bits
– Increases size of memory by 7/32 = 21.9%
64-bit word, SEC-DED requires 8 check bits
– Increases size of memory by 8/64 = 12.5%
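
The check bit counts above follow from the Hamming bound plus one extra parity bit. As a rough sketch (the function name `sec_ded_check_bits` is hypothetical): single-error correction needs the smallest r with 2^r >= k + r + 1 for k data bits, and double-error detection adds one overall parity bit.

```python
def sec_ded_check_bits(data_bits):
    """Number of check bits for a SEC-DED code over `data_bits` data bits."""
    r = 1
    # r syndrome bits must name each of the data_bits + r positions,
    # plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1                      # extra overall parity bit gives DED


for k in (8, 32, 64):
    c = sec_ded_check_bits(k)
    print(k, c, f"{100 * c / k:.1f}%")
# 8 5 62.5%
# 32 7 21.9%
# 64 8 12.5%
```

The relative overhead falls as the word widens, which is one reason wide memory words are attractive for ECC.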
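
Memory ECC Architecture
[Figure: on a write, Data Word In passes through Generate Check Bits, and the write data word and write check bits are stored in memory; on a read, check bits calculated from the read data word are combined with the read check bits in Generate Syndrome, which drives Correct Data to produce Data Word Out]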

Hamming Code for ECC RAM
[Figure: ECC RAM block diagram. On the generate side, the Z input data bits feed a Hamming check bit generator (c check bits) and a parity bit generator; the RAM core stores N words of Z+c+1 bits per word. On the detect/correct side, a second Hamming check bit generator and parity bit generator feed the Hamming check and parity check, which drive the bit error correction circuit that produces the output data]

Error Type                     Condition
No bit error                   Hamming check bits match, no parity error
Single-bit correctable error   Hamming check bits mismatch, parity error
Double-bit error detection     Hamming check bits mismatch, no parity error

                 Z1  Z2  Z3  Z4  Z5  Z6  Z7  Z8  c1  c2  c3  c4
Parity Group 1    1   1   0   1   1   0   1   0   1   0   0   0
Parity Group 2    1   0   1   1   0   1   1   0   0   1   0   0
Parity Group 3    0   1   1   1   0   0   0   1   0   0   1   0
Parity Group 4    0   0   0   0   1   1   1   1   0   0   0   1
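
The parity groups and the error-type conditions above can be combined into a small SEC-DED model. This is a sketch under assumptions: the extra parity bit is taken to be even parity over all data and check bits (the slide does not say which bits it covers), a stored word is laid out as Z1..Z8, c1..c4, parity, and the function names are hypothetical.

```python
# Rows are parity groups 1-4; columns are data bits Z1..Z8 (from the table).
GROUPS = [
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]


def hamming_check_bits(z):
    """Check bits c1..c4 for data bits z = [Z1..Z8]."""
    return [sum(g & b for g, b in zip(row, z)) % 2 for row in GROUPS]


def encode(z):
    """Return the stored word: 8 data bits, 4 check bits, 1 parity bit."""
    z = list(z)
    c = hamming_check_bits(z)
    p = sum(z + c) % 2                # assumed: parity over data and check bits
    return z + c + [p]


def decode(word):
    """Return (data, status) following the error-type conditions."""
    z, c, p = list(word[:8]), list(word[8:12]), word[12]
    syndrome = [a ^ b for a, b in zip(hamming_check_bits(z), c)]
    mismatch = any(syndrome)
    parity_error = (sum(z + c) % 2) != p
    if not mismatch and not parity_error:
        return z, "no error"
    if mismatch and not parity_error:
        return z, "double error"      # detected, not correctable
    if mismatch:
        # Single-bit error: the syndrome equals the column of the bad bit.
        for i in range(8):
            if [row[i] for row in GROUPS] == syndrome:
                z[i] ^= 1             # flip the faulty data bit back
                return z, "corrected"
        if sum(syndrome) == 1:
            return z, "corrected"     # a check bit flipped; data is intact
        return z, "uncorrectable"     # syndrome matches no single-bit error
    return z, "corrected"             # only the parity bit itself flipped


word = encode([1, 0, 1, 1, 0, 0, 1, 0])
word[3] ^= 1                          # single-bit upset in Z4
print(decode(word))                   # ([1, 0, 1, 1, 0, 0, 1, 0], 'corrected')
```

The case where only the parity bit flips (check bits match, parity error) is not listed in the table; the sketch treats it as a correctable single-bit error with the data left untouched.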

Memory ECC
SEC-DED generally very effective
– Memory bit-flips tend to be independent and uniformly distributed
– If bit-flip occurs, gets corrected next time memory location accessed
Main risk is if memory word not accessed for long time
– Multiple bit-flips could accumulate

Memory Scrubbing
Every location in memory read on periodic basis
– Reduces chance of multiple errors accumulating in a memory word
Can be implemented by having memory controller cycle through memory during idle periods

Multiple-Bit Upsets (MBU)
Can occur due to single SEU
– Typically occur in adjacent memory cells
Memory interleaving used
– To prevent MBUs from resulting in multiple bit errors in same word

Memory (physical order of adjacent cells):
Word1  Word2  Word3  Word4  Word1  Word2  Word3  Word4  Word1  Word2  Word3  Word4
Bit1   Bit1   Bit1   Bit1   Bit2   Bit2   Bit2   Bit2   Bit3   Bit3   Bit3   Bit3
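
The interleaved layout above amounts to a simple index mapping, sketched here in Python with 0-based indices. The constants match the four-word, three-bit example; the function names are hypothetical.

```python
WORDS = 4  # words interleaved in one physical row (as in the layout above)
BITS = 3   # bits per word shown in the layout


def physical_column(word, bit, words=WORDS):
    """Physical cell position of bit `bit` of word `word` (0-based)."""
    return bit * words + word


def logical_cell(column, words=WORDS):
    """Inverse mapping: physical column -> (word, bit)."""
    return column % words, column // words


def errors_per_word(start_column, burst_len, words=WORDS):
    """Bit errors per word when `burst_len` adjacent cells are upset."""
    hits = {}
    for col in range(start_column, start_column + burst_len):
        word, _bit = logical_cell(col, words)
        hits[word] = hits.get(word, 0) + 1
    return hits


print(errors_per_word(2, 4))  # {2: 1, 3: 1, 0: 1, 1: 1}
print(errors_per_word(2, 5))  # {2: 2, 3: 1, 0: 1, 1: 1}
```

With an interleaving factor of four, an upset spanning up to four adjacent cells leaves at most one bit error in any word, which SEC-DED can correct; a wider upset can put two errors into one word.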

Type | Issues | Goal | Examples | Techniques
Long-Life Systems | Difficult or Expensive to Repair | Maximize MTTF | Satellites, Spacecraft, Implanted Biomedical | Dynamic Redundancy
Reliable Real-Time Systems | Error or Delay Catastrophic | Fault Masking Capability | Aircraft, Nuclear Power Plant, Air Bag Electronics, Radar | TMR
High Availability Systems | Downtime Very Costly | High Availability | Reservation System, Stock Exchange, Telephone Systems | No Single Point of Failure; Self-Checking Pairs; Fault Isolation
High Integrity Systems | Data Corruption Very Costly | High Data Integrity | Banking, Transaction Processing, Database | Checkpointing; Time Redundancy; ECC; Redundant Disks
Mainstream Low-Cost Systems | Reasonable Level of Failures Acceptable | Meet Failure Rate Expectations at Low Cost | Consumer Electronics, Personal Computers | Often None; Memory ECC; Bus Parity; Changing as Technology Scales

Concluding Remarks
Many different fault-tolerant schemes
Choosing scheme depends on
Types of faults to be tolerated
– Temporary or permanent
– Single or multiple point failures
– etc.
Design constraints
– Area, performance, power, etc.

Concluding Remarks
As technology scales
– Circuits increasingly prone to failure
– Achieving sufficient fault tolerance will be major design issue