12-14 September 2005
Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration
K. Zhang, R. F. DeMara, and C. A. Sharma
University of Central Florida
Technical Objective: Autonomous FPGA Regeneration
Overhead from Unutilized Spares (weight, size, power)
  Redundancy: increases with amount of spare capacity
  Regeneration: weakly-related to number recovery capacity
Granularity of Fault Coverage (resolution where fault handled)
  Redundancy: restricted at design-time
  Regeneration: variable at recovery-time
Fault-Resolution Latency (availability via downtime required to handle fault)
  Redundancy: based on time required to select spare resource
  Regeneration: based on time required to find suitable recovery
Quality of Repair (likelihood and completeness)
  Redundancy: determined by adequacy of spares available (?)
  Regeneration: affected by multiple characteristics (+ or -)
Autonomous Operation (recover without outside intervention)
  Redundancy: yes
  Regeneration: yes
Increased availability without pre-configured spares …
Everyday example: spare tire vs. can of fix-a-flat
NASA Moon, Mars, and Beyond:
Realize 10’s of years service life ???
Stardust: 110 FPGAs …
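TMR with Jiggling [Garvie, Thompson]
  Online Recovery: Yes
  Basis for Recovery: Requires that 2 datapaths are operational
  Test Vectors: Pseudo-Exhaustive
  Availability: 100% for single fault, 0% thereafter
  Externally-supplied Elements: 2 of 3 Majority Voter
  Resource Recycling: Yes
  Pre-determined Limits: Single datapath
  Power Consumption: 3n+v
[Vigander01]
  Online Recovery: No
  Basis for Recovery: Design complexity
  Test Vectors: Exhaustive
  Availability: Non-deterministic
  Externally-supplied Elements: GA Controller, function test vectors
  Resource Recycling: Yes
  Pre-determined Limits: None
  Power Consumption: 3n+v+r
[Lohn, Larchev, DeMara03]
  Online Recovery: No
  Basis for Recovery: Design complexity
  Test Vectors: Pseudo-Exhaustive Functional Test
  Availability: Non-deterministic
  Externally-supplied Elements: GA Controller, function test vectors
  Resource Recycling: Yes
  Pre-determined Limits: None
  Power Consumption: 2n+r
[Lach98]
  Online Recovery: No
  Basis for Recovery: Available spares
  Test Vectors: Not Addressed
  Availability: Either complete or none
  Externally-supplied Elements: Device test vectors and controller
  Resource Recycling: No
  Pre-determined Limits: Only one faulty CLB per tile
  Power Consumption: 2n+r
STARS [Abramovici01]
  Online Recovery: Yes
  Basis for Recovery: Available spares
  Test Vectors: Exhaustive Resource Test
  Availability: Only ~93% regardless of fault occurrence
  Externally-supplied Elements: Test Reconfiguration Controller + device test vectors
  Resource Recycling: Yes
  Pre-determined Limits: Available spares within routing chokepoints
  Power Consumption: s • (c+r)
[Keymeulen, Stoica, Zebulum00]
  Online Recovery: No
  Basis for Recovery: Depends on characteristics at design time
  Test Vectors: Exhaustive during or after evolution
  Availability: Non-deterministic
  Externally-supplied Elements: None at runtime
  Resource Recycling: No
  Pre-determined Limits: Depends on redundancy during design
  Power Consumption: n • (1 + f(g))
Competitive Runtime Reconfiguration (CRR) [DeMara05]
  Online Recovery: Yes
  Basis for Recovery: Recovery complexity
  Test Vectors: None
  Availability: Adaptable
  Externally-supplied Elements: Optional RAM … RAM coverage is intrinsic; no test vectors
  Resource Recycling: Yes
  Pre-determined Limits: None
  Power Consumption: 2n+r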
Fault Recovery Characteristics of Selected Approaches
Previous Work on Fault Recovery
Normalized Power Consumption (Energy per Operation):
  n-plex solution using n redundant devices
  Reconfiguration cost r
  Gate-level redundancy g
  Updated with scan rate s on c CLBs
Exploiting Population Information
• Population contains more robust information than individuals
  Utilize this information for robust fault detection, faster regeneration, increased diversity for adaptation
• Detect Failure and Isolate Faulty Resources
  Detect by inconsistencies among the population
  Isolate faults using outlier identification and aging
• Realize Regeneration
  Recovery Complexity << Design Complexity
  Utilize diverse raw material during regeneration vs. isolated re-design
  Temporal consensus directs search
• Adaptable Performance based on Online Inputs
  The population evolves to changing physical environment, input vectors, and target application while increasing availability
Procedural Flow under Consensus-Based Evaluation
Initialization: Population partitioned into functionally-identical yet physically-distinct half-configurations
PRIMARY LOOP
  Selection: choose FPGA configuration(s) labeled L and R
  Detection: apply functional inputs to compute FPGA outputs using L, R
  Decision: are L, R results discrepancy-free (L=R)? YES / NO
  Fitness Adjustment: update fitness of only L and R based on detection results
  Decision: is either L's or R's fitness < Repair Threshold? If so, invoke Genetic Operators only once and only on L or R
  Adjust Controls: detection mode, overlap interval, ...
Initialization
  Partition P into sub-populations of size |P|/2 to designate physical FPGA left-half or right-half resource utilization
Consensus Based Evaluation
  Discrepancy Operator compares CL and CR
  Four Fitness States: Pristine, Suspect, Under Repair, Refurbished
Regeneration
  Genetic Operators recover based on Reintroduction Rate
  Operators only applied once, then offspring returned to “service” without concern about increasing fitness
Consensus-Based Evaluation (CBE) Overview
• Uses a Relative Fitness Measure
  Pairwise discrepancy checking yields relative fitness measure
  Broad temporal consensus in the population used to determine fitness metric
  Transition between Fitness States occurs in the population
  Provides graceful degradation in presence of changing environments, applications and inputs, since this is a moving measure
• Test Inputs = Normal Inputs for Data Throughput
  CBE does not utilize additional functional or resource test vectors
  Potential for higher availability as regeneration is integrated with normal operation
[Figure: state-transition diagram for the i-th half-configuration. States: primordial, pristine, suspect, under repair, refurbished. Eleven numbered transitions are labeled with the detection outcome (L = R or L ≠ R) and with comparisons of the fitness fi against the thresholds fOT and fRT; repairs are marked as partial or complete. The diagram carries the labels COMPETITION and EVOLUTION.]
State Transitions during lifetime of i-th Half-Configuration
Configuration Health States
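Discrepancy Operator
• Baseline Discrepancy Operator is a dyadic operator with binary output:
$$\operatorname{discrepancy}(C_i^L, C_i^R) = \begin{cases} 0 & \text{if } Z(C_i^L) = Z(C_i^R) \\ 1 & \text{otherwise} \end{cases}$$
• Z(Ci) is FPGA data throughput output of configuration Ci
• RS (Hamming Distance): $\sum_j \left( C_{i,j}^L \ \mathrm{EOR}\ C_{i,j}^R \right)$
• WTA (Equivalence): $\bigwedge_j \left( C_{i,j}^L \ \mathrm{EOR}\ C_{i,j}^R \right)$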
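The two flavors of discrepancy check can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and each output word Z(C) is assumed to be given as a list of bits.

```python
from typing import Sequence


def equivalence_discrepancy(z_left: Sequence[int], z_right: Sequence[int]) -> int:
    """Binary discrepancy: 0 if the two output words match, 1 otherwise."""
    return 0 if list(z_left) == list(z_right) else 1


def hamming_discrepancy(z_left: Sequence[int], z_right: Sequence[int]) -> int:
    """Graded discrepancy: number of output bit positions that differ."""
    if len(z_left) != len(z_right):
        raise ValueError("output words must have the same width")
    # XOR each pair of bits and count the positions where they disagree.
    return sum(l ^ r for l, r in zip(z_left, z_right))
```

The binary form only says whether L and R disagree; the Hamming form also says by how many output bits.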
Selection and Repair Process
Maintain Availability
  Choose Pristine, Suspect, Refurbished individuals in that order
Enable Regeneration
  Choose Under-Repair individuals subject to Re-introduction rate (RR)
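The selection policy above could look roughly like this in Python. It is a sketch under stated assumptions: individuals are plain dictionaries with a "state" key, the function name is hypothetical, and the re-introduction rate is interpreted as the probability of picking an Under-Repair individual on a given selection.

```python
import random
from typing import Dict, List, Optional

# Preference order for keeping the device available.
SERVICE_ORDER = ["pristine", "suspect", "refurbished"]


def select_individual(population: List[Dict], reintroduction_rate: float,
                      rng: random.Random) -> Optional[Dict]:
    """Pick one half-configuration for the next evaluation."""
    under_repair = [ind for ind in population if ind["state"] == "under_repair"]
    # Occasionally re-introduce an Under-Repair individual (assumed semantics).
    if under_repair and rng.random() < reintroduction_rate:
        return rng.choice(under_repair)
    # Otherwise prefer Pristine, then Suspect, then Refurbished.
    for state in SERVICE_ORDER:
        candidates = [ind for ind in population if ind["state"] == state]
        if candidates:
            return rng.choice(candidates)
    # Nothing else is left: fall back to Under-Repair individuals, if any.
    return rng.choice(under_repair) if under_repair else None
```

In the scheme described here, L and R would each be drawn this way from their own sub-population.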
Fitness State Adjustment / Repair
[Flowchart: fitness state adjustment and repair. Recoverable nodes:
  Discrepancy? Increase L's & R's DV
  Evaluation Occurrence > EW?
  Is the individual Pristine? Mark individual as Suspect
  Is individual Suspect? Is its fi > DVR? Mark individual as Under Repair
  Is individual Under Repair? Invoke Genetic Operators only once and only on L or R
  Is its fi < DVO? Mark individual as Refurbished
  Is individual Refurbished?
  Calculate the DVO, DVR for this EW and isolate faulty individuals over the Sliding Window samples by three Std Dev
  Adjust controls & goto Selection process]
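One possible reading of the state adjustment is sketched below in Python. The edges of the flowchart are not fully recoverable, so the transitions are assumptions: Pristine becomes Suspect on a discrepancy, Suspect becomes Under Repair when its DV exceeds DVR, and Under Repair becomes Refurbished when its DV falls below DVO. All names are hypothetical; the genetic-operator call and the outgoing edges of Refurbished are left out.

```python
from enum import Enum


class Health(Enum):
    PRISTINE = "pristine"
    SUSPECT = "suspect"
    UNDER_REPAIR = "under repair"
    REFURBISHED = "refurbished"


def adjust_state(state: Health, discrepancy: bool, dv: float,
                 dv_repair: float, dv_operational: float) -> Health:
    """Return the next health state of one half-configuration (assumed edges)."""
    if state is Health.PRISTINE and discrepancy:
        return Health.SUSPECT          # first observed discrepancy
    if state is Health.SUSPECT and dv > dv_repair:
        return Health.UNDER_REPAIR     # DV above the repair threshold DVR
    if state is Health.UNDER_REPAIR and dv < dv_operational:
        return Health.REFURBISHED      # DV back below the operational threshold DVO
    return state                       # no recoverable transition applies
```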
Individual’s Fitness: Evaluation Window
[Plot: Probability of Selection Containing all K items vs. Number of Selections with Replacement]
Each individual subjected to sufficient random operational inputs for accurate assessment
For combinational logic, EW is determined on the basis of input word width
Genetic operators invoked once every EW iterations on Under-Repair individuals to avoid unnecessary modifications
EW = 600 random run-time inputs provide a 99.5% certainty of the test being exhaustive and conclusive
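The 99.5% figure can be checked with a short Python calculation. The sketch assumes that inputs are drawn uniformly and independently, and that there are K = 64 distinct input words (two 3-bit operands); it uses the standard inclusion-exclusion formula for the probability that N draws with replacement cover all K items. The function name is hypothetical.

```python
from math import comb


def prob_all_inputs_seen(num_inputs: int, num_draws: int) -> float:
    """Probability that num_draws uniform draws with replacement
    hit every one of num_inputs distinct input words at least once."""
    # Inclusion-exclusion over the number j of input words that are missed.
    return sum(
        (-1) ** j * comb(num_inputs, j) * (1 - j / num_inputs) ** num_draws
        for j in range(num_inputs + 1)
    )


print(prob_all_inputs_seen(64, 600))  # roughly 0.995 under these assumptions
```

A wider input word needs a longer window for the same certainty, which is why EW is tied to the input word width.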
Population Comparison: Fitness Indices
Population Consensus Sliding Window
  Population behavior is periodically sampled to determine current oracle value for global fitness metric
  Thresholds need to be current but not updated more frequently than necessary
  Updating thresholds occurs after 25% of individuals have completed EW
  Ensures a fast-moving relative measure for adaptability
Case study:
• |C| = 20 individuals … |CL| = |CR| = |C|/2
• Sliding Window = 5 EW
• 5/20 = 25% individuals evaluated == “sufficient”
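A sliding-window oracle with three-standard-deviation outlier isolation might be sketched as follows. This is an assumption-laden illustration: the class and method names are hypothetical, the window is taken to hold the DVs of the five most recently completed evaluation windows, and a new DV is compared against the window before being added to it.

```python
from collections import deque
from statistics import mean, pstdev


class ConsensusWindow:
    """Keeps the most recent DV samples and flags outliers against them."""

    def __init__(self, size: int = 5, num_sigmas: float = 3.0):
        self.samples = deque(maxlen=size)  # oldest sample drops out automatically
        self.num_sigmas = num_sigmas

    def is_outlier(self, dv: float) -> bool:
        # No verdict until the window holds a full set of samples.
        if len(self.samples) < self.samples.maxlen:
            return False
        center = mean(self.samples)    # current oracle value
        spread = pstdev(self.samples)  # population standard deviation
        return abs(dv - center) > self.num_sigmas * spread

    def record(self, dv: float) -> None:
        self.samples.append(dv)
```

With a window of size 5 and a population of 20, the oracle is refreshed once a quarter of the individuals have been evaluated, which matches the 25% figure in the case study.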
Integer Multiplier Case Study
Automated Creation of a Population of Multipliers:
– Building blocks
    Half-Adder: 18 templates created
    Full-Adder: 24 templates
    Parallel-And: 1 template created
– OR, AND, XOR, NOR, NAND and NOT functions can be assigned to a LUT
– Randomly select templates for instantiation in modules
– Strict Feed-Forward flow enforced
– XOR function excluded from initial designs to increase design space
– Average of 21 CLBs utilized for a 3bit x 3bit Multiplier
– Configurations divided into two groups, each subset using exclusive resources
GA Parameters & Experiments
Speciation
  Two-point crossover between individuals from same sub-group
  Crossover points chosen to prevent intra-CLB crossover
  Breeding occurs exclusively among members of sub-populations
  Maintains non-interfering resource use among L, R
GA operators
  External-Module-Crossover
  Internal-Module-Crossover
  Internal-Module-Mutation
GA parameters
  Population size: 20 individuals
  Crossover rate: 5%
  Mutation rate: up to 80% per bit
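Two-point crossover restricted to CLB boundaries can be illustrated with the Python sketch below. The genome encoding is an assumption: each individual is a flat list of bits with a fixed, hypothetical number of bits per CLB, and both parents come from the same sub-population (both L or both R).

```python
import random
from typing import List, Tuple


def two_point_crossover(parent_a: List[int], parent_b: List[int],
                        bits_per_clb: int,
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap a run of whole CLBs between two parents of equal length."""
    if len(parent_a) != len(parent_b) or len(parent_a) % bits_per_clb:
        raise ValueError("parents must have equal length in whole CLBs")
    num_clbs = len(parent_a) // bits_per_clb
    # Pick two distinct CLB boundaries so no CLB is ever split.
    lo, hi = sorted(rng.sample(range(num_clbs + 1), 2))
    lo, hi = lo * bits_per_clb, hi * bits_per_clb
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b
```

Because the cut points fall only on CLB boundaries, each CLB in a child is inherited intact from one parent.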
Demonstrate …
  Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
  Elimination of additional test vectors
Experiments …
  Fault Isolation Characteristics
  Regenerative Experiments
Isolation of a single faulty individual with 1-out-of-64 impact
• Outliers are identified after EW iterations have elapsed
• Expected D.V. = (1/64)*600 = 9.375 from individual impacted by fault
• Isolated faulty individual’s DV differs from the average DV by 33 after 1 or more observation intervals of length EW
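The expected discrepancy value is simple arithmetic and can be reproduced in Python. The function name is hypothetical, and the sketch assumes every input word is equally likely, so a fault that corrupts k of the 64 input combinations shows up in k/64 of the evaluations.

```python
def expected_dv(faulty_inputs: int, total_inputs: int, evaluation_window: int) -> float:
    """Expected number of discrepancies seen within one evaluation window."""
    return faulty_inputs / total_inputs * evaluation_window


print(expected_dv(1, 64, 600))   # 9.375, the 1-out-of-64 case
```

The same function gives the 10-out-of-64 case with `expected_dv(10, 64, 600)`.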
[Figure: instantaneous DV (point values) for a sample individual in population and population oracles (solid lines); Sliding Window indicated]
Isolation of a single faulty L individual with 10-out-of-64 impact
Compare with 1-out-of-64 fault impact
  Expected DV of (10/64)*600 = 93.75 for faulty configuration
  One isolation will be complete approx. once in every 93.75/5 = 19 Sliding Windows
  Fault Isolation achieved is 100%
Isolation of 8 faulty individuals L4&R4 with 1-out-of-64 impact
• Expected isolations do not occur approx. 40% of the time
  Average discrepancy value of the population is higher
  Outlier isolation difficult
  Multiple faulty individuals; discrepancies scattered
Regeneration Performance
Parameters:
  Difference (vs. Hamming Distance)
  Evaluation Window: EW = 600
  Suspect Threshold: DVS = 1-6/600 = 99%
  Repair Threshold: DVR = 1-4/600 = 99.3%
  Re-introduction rate: r = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.
3x3 Multiplier Experiment
Number | Fault Location | Failure Type | Correctness after Fault | Total Iterations | Discrepant Iterations | Repair Iterations | Final Correctness | Effective Throughput
1 | CLB3,LUT0,Input1 | Stuck-at-1 | 52 / 64 | 17920100 | 421123 | 1194 | 64 / 64 | 97.65
2 | CLB6,LUT0,Input1 | Stuck-at-0 | 33 / 64 | 802050 | 17034 | 47 | 64 / 64 | 97.87
3 | CLB5,LUT2,Input0 | Stuck-at-1 | 22 / 64 | 3134660 | 68027 | 193 | 64 / 64 | 97.83
4 | CLB7,LUT2,Input0 | Stuck-at-0 | 38 / 64 | 8158280 | 185193 | 513 | 64 / 64 | 97.73
5 | CLB9,LUT0,Input1 | Stuck-at-0 | 40 / 64 | 2332670 | 71613 | 219 | 64 / 64 | 96.93
Average | | | 32.6 / 64 | 6469550 | 152598 | 433 | 64 / 64 | 97.6
Conclusion
• Repair Complexity should be more tractable than Design Complexity, given diverse “spare” designs
• Population-Centric Assessment
  Provides adaptability and self-calibrating autonomy with a relative assessment method
• Run-time Fault Management
  Can be realized using consensus-driven assessment methods, and using information contained in the population
  Integrate Detection, Isolation, Repair under a single Population-based technique