University of Michigan, Electrical Engineering and Computer Science
A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor
Jason Blome, Scott Mahlke, Daryl Bradley*, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.

Embedded Everywhere
Patterson and Hennessy 2005
• Not just cellphones
• Safety critical applications:
  ► Automotive
  ► Healthcare

Embedded Domain Constraints
• Power efficient performance
  ► Longer clock cycle times
  ► Increased logic depth between stages
  ► Higher area ratio of combinational logic to state elements
• Less speculative state
  ► Potentially less masking
• Limited real estate
All of these high-level constraints affect the behavior of faults and the potential of fault tolerance techniques

Objectives
• Understand the effects of transient faults on a typical embedded design
  ► Architectural contributions to soft error effects
  ► Production-grade core
    • Reference synthesis flow
    • Design for test methodologies
• Simulate faults in both combinational and sequential logic

Soft Error Rate Contributions
[Figure, Shivakumar 2002: soft error rate contributions]
[Figure, Mitra 2005: pie chart of soft error rate contributions with segments Static Combinational Logic, Unprotected SRAMs, and Sequential Elements; values shown: 49%, 11%, 40%]
Increasing contribution of faults in combinational logic to the overall soft error rate

Processor Model
[Figure: ARM926EJ-S block diagram. Blocks: Register Bank, Data Interface, Instruction Address Logic, Data Address Logic, Multiply, ALU, Shift, Instruction Decode, Instruction Fetch, Data Cache and MMU, Instruction Cache and MMU, Bus Interface, Write Buffer/Bus Interface, Mux Array]
• ARM926EJ-S
• Cell library characterized for 130 nm
• 5 ns clock cycle time

Analysis Infrastructure
[Figure: fault injection/error analysis framework. Components: testbench, benchmark, reference design, test design, fault injection scheduler, error checking and logging, report generation]
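The reference-design/test-design arrangement can be illustrated with a minimal Python sketch. It is not the authors' framework: `ToyDesign`, `run_campaign`, and the two-register accumulator behaviour are hypothetical stand-ins for an RTL design and its testbench. Both copies run the same benchmark in lockstep, a single bit flip is injected into the test copy at a scheduled cycle, and the number of differing state bits is logged after every cycle.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDesign:
    """Hypothetical stand-in for an RTL design: two 8-bit registers."""
    regs: dict = field(default_factory=lambda: {"acc": 0, "cnt": 0})

    def step(self, data_in: int) -> None:
        # Next-state logic: accumulate the input and count cycles (8-bit wraparound).
        self.regs["acc"] = (self.regs["acc"] + data_in) & 0xFF
        self.regs["cnt"] = (self.regs["cnt"] + 1) & 0xFF

    def flip_bit(self, reg: str, bit: int) -> None:
        # Model a transient fault as a single bit flip in a state element.
        self.regs[reg] ^= 1 << bit


def count_bit_errors(ref: ToyDesign, test: ToyDesign) -> int:
    """Number of state bits that differ between the reference and test designs."""
    return sum(bin(ref.regs[r] ^ test.regs[r]).count("1") for r in ref.regs)


def run_campaign(benchmark, fault_cycle, reg, bit):
    """Run both designs in lockstep; return the bit-error count after each cycle."""
    ref, test = ToyDesign(), ToyDesign()
    log = []
    for cycle, data_in in enumerate(benchmark):
        ref.step(data_in)
        test.step(data_in)
        if cycle == fault_cycle:
            test.flip_bit(reg, bit)  # scheduled fault injection
        log.append(count_bit_errors(ref, test))  # error checking and logging
    return log


errors_per_cycle = run_campaign([1, 2, 3, 4], fault_cycle=1, reg="acc", bit=0)
print(errors_per_cycle)
```

Even in this toy model the single flipped bit grows to a two-bit difference in later cycles, because the corrupted accumulator value feeds the next addition.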

Fault Masking
• Logical: faulted value does not affect logical operation of the circuit
• Latching-Window: the fault pulse does not reach a state element within the latching window
• Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
• Architectural/Software: incorrect state is overwritten before it is read
[Figure: latching window around the CLK edge, bounded by tsetup and thold]
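Two of these masking categories can be made concrete with a short Python sketch. It is a simplification under stated assumptions: logical masking is shown for a single AND gate, and latching-window masking is reduced to an interval-overlap test. The function names and all timing values in the usage lines are hypothetical; only the 5 ns clock period comes from the Processor Model slide.

```python
def and_gate(a: int, b: int) -> int:
    return a & b


def logically_masked(a: int, b: int) -> bool:
    """True if flipping input `a` leaves the AND gate output unchanged."""
    return and_gate(a, b) == and_gate(a ^ 1, b)


def latching_window_masked(pulse_start, pulse_end, t_clk, t_setup, t_hold) -> bool:
    """True if the fault pulse misses the window [t_clk - t_setup, t_clk + t_hold]."""
    window_start = t_clk - t_setup
    window_end = t_clk + t_hold
    overlaps = pulse_start <= window_end and pulse_end >= window_start
    return not overlaps


# Logical masking: the other AND input is 0, so the faulted input cannot matter.
print(logically_masked(a=1, b=0))   # True
print(logically_masked(a=1, b=1))   # False

# Latching-window masking, times in ns (setup/hold values are assumed).
print(latching_window_masked(1.0, 1.5, t_clk=5.0, t_setup=0.2, t_hold=0.1))  # True
print(latching_window_masked(4.7, 4.9, t_clk=5.0, t_setup=0.2, t_hold=0.1))  # False
```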

Observed Error Rates

Faults Occurring in Registers
Error Site                  Error Rate   Masking Rate
Microarchitectural State    94%          6%
Architectural State         7%           93%
Top-level Ports             4%           96%

Faults Occurring in Combinational Logic
Error Site                  Error Rate   Masking Rate
Microarchitectural State    16%          84%
Architectural State         4%           96%
Top-level Ports             3%           97%

At the software interface, the error rates for the two fault types are within 3 percentage points of each other (architectural state: 7% vs. 4%)
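The relationship between the two sets of figures can be checked with a few lines of Python. This sketch assumes the first table belongs to register faults and the second to combinational logic faults, following the order of the slide's labels; the dictionary layout and variable names are illustrative only.

```python
# Error rates (percent) copied from the slide. The assignment of tables to
# fault types is an assumption based on the order of the slide's labels.
error_rate = {
    "registers": {"microarchitectural": 94, "architectural": 7, "ports": 4},
    "combinational": {"microarchitectural": 16, "architectural": 4, "ports": 3},
}

gaps = {}
for site in ("microarchitectural", "architectural", "ports"):
    reg = error_rate["registers"][site]
    comb = error_rate["combinational"][site]
    gaps[site] = abs(reg - comb)
    # Masking rate is the complement of the error rate.
    print(f"{site:20s} masking: {100 - reg}% vs {100 - comb}%, gap: {gaps[site]} points")
```

The gap is large at the microarchitectural level and shrinks to at most 3 points at the architectural state and top-level ports.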

Observed Error Rates

Average Bit Errors
Cycle   Faults Occurring in Registers   Faults Occurring in Combinational Logic
1       1.26                            41.49
2       3.19                            45.33
3       3.06                            47.76
4       5.52                            49.54

Faults in combinational logic have a much more dramatic effect on system state

Architectural Errors per Cycle
[Figure: two plots, "Faults Occurring in Registers" and "Faults Occurring in Combinational Logic". x-axis: Number of Architectural Errors (log scale, 1 to 1000); y-axis: Relative Frequency (0 to 1); one series per cycle, Cycle 1 through Cycle 10]

Architectural Corruption Characteristics
[Figure: "Bits per Architectural Register Corrupted". x-axis: Corrupt Bits per Architectural Register (1 to 31); y-axis: Relative Frequency (0 to 0.7); one series per cycle, Cycle 1 through Cycle 10]
[Figure: "Number of Architectural Registers Corrupted". x-axis: Number of Corrupted Architectural Registers (1 to 26); y-axis: Relative Frequency (0 to 0.7); one series per cycle, Cycle 1 through Cycle 10]

Results Summary
• Faults occurring in logic:
  ► Will likely be much more frequent in embedded designs
  ► Tend to have a more dramatic effect on system state
  ► Multi-bit/multi-register architectural errors are common
• Design for test methodologies can greatly impact soft error characteristics
• Error rates at the software interface are consistent with those observed in high-performance microprocessors

Traditional Error Detection/Protection
• Reliable Encoding
  ► ECC/Parity
    • Limited use for faults in logic
    • Unclear where/how much to protect
• Redundant Computation
  ► In space
    • Area/energy overhead
  ► In time
    • Energy overhead
    • Requires performance slack
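One limitation of simple encodings that is relevant to multi-bit corruption can be shown with a small Python sketch of a single even-parity bit. The slide does not specify a particular code, so the 8-bit word and the helper names `parity` and `detects` are assumptions made for illustration.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def detects(original: int, corrupted: int) -> bool:
    """True if a parity bit stored alongside `original` flags `corrupted`."""
    return parity(original) != parity(corrupted)


word = 0b1011_0010
single_flip = word ^ 0b0000_0001   # one bit corrupted
double_flip = word ^ 0b0000_0011   # two bits corrupted

print(detects(word, single_flip))  # True: an odd number of flips changes parity
print(detects(word, double_flip))  # False: an even number of flips goes unnoticed
```

A single parity bit catches any odd number of flipped bits in the protected word but misses every even number, so its coverage depends on how many bits a fault corrupts.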

Case Study I
[Figure: ARM926EJ-S block diagram with the fault site in IRoute and the resulting errors annotated per cycle]
Fault site: IRoute
• Cycle 1: 51 Errors
  ► instr_reg_ID[0, 16, 22, 31]
  ► ID_decode_info[0, 16, 31]
  ► stored_instr[29, 30]
• Cycle 2: 51 Errors
  ► instr_reg_EX[0, 16, 22, 31]
  ► EX_decode_info[0, 16, 31]
• Cycle 3: 17 Errors
  ► ALU_out[0, 1, 2, 3, 4, 5, 6]
• Cycle 4: 18 Errors
  ► ALU_result_wb[0, 1, 2, 3, 4, 5, 6]
• Cycle 5: 29 Errors
  ► Reg0_reg[0, 1, 2, 3, 4, 5, 6]

Case Study II
[Figure: ARM926EJ-S block diagram with the fault site in IPipe and the resulting errors annotated per cycle]
Fault site: IPipe
• Cycle 1: 9 Errors
  ► instr_reg_ID[3, 12, 17, 18, 24, 26, 29, 30, 31]
• Cycle 2: 62 Errors
  ► instr_reg_EX
  ► shifter_data_opEx_regShifter_data_reg
  ► alu_cc_reg
• Cycle 3: 49 Errors
  ► Shifter_data_EX
  ► alu_out_reg
• Cycle 4: 183 Errors
  ► writeback and forwarding state
  ► register bank

Fault Characteristics
• Case Study I: uCORE.uIRoute.U600
  ► First cycle error sites: 51 errors
    • uIRoute.INSTRHeld_reg[0]
    • uIRoute.INSTRHeld_reg[16]
    • uIRoute.INSTRHeld_reg[22]
    • uIRoute.INSTRHeld_reg[31]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[0]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[16]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]
    • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[29]
    • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[30]
• Case Study II: uCORE.u9EJ.uARM9.uCORECTL.uIPIPE.U3626
  ► First cycle error sites: 9 errors
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[3]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[12]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[17]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[18]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[24]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[26]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[29]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[30]
    • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]

Embedded Design Space Potential
• Leverage significant signal fanout
• Determine that a fault has occurred during the cycle that it occurs
  ► Transition detection circuits
• Selectively deploy fault detection units
  ► Intersection of high fanout fault targets
  ► No roll-back necessary: simply flush the pipeline
  ► Low cost/area overhead critical for embedded designs
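One reading of "intersection of high fanout fault targets" is to look for state elements that are corrupted by more than one fault site, since a detector placed there covers several faults at once. The Python sketch below applies that reading to the first-cycle error sites listed on the Fault Characteristics slide. The set-based approach and variable names are assumptions, not the authors' selection method, and the hierarchical instance prefixes are dropped for brevity.

```python
# First-cycle error sites copied from the "Fault Characteristics" slide
# (hierarchical prefixes dropped for brevity).
case_study_1 = {
    "INSTRHeld_reg[0]", "INSTRHeld_reg[16]", "INSTRHeld_reg[22]", "INSTRHeld_reg[31]",
    "IDarmDeint_reg[0]", "IDarmDeint_reg[16]", "IDarmDeint_reg[31]",
    "StoredInstrInt_reg[29]", "StoredInstrInt_reg[30]",
}
case_study_2 = {
    f"IDarmDeint_reg[{bit}]" for bit in (3, 12, 17, 18, 24, 26, 29, 30, 31)
}

# State elements hit by both fault sites are candidate detector locations.
shared_targets = case_study_1 & case_study_2
print(sorted(shared_targets))  # ['IDarmDeint_reg[31]']
```

With only two fault sites the overlap is a single bit; a real selection would need the fanout sets of many more injected faults.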

Conclusion
• Design domain critical:
  ► Affects fault behavior
  ► Limits applicable tolerance techniques
• Key observations:
  ► Faults in combinational logic are much more likely in embedded designs
  ► Faults in combinational logic behave dramatically differently than those in state elements
  ► Fault fanout offers potential for low overhead detection

Soft Error Terminology
[Figure: diagram relating a transient fault at a transistor to a soft error]

Dependence on Fault Duration
[Figure: Frequency of Expressed Errors (y-axis, 0 to 0.12) versus Fault Duration (x-axis, 1500 to 4500)]

Pulse Detection
[Figure: pulse detection circuit. A flip-flop (inputs D and CLK, outputs Q and ~Q) is paired with a shadow latch, and the circuit produces an error signal]
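The slide names only the circuit's signals, so the following Python sketch is one possible interpretation rather than the circuit shown. It assumes the flip-flop samples D at the clock edge, the shadow latch samples D a short delay later, and the error signal is raised when the two values differ. The waveform representation, function names, and timing values are all hypothetical.

```python
def sample(waveform, t):
    """Value of a piecewise-constant signal at time t.

    `waveform` is a list of (start_time, value) pairs sorted by start_time.
    """
    value = waveform[0][1]
    for start, v in waveform:
        if start <= t:
            value = v
    return value


def pulse_detected(d_waveform, t_edge, shadow_delay):
    """Assumed behaviour: error is raised if flip-flop and shadow latch disagree."""
    q = sample(d_waveform, t_edge)                      # main flip-flop
    shadow = sample(d_waveform, t_edge + shadow_delay)  # shadow latch
    return q != shadow


# D is 0 except for a transient pulse to 1 between 4.9 ns and 5.2 ns.
faulty_d = [(0.0, 0), (4.9, 1), (5.2, 0)]
clean_d = [(0.0, 0)]

print(pulse_detected(faulty_d, t_edge=5.0, shadow_delay=0.5))  # True
print(pulse_detected(clean_d, t_edge=5.0, shadow_delay=0.5))   # False
```

Under these assumptions the error is flagged in the same cycle in which the pulse is latched.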

Microarchitectural Errors per Cycle
[Figure: two plots, "Faults Occurring in Registers" and "Faults Occurring in Combinational Logic". x-axis: Number of Microarchitectural Errors (log scale, 1 to 10000); y-axis: Relative Frequency (0 to 1); one series per cycle, Cycle 1 through Cycle 10]
Multi-bit errors are common for faults in combinational logic