ID. 20 Towards Wearout-aware and Accelerated Self-healing Digital Systems
Student: Xinfei Guo; Advisor: Mircea R. Stan
ECE Department, University of Virginia, USA
PHD FORUM
Motivation
Wearout Issues: BTI, HCI, TDDB, EM, etc.
More significant with extreme technology scaling
Increase design margins and worsen metrics
Cross-layer issues
Both reversible and irreversible parts
Previous Solutions
Design for the worst case (guard band):
  Hard to predict wearout;
  The worst case becomes even worse;
  Power, performance and area (PPA) overhead.
Track and monitor wearout, and adapt to it dynamically:
  Needed throughout the whole lifetime;
  The average case is skewed;
  Power, performance and area (PPA) overhead.
Reduce the stress during operation:
  Not applicable to high-performance systems.
No existing solution addresses irreversible wearout.
The boundary between reversible and irreversible wearout is unclear.
Let the system sleep when it gets tired: Completely Avoid Irreversible Wearout [2]
Reversible vs. Irreversible Wearout
A permanent part of wearout exists even for BTI
The majority of the trapped electrons are at low energy
Trap energy is much higher (~2eV)
Fast traps: Reversible; Slow traps: Irreversible
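The gap between the thermal energy of the carriers and the ~2 eV trap energy can be made concrete with a short calculation. The sketch below computes kT at the two temperatures used in the measurements (20°C and 110°C) and compares simple Boltzmann factors exp(-E/kT) for a 2 eV barrier. The Boltzmann-factor picture is an illustrative assumption, not the model used in this work, and the function names are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K
TRAP_ENERGY_EV = 2.0                 # "~2eV" trap energy quoted above


def thermal_energy_ev(temp_celsius: float) -> float:
    """Return kT in eV for a temperature given in degrees Celsius."""
    return BOLTZMANN_EV_PER_K * (temp_celsius + 273.15)


def boltzmann_factor(energy_ev: float, temp_celsius: float) -> float:
    """Return exp(-E / kT), a simplified (assumed) detrapping likelihood."""
    return math.exp(-energy_ev / thermal_energy_ev(temp_celsius))


kt_room = thermal_energy_ev(20.0)   # room temperature
kt_hot = thermal_energy_ev(110.0)   # accelerated-recovery temperature

# How much more likely a 2 eV barrier is overcome at 110 C than at 20 C,
# under the simplified Boltzmann-factor assumption.
speedup = boltzmann_factor(TRAP_ENERGY_EV, 110.0) / boltzmann_factor(TRAP_ENERGY_EV, 20.0)

print(f"kT at 20 C:  {kt_room:.4f} eV")
print(f"kT at 110 C: {kt_hot:.4f} eV")
print(f"Relative Boltzmann factor (110 C vs. 20 C): {speedup:.3g}")
```

Under this simplified view, kT stays far below the trap energy at both temperatures, yet the exponential dependence means a moderate temperature increase changes the factor by many orders of magnitude, which is in line with high temperature acting as an acceleration knob.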
Natural Recovery vs. Accelerated Recovery
Natural Recovery: 0 V, room temperature
Accelerated Recovery: high temperature, negative voltage, other energy sources (e.g. UV)
The irreversible part can be recovered!
The boundary is “soft” and controllable!
Circadian Rhythms: Completely Avoid Irreversible Wearout!
Irreversible wearout is completely avoided under an optimized sleep/active ratio
Reduced design margin (O(ln(days)) vs. O(ln(years)))
Improved average performance that increases with time!
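The O(ln(days)) vs. O(ln(years)) comparison can be illustrated with a toy calculation. The sketch below assumes a hypothetical log-in-time wearout law, a * ln(1 + t / t0), with arbitrary units and made-up constants; it is not the model from this work. It compares the margin needed when wearout is fully reset every day with the margin needed when wearout accumulates over a 10-year lifetime.

```python
import math

DAY_S = 24 * 3600.0
YEAR_S = 365 * DAY_S


def log_wearout(t_seconds: float, a: float = 1.0, t0: float = 1.0) -> float:
    """Hypothetical wearout-induced shift a * ln(1 + t / t0), arbitrary units."""
    return a * math.log1p(t_seconds / t0)


# Assumption: periodic rejuvenation fully resets wearout once per day,
# so the margin only has to cover one day of stress.
margin_with_rejuvenation = log_wearout(1 * DAY_S)

# Without rejuvenation the margin has to cover the whole assumed lifetime.
margin_without = log_wearout(10 * YEAR_S)

ratio = margin_with_rejuvenation / margin_without
print(f"Margin ratio (daily reset vs. 10-year accumulation): {ratio:.2f}")
```

The constants `a` and `t0` and the 10-year lifetime are placeholders; only the logarithmic shape matters for the comparison.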
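[Figure: logic circuit (gate and flip-flop) annotated with wearout effects: Slow! Timing Error! Failure! High Power!]
[Figure: ∆Vth and transistor state (ON/OFF) vs. time over alternating stress and recovery phases, showing a net Vth increase.]
[Figure: transistor cross-section (gate, oxide, source, drain, body) showing charge carriers trapping in and de-trapping from traps.]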
Accelerated Self-Healing [1]
Main Idea
Sleep should be used as an active recovery period for future electronics.
Electronic systems will benefit from such sleep periods with active
rejuvenation during which some of the effects of wearout (e.g. BTI) can be
reversed by several techniques (high temperature, negative voltage, UV light,
reverse current, etc.), thus leading to effective self-healing.
Test Setup
Commercial 40nm FPGA chips
Accelerated Testing Methodology
Knobs: V, T, AC/DC, Sleep/Active
Measurement Results and On-chip solutions
Recovery rate increased significantly
Utilize “Dark Silicon”
On-Chip Negative Voltage
On-chip reconfigurable elements
[Figure: reversible wearout vs. irreversible wearout; where is the boundary?]
Cross-layer Optimization Infrastructure
Modeling
Circuit-level: transient simulation, compatible with circuit simulators (e.g. SPICE, Spectre) [4];
Architecture-level: physically aware, parameterized high-level models that are integrated with simulators like gem5;
System-level: optimized scheduling algorithms that trade off lifetime against other metrics, such as energy efficiency.
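A system-level scheduler of this kind can be sketched as a selection over candidate sleep/active schedules. The example below is a toy under stated assumptions: the `Schedule` type, the cost function, and the rule that one hour of accelerated sleep undoes two hours of stress are all hypothetical placeholders, not results or algorithms from this work.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Schedule:
    active_hours: float
    sleep_hours: float


def duty_cycle(s: Schedule) -> float:
    """Fraction of time the system is active (available for work)."""
    return s.active_hours / (s.active_hours + s.sleep_hours)


def toy_irreversible_cost(s: Schedule) -> float:
    """Hypothetical residual wearout: stress hours not undone by sleep.

    Assumes (for illustration only) that one hour of accelerated sleep
    undoes two hours of stress.
    """
    return max(0.0, s.active_hours - 2.0 * s.sleep_hours)


def pick_schedule(
    candidates: Iterable[Schedule],
    irreversible_cost: Callable[[Schedule], float],
    availability_weight: float,
) -> Schedule:
    """Pick the schedule minimizing wearout cost plus weighted lost availability."""
    return min(
        candidates,
        key=lambda s: irreversible_cost(s) + availability_weight * (1.0 - duty_cycle(s)),
    )


candidates = [Schedule(1.0, 1.0), Schedule(2.0, 1.0), Schedule(4.0, 1.0), Schedule(6.0, 1.0)]
best = pick_schedule(candidates, toy_irreversible_cost, availability_weight=4.0)
print(best)
```

In a real flow the cost function would come from the circuit- and architecture-level models, and the availability term could be replaced by energy efficiency or another metric.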
[1] X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques,” Proc. of ACM/IEEE Design Automation Conference (DAC), June 2014.
[2] X. Guo, M. Stan, “Let the system sleep before getting tired – Avoid irreversible wearout by periodic accelerated rejuvenation,” Submitted.
[3] X. Guo, M. Stan, “MCPENS: Multiple-Critical-Path Embeddable NBTI Sensors for Dynamic Wearout Management,” IEEE Workshop on Silicon Errors in Logic–System Effects (SELSE-11), April 2015.
[4] A. Roelke, X. Guo, M. Stan, “A SPICE-Compatible BTI Transient Model Considering Accelerated Self-healing,” Ongoing.
[5] M. Stan, X. Guo, A. Roelke, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques in CMOS Circuits,” Proc. of GOMAC Tech, March 2015.
Case                Measurement (%)   Model (%)
20°C and 0 V        0.66              1
20°C and -0.3 V     16.7              14.4
110°C and 0 V       28.7              29.6
110°C and -0.3 V    72.4              70
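The agreement between measurement and model in the table above can be checked with a few lines. The sketch below uses only the four table rows; the dictionary layout and variable names are illustrative choices.

```python
# (measurement %, model %) for each case, copied from the table above.
cases = {
    "20C, 0V": (0.66, 1.0),
    "20C, -0.3V": (16.7, 14.4),
    "110C, 0V": (28.7, 29.6),
    "110C, -0.3V": (72.4, 70.0),
}

# Absolute model-vs-measurement deviation in percentage points.
abs_errors = {name: abs(model - meas) for name, (meas, model) in cases.items()}
worst_case = max(abs_errors, key=abs_errors.get)

# Ratio of the measured value under combined acceleration (110C, -0.3V)
# to the measured value under natural conditions (20C, 0V).
gain = cases["110C, -0.3V"][0] / cases["20C, 0V"][0]

for name, err in abs_errors.items():
    print(f"{name:>12}: |model - measurement| = {err:.2f} points")
print(f"Largest deviation: {worst_case}")
print(f"Measured gain, accelerated vs. natural: {gain:.0f}x")
```

The model stays within 2.5 percentage points of the measurement in every case, with the largest deviation in the combined high-temperature, negative-voltage case.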
[Figure: Test configuration: circuit under test (CUT, 75 LUTs) with En and rst inputs, 16-bit counter, reference clock fref.]
[Figure: Measurement setup: thermal chamber (chip inside), motherboard, data sampling chip, FPGA board and interface board, FPGA chip, connections to the FPGA programmer, the motherboard programmer, and the PC.]
[Figure: On-chip Solution 1: Negative voltage to recover header. Logic with vdd and vdd_high rails.]
[Figure: On-chip Solution 2: Reconfigurable self-healing tree connecting logic blocks; reconfigurable self-heating blocks.]
[Figure: On-chip Solution 3: Utilize dark silicon in multicore. Eight cores with a shared L3 cache; sleeping cores ("Zzzzzz...") shown with heat from surrounding cores.]
[Figure: Electron energy distribution at room temperature (probability axis).]
[Figure: Irreversible part under different recovery conditions. Frequency (MHz) vs. recovery time (min) for 110°C and -0.3 V, 0 V and room temperature (20°C), and fresh; annotated with the irreversible part from accelerated recovery and from regular recovery.]
[Figure: Frequency (MHz) vs. time (min) for the 1 hr vs. 1 hr, 2 hrs vs. 2 hrs, 4 hrs vs. 4 hrs, and 6 hrs vs. 6 hrs cases.]
[Figure: Irreversible wearout for different cases. Permanent part (MHz) for the 1 hr vs. 1 hr, 2 hrs vs. 2 hrs, 4 hrs vs. 4 hrs, and 6 hrs vs. 6 hrs cases.]
[Figure: Sequentiality of reversible and irreversible wearout. Frequency (MHz) vs. time (min): stress for 6 hours, then accelerated recovery for 6 hours; reversible part and recovered part annotated.]
[Figure: Average frequency (MHz) for 1 day and for 2 days, for the 6 hr vs. 6 hr, 4 hrs vs. 4 hrs, 2 hrs vs. 2 hrs, and 1 hr vs. 1 hr cases.]
[Figure: A Wearout-aware and Accelerated Self-healing Robust System, spanning device level, circuit level, architecture level, and system level. Six cores (Core 1 to Core 6), each with an MCPENS sensor on Path<N:0>, combined with accelerated self-healing.]
Embeddable Wearout Sensors [3]
• Track both wearout and recovery
• Small, fast and accurate
• Wearout-induced path re-ranking
Adaptive Solutions: DVFS, Body Bias, Circadian Rhythms
[Figure: Illustration of wearout vs. accelerated self-healing. Frequency (MHz) over wearout for 48 hours followed by accelerated recovery for 12 hours; annotated with 72.4%.]
[Figure annotations, electron energy distribution: ~kT (0.026 eV) at room temperature; energy that causes detrapping.]