Robust Systems
for Scaled CMOS and Beyond
Acknowledgment: Students & Collaborators
Subhasish Mitra
Robust Systems Group
Department of EE & Department of CS
Stanford University
Robust System Design
• Complexity: detect & fix design bugs
• CMOS reliability limits: tolerate errors
• Beyond silicon-CMOS: imperfection-immune logic
Perform correctly despite complexity & disturbances
What’s New?
                        Traditional Thinking    New approach
Design bugs             Pre-silicon             Post-silicon
Reliability failures    Avoid                   Tolerate at low cost
Beyond silicon-CMOS     Material processing     Imperfection-immune design
• Existing approaches: inadequate, expensive
Outline
• Introduction
• CMOS reliability limits: tolerate errors
• Beyond silicon-CMOS: imperfection-immune logic
• Conclusion
Technology Reliability Challenges
• System soft error rates increasing
• Fatal flip-flop errors
• Early-life failures (ELF)
• Burn-in: difficult, expensive
• Circuit aging & variations
• Worst-case guardbands expensive
[Figure: soft error rates for combinational logic, flip-flops, and SRAM (no ECC)]
Low-Cost Resilience
Circuit Failure Prediction
• New failure signature → ultra low-cost
• Early-life failures (ELF): burn-in difficult, Iddq ineffective
• Circuit aging: guardbands expensive
[Figure: failure rate vs. time, showing early-life failures (ELF), lifetime, and wearout]
Soft Error Resilience
• BISER + LEAP: errors reduced 2,000X
Software-orchestrated global optimization a MUST
BISER: Built-In Soft Error Resilience
45nm: up to 1,000X fewer errors vs. D-flip-flop
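[Figure: BISER flip-flop. A latch and a redundant latch (reused from scan test & debug), each with D, C, and Q terminals, feed a C-element with a weak keeper on its output. Other labels: IN, combinational logic, clock, OUT. Inset: C-element with inputs A and B.]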
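The role of the C-element in the BISER flip-flop can be illustrated with a small behavioral model. This is a minimal sketch, not the circuit from the slide: it assumes a non-inverting C-element (the slide does not state the polarity), models the weak keeper as a stored previous output, and uses hypothetical names (`CElement`, `biser_output`).

```python
class CElement:
    """Behavioral model of a 2-input C-element with a keeper on its output."""

    def __init__(self, initial=0):
        self.out = initial  # value currently held by the keeper

    def update(self, a, b):
        # When both inputs agree, the output follows them.
        # When they disagree, the output is not driven and the keeper holds it.
        if a == b:
            self.out = a
        return self.out


def biser_output(latch, redundant_latch, c_element):
    """Value seen by downstream logic for the given latch contents."""
    return c_element.update(latch, redundant_latch)


c = CElement(initial=0)
clean = biser_output(1, 1, c)            # both latches hold the correct value 1
upset_main = biser_output(0, 1, c)       # soft error flips the main latch
upset_redundant = biser_output(1, 0, c)  # soft error flips the redundant latch
print(clean, upset_main, upset_redundant)
```

In this model a single upset in either latch makes the two inputs disagree, so the output keeps its last correct value. An upset that hits both latches at once would pass through.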
Single Error Assumption Inadequate
• Single event multiple upsets increasing
LEAP: Layout by Error Aware Transistor Positioning
2,000X fewer errors vs. D-flip-flop
Optimized Resilience Essential
Select application-critical flip-flops
Optimize for cross-layer resilience
[Figure: power cost (0% to 20%) vs. % flip-flops critical (20 to 100), and vs. % critical flip-flops protected with logic parity (BISER for the rest); chip-level error rate (0.1 to 1) vs. % flip-flops critical]
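One simple way to picture the selection step is a greedy pass that protects the flip-flops contributing most to application-level errors until a power budget is used up. This is only an illustrative sketch, not the optimization behind the plots. The flip-flop names, vulnerability shares, per-flip-flop cost, and budget are all hypothetical, and the 1,000X reduction factor treats the "up to 1,000X" BISER figure as a best case.

```python
def select_flip_flops(vulnerability, cost_per_ff, budget):
    """Greedily protect the most vulnerable flip-flops within a power budget.

    vulnerability maps a flip-flop name to its (hypothetical) share of
    application-level errors; cost_per_ff and budget use the same power unit.
    """
    chosen = []
    spent = 0
    for name in sorted(vulnerability, key=vulnerability.get, reverse=True):
        if spent + cost_per_ff > budget:
            break
        chosen.append(name)
        spent += cost_per_ff
    return chosen, spent


def residual_error_rate(vulnerability, chosen, reduction=1000.0):
    """Error rate after protection, relative to the unprotected design."""
    protected = set(chosen)
    total = sum(vulnerability.values())
    remaining = sum(
        share / reduction if name in protected else share
        for name, share in vulnerability.items()
    )
    return remaining / total


# Hypothetical error shares for four flip-flop groups.
shares = {"pc": 0.5, "ctrl": 0.3, "dbg": 0.15, "perf_cnt": 0.05}
chosen, spent = select_flip_flops(shares, cost_per_ff=1, budget=2)
rate = residual_error_rate(shares, chosen)
print(chosen, spent, round(rate, 4))
```

With these made-up numbers, protecting half of the flip-flops removes most of the error contribution, which is the qualitative point of selecting application-critical flip-flops rather than protecting everything.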
New Gate-Oxide ELF Signature
• Delay fluctuations over time
• Before functional failure
• Demonstrated: 45, 32nm
• 28, 22, 15nm in progress
• Enables on-line failure prediction
[Figure: ELF delay fluctuations vs. stress time, appearing before functional failure]
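Because the delay fluctuations appear before functional failure, a monitor can raise an alarm from periodic delay measurements alone. The sketch below is one possible detector under stated assumptions, not the method used in the demonstrations: the sliding-window size, the 5 ps threshold, the sample traces, and the names `DelayFluctuationMonitor` and `first_alarm` are all hypothetical.

```python
from collections import deque


class DelayFluctuationMonitor:
    """Flags a circuit when its measured delay fluctuates too much."""

    def __init__(self, window=4, threshold_ps=5.0):
        self.samples = deque(maxlen=window)  # most recent delay measurements
        self.threshold_ps = threshold_ps

    def add(self, delay_ps):
        """Record one measurement; return True if failure is predicted."""
        self.samples.append(delay_ps)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        spread = max(self.samples) - min(self.samples)
        return spread > self.threshold_ps


def first_alarm(trace):
    """Index of the first measurement that triggers an alarm, or None."""
    monitor = DelayFluctuationMonitor()
    for index, delay_ps in enumerate(trace):
        if monitor.add(delay_ps):
            return index
    return None


# Hypothetical delay traces in picoseconds.
stable = [100.0, 100.5, 99.8, 100.2, 100.1, 99.9]
fluctuating = [100.0, 100.5, 99.8, 104.0, 97.5, 106.0]
print(first_alarm(stable), first_alarm(fluctuating))
```

The detector looks at the spread of recent measurements rather than their average, since the signature described here is fluctuation over time, not a steady slowdown.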
On-line Failure Prediction
Failure Prediction          Error Detection
Before errors appear        After errors appear
+ No corruption             – Corrupt data & states
+ Low cost                  – High cost
+ Self-diagnostics          – Limited diagnostics
How? On-line self-test and diagnostics
On-Line Self-Test and Diagnostics
CASP: on-line self-test & diagnostics
High on-line test coverage
No visible system downtime
1% power, 1% area, 3% performance impact
Ultra low-cost
Uncore very important
[Figure: OpenSPARC T2 SoC running Task 1, Task 2, ..., Task N, Task N+1, Task N+2, ..., Task M]
Outline
• Introduction
• CMOS reliability limits: tolerate errors
• Beyond silicon-CMOS: imperfection-immune logic
• Conclusion
Ideal CNFET Inverter
[Figure: ideal CNFET inverter. Labels: P+ doped semiconducting CNTs, N+ doped semiconducting CNTs, gates, input, output, Vdd, Gnd, lithographic pitch, 4nm]
CNFETs: BIG Promise, BUT
• Major barriers for a decade
• Mis-positioned CNTs
• Metallic CNTs
• Processing alone inadequate
Imperfection-immune design essential
Collaborator: Prof. H.-S.P. Wong, Stanford
Mis-positioned CNTs: Incorrect Logic
Wanted: (A+C)(B+D)
Got: B+D
Wanted: A′C′ + B′D′
Got: A′C′ + B′D′ + A′D′
[Figure: CNFET gate with inputs A, B, C, D between Vdd, Out, and Gnd, showing the effect of mis-positioned CNTs]
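The wanted and obtained functions can be compared exhaustively over all 16 input combinations. The sketch below assumes that the complemented form (A′C′ + B′D′) describes the pull-up network and the uncomplemented form ((A+C)(B+D)) describes the pull-down network; that reading is an interpretation of the slide, and the function names are hypothetical.

```python
from itertools import product


def wanted_pull_up(a, b, c, d):
    return (not a and not c) or (not b and not d)


def got_pull_up(a, b, c, d):
    # The mis-positioned CNT adds the extra conducting path A'D'.
    return wanted_pull_up(a, b, c, d) or (not a and not d)


def wanted_pull_down(a, b, c, d):
    return (a or c) and (b or d)


def got_pull_down(a, b, c, d):
    # The mis-positioned CNT bypasses the (A+C) term.
    return b or d


def mismatches(wanted, got):
    """Input combinations (A, B, C, D) where the two functions differ."""
    return [
        bits
        for bits in product((0, 1), repeat=4)
        if bool(wanted(*bits)) != bool(got(*bits))
    ]


pull_up_errors = mismatches(wanted_pull_up, got_pull_up)
pull_down_errors = mismatches(wanted_pull_down, got_pull_down)
print(pull_up_errors)    # [(0, 1, 1, 0)]
print(pull_down_errors)  # [(0, 0, 0, 1), (0, 1, 0, 0), (0, 1, 0, 1)]
```

Under this reading, the faulty pull-up conducts for one input combination where it should be off, and the faulty pull-down conducts for three. In each of those cases the opposite, intended network is also conducting, so the output is driven from both Vdd and Gnd at once.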
Mis-positioned-CNT-Immune NAND
1. Grow CNTs
2. Extended gate & contacts (CRUCIAL)
3. Etch gate & CNTs
4. Dope P & N regions
Etched region: ESSENTIAL
• Graph algorithms
• All possible functions
• VLSI
• Processing & design
[Figure: NAND layout with inputs A and B, output Out, Vdd, Gnd, and the etched region, built up over the four steps]
VMR: VLSI Metallic CNT Removal
• Metallic-CNT-immune design
☺ Sufficient: all possible logic designs
☺ VLSI processing & design
First Wafer-Scale Aligned CNT Growth
Quartz wafer with catalyst: aligned CNT growth
99.5% CNTs aligned
[Figure: micrographs (2 µm scale bars) of CNTs before transfer on the quartz substrate and after transfer on the SiO2/Si substrate]
First Experimental Demonstrations
Imperfection-immune circuits: arithmetic & storage (adder sum, D-latch)
VLSI integration: wafer-scale & monolithic 3D
Multi-layer CNFET circuits: conventional via, NOT TSV
CNFET Variations Significant
CNFET Ion variations: metallic-CNT induced, grown CNT density, others
[Figure: yield (low, 0%, to high, 99%+) vs. energy penalty (0% to very high) for three options: no design change, naïve transistor upsizing, and unique layouts + co-optimized processing]
Outline
• Introduction
• CMOS reliability limits: tolerate errors
• Beyond silicon-CMOS: imperfection-immune logic
• Conclusion
Thanks to our Sponsors
Photo credits: Burn-in & test socket workshop, H. Dai, NEC, opensparc.net, Stanford