EE241 - Spring 2006: Advanced Digital Integrated Circuits
Lecture 26: Future Perspectives
Embedded Software-Based Self-Testing for Programmable System Chips
[Figure: block diagram with a low-cost tester, CPU, DSP, IP core, system memory, and on-chip memory (test program, responses, signatures), each attached through a VCI and a bus interface master or target wrapper to an on-chip bus with a bus arbiter]
Loading test program at low speed
Self-test at operational speed
Unloading response signature at low speed
Low-cost tester
High-quality at-speed test
Low test overhead
Non-intrusive
Test in normal operational mode
No violation of power consumption
More accurate speed-binning
Project Presentations Tomorrow
Tu May 9 2-5pm in 127 Dwinelle
Length of presentation: 8 + 2(N-1) minutes, where N is the number of people in the group
3 minutes per group allocated for Q&A
Presentation should contain the following
1 slide outlining problem
Couple of slides discussing proposed solution and how it differs from what other people have done
Couple of results slides
Number of slides should NOT be larger than the number of minutes
BE CONCISE and DO NOT EXCEED THE ALLOTTED TIME
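As a quick check of the timing rule above, here is a minimal sketch that evaluates 8 + 2(N-1) and the matching slide limit. The function names are made up for illustration; only the formula and the "slides should not exceed minutes" rule come from the slide.

```python
def presentation_minutes(group_size: int) -> int:
    """Talk length in minutes for a group of `group_size` people: 8 + 2(N-1)."""
    if group_size < 1:
        raise ValueError("a group needs at least one member")
    return 8 + 2 * (group_size - 1)


def max_slides(group_size: int) -> int:
    """The number of slides should not be larger than the number of minutes."""
    return presentation_minutes(group_size)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"N={n}: {presentation_minutes(n)} minutes, at most {max_slides(n)} slides")
```

The 3 minutes of Q&A per group come on top of this and are not included in the result.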
Project Reports
Due by Fr May 12th 5pm
Should be in paper format, max of 6 pages (font 10 minimum)
Title of the project, your names and e-mail addresses
Abstract (100 words)
Motivation
Problem statement
Possible solutions from literature (from midterm report)
Proposed comparison/solution. Discuss why you selected this particular one.
Conditions/assumptions of your design
Analysis: Does it work? Analytical analysis, simulation results.
Conclusion. What is this approach good for? What else could be done?
References
Take-Home Final
Two options:
Early Bird: Fr May 12, 5pm to Monday May 15, 10am
Regular Offering: Monday May 15, 5pm to Th May 18, 10am
Send e-mail to [email protected] and [email protected] if you want to subscribe to the early bird option
Submissions of answers preferred in electronic form (scans are ok). If impossible, paper version to Jessica in 558 Cory before 10am on respective days.
THIS IS A PRIVATE AND PERSONAL EXERCISE. Honesty is appreciated.
Semester Look Back
What we did not cover …
Interconnect
Key challenge
Interesting areas: high-speed serial interconnect, alternative interconnect, networks on a chip
Arithmetic
Trading off performance and power
Please send feedback on topics you would like to see covered in less or more detail
EE241, S06
The Silicon Age: A Closer Look
[Figure: complexity versus time, 1950 to 2000 and beyond (“???”). Design styles: Discrete, Custom, ASIC, IP/SoC. Structure labels: Unstructured, Some structure, Structured. Milestone: 1 Billion Transistors.]
The Silicon Age Still on a Roll, But …
High Volume Manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology Node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
Delay = CV/I scaling: 0.7, ~0.7, >0.7; delay scaling will slow down
Energy/Logic Op scaling: >0.35, >0.5, >0.5; energy scaling will slow down
Bulk Planar CMOS: High Probability (early nodes) to Low Probability (late nodes)
Alternate, 3G etc: Low Probability (early nodes) to High Probability (late nodes)
Variability: Medium, High, Very High
ILD (K): ~3, <3; reduce slowly towards 2-2.5
RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
Metal Layers: 6-7, 7-8, 8-9; 0.5 to 1 layer per generation
Some Major Hurdles on The Way!
2003 ITRS Roadmap
The Challenges of the Next Decade(s)
• The Physics and Manufacturing Challenges
– A whole slew of static and dynamic variations and error mechanisms
• The Design Introduction Challenge
– Complexity, risk, time, cost
• The n-furcation of the Market
Design at the End Of the Roadmap
An era of abundance, self-healing, resiliency, and beating the odds.
A Roadmap for Late-Silicon Age Design
[Figure: complexity versus time (2000, 2005, 2010, Beyond, The far beyond), with successive design themes: Fully structured and regular fabrics, Concurrency and Flexibility, Self-Healing, Error-resiliency, Embracing Randomness]
A Roadmap for Late-Silicon Age Design
• Regularity and Structure
• Concurrency and Flexibility
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Increasing necessity
Absolutely required for manufacturability
Driven by photo-lithography and eventually self-assembly constraints
Also for variability, reliability, and time-to-market
Regular implementation fabrics
Regular Fabrics – A Plethora of Choices
FPGA
VPGA (CMU)
River PLA (Berkeley)
Structured ASIC (e.g. LSI RapidChip)
Trade-off between area, performance, power and time-to-market (factors 5 to 10)
Regular Fabrics - Example
CMU Regular Logic Bricks
Standard-cell library with fewer (~10), coarser, configurable (w/ vias), micro-regular brick layouts…
…that exhibit macro-regularity when assembled at chip-level
[Figure: 2-D FFT plots of poly-Si patterns, comparing ASIC “spatial” regularity with brick “spatial” regularity]
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]
A Roadmap for Late-Silicon Age Design
• Regularity and Structure
• Concurrency and Flexibility
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Immediate future
Concurrency and heterogeneity:
• Driven by power density concerns
• Alternative to higher clock frequencies
Flexibility:
• Higher re-use, shorter time-to-market, in-field adaptation and upgrade
The Age of Concurrency and Flexibility
Heterogeneous concurrency now prevalent in wireless, automotive, consumer, media processing, graphics and gaming
[Figure: example chips: AMD Dual Core Microprocessor, Intel Dual Core, IBM/Sony Cell Processor, NTT Video codec with 4 Tensilica cores, Xilinx Virtex-4, Berkeley Pleiades (ARM with heterogeneous reconfigurable fabric)]
Are We Ready for 1000 CPUs per Chip?
• Compilers, operating systems, architectures are definitely not!
• How to do research on 1000-CPU systems in compilers, OS, architecture “in parallel”?
Berkeley BEE-II: 2 TOPS system prototyping environment
30-40 TOPS (2 TFlops) Rack
One Solution: RAMP, A Framework for Multi-Core System Development
• Create 1000-CPU system from ~40 FPGAs
• Distribute out-of-the-box Massively Parallel Processor that runs standard binaries of OS and application to all major research institutes in the US
• Provides uniform framework for architecture development and exploration
Core Team: D. Patterson, J. Wawrzynek, J. Rabaey (UCB), James Hoe (CMU), Christos Kozyrakis (Stanford), Krste Asanovic (MIT), M. Oskin (Washington), D. Chiou (Texas), W. Hwu (Illinois), S. Lu (Intel)
A Roadmap for Late-Silicon Age Design
• Regularity and Structure
• Concurrency and Flexibility
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Later this decade
Self-Healing Architectures
• On-chip test and diagnostics used to correct for variations and stress
• Static and dynamic
Variations Becoming Pronounced
[Figure: lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) and technology generation (180nm, 130nm, 90nm, 65nm, 45nm, 32nm) on a log scale, 1980 to 2020, with the gap between them marked]
[Figure: on-die temperature map, 40 to 110 °C]
[Figure: normalized frequency (0.9 to 1.4) versus normalized leakage Isb (1 to 5) at 130nm, annotated 30% and 5X]
Design becoming “statistical”
• makes verification substantially harder
• challenging synchronization strategies
• “error-free” design untenable
Courtesy: Shekhar Borkar, Intel
Self-Healing
• Introduce sensors that monitor key aspects of system
– Manufacturing and environmental conditions
Process variations, temperature, voltage, activity, etc
– Key properties that accelerate failure mechanisms
• Employ system-level intelligent control to reduce stress
– Temperature control via resource assignment
– Active management of voltage-reliability trade-offs
• Utilize tuning and healing to alleviate reliability threats
– NBTI reversal
– In-field clock tuning
Courtesy: T. Austin
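The sensor-plus-control idea on this slide can be sketched as a tiny monitoring loop. This is only an illustration under stated assumptions: the `Core` record, the 90 °C limit, the 50 mV step, and the 0.8 V floor are all hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    temperature_c: float  # hypothetical on-chip temperature sensor reading
    vdd: float            # current supply voltage in volts


def pick_core(cores):
    """Temperature control via resource assignment: run the next task on the coolest core."""
    return min(cores, key=lambda c: c.temperature_c)


def manage_voltage(core, t_limit_c=90.0, vdd_step=0.05, vdd_min=0.8):
    """Lower Vdd by one step when a core exceeds the limit, never below vdd_min."""
    if core.temperature_c > t_limit_c:
        core.vdd = max(vdd_min, round(core.vdd - vdd_step, 3))
    return core.vdd


if __name__ == "__main__":
    cores = [Core("core0", 95.0, 1.2), Core("core1", 70.0, 1.2)]
    print("next task goes to", pick_core(cores).name)
    for c in cores:
        print(c.name, "Vdd =", manage_voltage(c))
```

A real controller would also weigh activity, aging indicators, and performance targets; the sketch only shows how sensor readings feed the two control actions named on the slide.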
Test Moving On-Line
• On-chip resources used to minimize test cost
• Also available for dynamic re-evaluation and adaptation
[Figure: on-line test setup with a low-cost tester, bus interface master wrapper, VCI, on-chip bus, CPU, and on-chip memory holding a diagnostic test program and response map; the result is a logic failure map (bitmap). Also shown: on-chip noise samplers and an on-chip leakage sensor.]
Adaptive Biasing Using On-Line Test
[Figure: energy-performance trade-off. Eswitching (fJ), 5 to 50, versus path delay (ps), 1.0E+03 to 1.0E+07, with curves “Worst Case, w/o Vth tuning” and “Nominal, w/ Vth tuning”; annotation: Adaptive Tuning]
[Figure: module with an attached test module exchanging test inputs and responses; controlled parameters: Vdd, Vbb, Tclock]
Dynamically adjust supply and threshold design parameters to center the design in the presence of process variations!
Courtesy: K. Cao, Berkeley
[Figure annotation: 10x]
Easier Again in Regular Fabrics
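The tuning loop behind adaptive biasing can be sketched as follows. This is a toy model, not the method from the slide: the alpha-power-law delay expression, the linear body-effect coefficient `gamma`, and all numeric values are assumptions chosen for illustration.

```python
def path_delay(vdd, vth, k=1.0, alpha=1.3):
    """Alpha-power-law delay model in arbitrary units: delay ~ Vdd / (Vdd - Vth)^alpha."""
    return k * vdd / (vdd - vth) ** alpha


def tune_body_bias(vdd, vth0, t_clock, gamma=0.1, step=0.05, vbb_max=0.5):
    """Raise forward body bias until the (modeled) measured delay fits in t_clock.

    vth = vth0 - gamma * vbb is an assumed linear body-effect model.
    Returns (vbb, delay), or None if the target cannot be met within vbb_max.
    """
    steps = int(round(vbb_max / step))
    for i in range(steps + 1):
        vbb = i * step
        delay = path_delay(vdd, vth0 - gamma * vbb)
        if delay <= t_clock:
            return vbb, delay
    return None


if __name__ == "__main__":
    # A nominal die needs no bias; a slow die (higher Vth) is pulled back to target.
    print("nominal:", tune_body_bias(vdd=1.0, vth0=0.30, t_clock=1.6))
    print("slow   :", tune_body_bias(vdd=1.0, vth0=0.35, t_clock=1.6))
```

On silicon the delay would come from the on-line test module rather than a formula, and Vdd could be adjusted in the same loop.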
Adaptive (Body) Biasing Impact
Courtesy: P. Gelsinger and S. Borkar, Intel (DAC04)
[Figure: 4.5 mm × 5.3 mm die with multiple subsites]
Dynamic Resource Allocation
In the Multi-Processor Space
Compiler combines load assignment with DVS
mdl group at PSU
[Figure: normalized energy (40 to 100) versus number of processors (2, 4, 8, 16, 32) for 3D, DFE, LU, SPLAT, MGRID, WAVE5]
More savings with more processors!
In the Interconnect Space
Use routing throttling to perform thermal management
ThermalHerd (L.S. Peh, Princeton)
Rejuvenation
Negative Bias Temperature Instability
Source: D. Blaauw, UMich
A Roadmap for Late-Silicon Age Design
• Regularity and Structure
• Concurrency, Heterogeneity and Flexibility
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Beyond 2010
Redundancy Galore
The only way to provide true error-resiliency!
With billions of transistors, overhead factors of 2 to 3 are reasonable if leading to 100% yield, supreme performance, or new applications.
Error-Resilient Systems
Incorporate facilities to push through system faults
• Error detection technologies
– Systems checkers, online testing, continuous functional verification
• Fault diagnosis
– Fine-grained testing, online testing
• System state recovery
– Microarchitectural checkpointing, algorithmic tolerance
• Physical repair
– Sparing, TMR
Courtesy: T. Austin
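TMR (triple modular redundancy), listed above under physical repair, runs three replicas and takes a majority vote. A minimal sketch follows; the fault-injection masks and the function names are hypothetical, added only so the behavior can be exercised.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(func, x, faults=(0, 0, 0)):
    """Run three replicas of func on x, XOR each output with its fault mask, then vote."""
    outputs = [func(x) ^ mask for mask in faults]
    return majority_vote(*outputs)


if __name__ == "__main__":
    def increment(v):
        return v + 1

    print(tmr(increment, 41))                            # fault-free
    print(tmr(increment, 41, faults=(0, 0b100, 0)))      # one faulty replica is outvoted
    print(tmr(increment, 41, faults=(0b1, 0b1, 0)))      # two identical faults defeat the vote
```

The last call shows the limit of the scheme: TMR masks any single-replica fault, but not the same fault in two replicas.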
A Gradual Introduction Process
Example: Aggressive Deployment using “Razor”
A “pseudo-synchronous” approach to address process variations and power minimization with minimal overhead by combining circuit and architectural techniques
[Figure: “razored pipeline” with IF, ID, EX, MEM (read-only), and WB (reg/mem) stages separated by Razor FFs, a stabilizer FF, the PC, per-stage error, bubble, recover, and flushID signals, and a flush control unit]
[Figure: Razor flip-flop: main FF (D, Q, clk), shadow latch clocked by clk_del, and an error comparator producing Error_L]
[Figure: energy versus supply voltage, showing processor energy, recovery energy, and total energy, with the optimal voltage marked]
Courtesy: T. Austin, D. Blaauw, Michigan
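The Razor flip-flop in the figure (main FF, shadow latch on a delayed clock, error comparator) can be modeled behaviorally as below. This is a sketch under assumptions: the timing is abstracted to a single arrival time per cycle, the one-cycle recovery penalty is illustrative, and the function name and arguments are hypothetical.

```python
def razor_sample(new_value, stale_value, arrival, period, shadow_delay):
    """Model one Razor flip-flop for one cycle.

    The main FF samples at `period`; the shadow latch samples at
    `period + shadow_delay`. Returns (committed_value, error, penalty_cycles).
    """
    if arrival > period + shadow_delay:
        raise ValueError("arrival beyond the shadow-latch window: error would go undetected")
    # Late data means the main FF still holds the previous (stale) value.
    main_ff = new_value if arrival <= period else stale_value
    shadow = new_value              # the delayed clock always sees the settled data
    error = main_ff != shadow       # error comparator
    committed = shadow if error else main_ff
    return committed, error, (1 if error else 0)


if __name__ == "__main__":
    print(razor_sample(1, 0, arrival=0.9, period=1.0, shadow_delay=0.3))  # on time
    print(razor_sample(1, 0, arrival=1.2, period=1.0, shadow_delay=0.3))  # late, recovered
```

Lowering the supply voltage makes late arrivals more frequent, which trades processor energy against recovery energy as in the energy-versus-voltage plot.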
“Aggressive” Deployment At the Algorithm Level
[Figure: input x[n] feeds the Main Block, producing y_a[n], and the Estimator, producing y_e[n]; a comparison | y_a[n] − y_e[n] | > Th selects the final output ŷ[n]]
[Figure: power versus voltage, both normalized to 1.0, showing Pmain, PEC, and PTOT, with the energy savings marked]
Voltage overscale Main Block.
Correct errors using Estimator.
Power savings ≥ 3X!
Courtesy: N. Shanbhag, Illinois
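The correction step on this slide, where a low-cost estimator catches large errors from the voltage-overscaled main block, can be sketched per sample as below. The names `y_a`, `y_e`, and `threshold` and the sample values are assumptions for illustration; the estimator itself is not modeled.

```python
def ant_correct(y_a, y_e, threshold):
    """Per-sample decision: keep the main-block output unless it deviates
    from the estimator output by more than the threshold."""
    return [ye if abs(ya - ye) > threshold else ya for ya, ye in zip(y_a, y_e)]


if __name__ == "__main__":
    main_out = [10, 12, 140, 11]   # third sample has a large timing-induced error
    estimate = [9, 13, 12, 10]     # coarse but error-free estimate
    print(ant_correct(main_out, estimate, threshold=8))
```

Small estimator inaccuracies pass through untouched because the accurate main-block value is kept whenever the two agree to within the threshold.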
Moving the Verification on the Chip
Self-checking processor
• Core function validated by checker
• Checker relaxes burden of correctness on core processor
• Core does the heavy lifting, removes hazards that could slow the simple checker
[Figure: core pipeline (IF, ID, REN, REG, SCHEDULER, EX/MEM) passes speculative instructions, in-order with PC, inst, inputs, addr, to the checker (CHK, CT); the core provides performance, the checker provides correctness]
[Figure: Alpha 21264 core, 205 mm², next to the REMORA checker, 12 mm²]
Courtesy: Todd Austin, Univ. of Michigan
“On-Line X”
(X = Verification, Test, Tuning, Reliability, Resource, Power and Leakage Management)
From Design Time to Run Time Yield Improvement!
“Turning lemons into lemonade” (T. Austin)
Coordinated Forward Error Recovery
Runtime Validation of Multithreaded Processors
[Figure: SMT processor with register file and memory; per-thread retired instructions are dispatched to DIVA checker processors and to runtime monitoring hardware (context status register, hardware synchronization unit)]
Correctness Properties of Multithreaded Execution
Inter-thread Communication
Inter-thread Synchronization
Intra-thread Data Flow
Intra-thread Control Flow
[Figure: bar chart, values 0.99 to 1.06, for FFT, LU, CHOLESKY, BARNES, FMM, WATER-NSQUARED, WATER-SPATIAL; series: Runtime Validation Configuration, Fault Rate = 1/1K, Fault Rate = 1/1M]
Towards malleable, resilient architectures
The Quest: Scalable (hard and soft) architectures that provide flexible redundancy to accommodate systematic and random, static and dynamic errors while avoiding brittleness!
A Roadmap for Late-Silicon Age Design
• Regularity and Structure
• Concurrency and Flexibility
• Self-Healing
• Error-Resiliency
• Embracing Randomness
The Far Beyond
Maintaining a purely deterministic Boolean abstraction ultimately becomes untenable!
Maintaining our abstractions == slowly abandoning them!
The Search for (New) Scalable and Stackable Abstractions
Allow devices to make errors and use models-of-computation that tolerate them
(signal processing, communication, coding, information theory)
An Interesting Case Study: The “Neural Network” MOC
Properties:
• Works well on noisy signals
• Uses “soft” decisions
• Operates in the presence of failures of components and interconnections
Challenge: Limited scope
Works mostly for classification problems
[Figure: artificial neuron]
Example: Collaborative Networks
• Large number of states/nodes
• Bi-directional, non-linear, non-deterministic links
• Local coupling with globally emergent behavior
• Inherently redundant and resilient to failure
Sensor Network-on-a-chip
Source: N. Shanbhag
Distributed Collaborative Systems on a Chip
Example: A configurable radio architecture based on collaborative autonomous entities
Source: J. Roychowdhury, J. Rabaey
Array of locally-coupled, cheap, low-power oscillator-based units
• Known to exhibit complex, spontaneous pattern formation
• Operation mode selected through choice of coupling factors and operational nodes
[Figure: emerging pattern as a function of coupling factor]
The Mechanical Radio
The Ultimate ULP Tunable Wireless Transceiver?
[Figure: wine-glass disk resonator, R = 32 μm, with anchor, support beams, coupling beam, input electrode, and output electrode]
GSM-compliant oscillator based on 9 wine-glass disks
Source: C. Nguyen, U. Michigan
Transitioning to the Post-Silicon Age
Implementation platforms that work under very low SNR, are non-deterministic, unpredictable and unreliable…
Molecular
Organic
NanoOptics
Nanotube
Some Concluding Remarks
Formidable challenges over the next decades to dramatically alter design paradigms
Variability and reliability to lead to novel micro-architectures and computational models
Regularity and redundancy central tenets
The opportunities:
Use the abundance of transistors to move the burden from pre- or post-manufacturing evaluation to on-line activities
Gradual incorporation of error-resilient computational models