Date posted: 25-Dec-2015
Process Variation in Near-threshold Wide SIMD Architectures
Sangwon Seo¹, Ronald G. Dreslinski¹, Mark Woh¹, Yongjun Park¹, Chaitali Chakrabarti², Scott Mahlke¹, David Blaauw¹, Trevor Mudge¹
¹University of Michigan, ²Arizona State University
Near Threshold Computing
Super Threshold: high performance, high energy consumption
Near Threshold: 10x energy reduction, 10x performance degradation
Sub Threshold: exponentially decreasing performance, increasing leakage becomes dominant
Near-threshold Computing
Advantage: High energy efficiency
Disadvantage: Low performance throughput
Compensated with very wide SIMD architecture
Sensitive to variations in threshold voltage
More critical issues in wide SIMD architectures: increased probability of timing errors, expensive error recovery mechanisms
How bad is the delay variation in wide SIMD architectures running at near-threshold voltages?
How to mitigate the variation-induced timing errors?
Delay Variations – f(Vdd=0.55V, N)
A long chain helps, but the effect diminishes as N increases.
Variations are exacerbated with technology scaling.
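The diminishing benefit of a longer chain can be illustrated with a small Monte Carlo sketch. It assumes each gate delay is an independent normal variable with mean 1 and a hypothetical relative sigma of 0.2, neither of which comes from the slides. Under that assumption the relative spread of a chain of N gates falls as 1/sqrt(N), so most of the averaging gain is already spent at small N. Systematic (correlated) variation would not average out this way, and the function names are illustrative only.

```python
import math
import random
import statistics


def chain_delay_spread(n_gates, gate_sigma=0.2, trials=4000, seed=0):
    """Monte Carlo estimate of sigma/mu for a chain of n_gates gate delays.

    Assumption: gate delays are independent, normal, mean 1.0, std gate_sigma.
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.gauss(1.0, gate_sigma) for _ in range(n_gates))
        for _ in range(trials)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


def analytic_spread(n_gates, gate_sigma=0.2):
    """Closed form for the same model: sigma/mu = gate_sigma / sqrt(N)."""
    return gate_sigma / math.sqrt(n_gates)


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(n, round(chain_delay_spread(n), 4), round(analytic_spread(n), 4))
```

In this model, going from N=1 to N=50 removes far more relative spread than going from N=50 to N=100, which matches the qualitative statement on the slide.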
Delay Variations – f(Vdd, N=50)
LER (line edge roughness) causes high variations in advanced technology nodes
Strict Design Rules
Metal-Gates w/ high-k material or SOI
Advanced lithography
Delay Distribution – 90nm GP
1 critical path delay = delay of a chain of 50 FO4 inverters.
1-wide system delay = max(delays of 100 critical paths)
128-wide system delay = max(delays of 128 1-wide systems)
Performance Drop
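The nested max definitions above can be sketched as a small simulation. The 50-gate chain, 100 paths per lane and 128 lanes are taken from the slide; the independent normal gate delays and the relative sigma of 0.1 are assumptions made only for illustration, and the function names are hypothetical. The sketch shows why the expected system delay grows with SIMD width: the clock must accommodate the slowest of many more paths.

```python
import math
import random
import statistics

N_GATES = 50          # FO4 inverters per critical path (from the slide)
PATHS_PER_LANE = 100  # critical paths per 1-wide system (from the slide)


def path_delay(rng, gate_sigma):
    # Assumption: the sum of 50 independent normal gate delays (mean 1.0)
    # is itself normal, so it is sampled directly.
    return rng.gauss(N_GATES, gate_sigma * math.sqrt(N_GATES))


def lane_delay(rng, gate_sigma):
    # 1-wide system delay = slowest of its critical paths.
    return max(path_delay(rng, gate_sigma) for _ in range(PATHS_PER_LANE))


def system_delay(rng, width, gate_sigma):
    # width-wide system delay = slowest of its lanes.
    return max(lane_delay(rng, gate_sigma) for _ in range(width))


def mean_system_delay(width, gate_sigma=0.1, trials=50, seed=0):
    rng = random.Random(seed)
    samples = [system_delay(rng, width, gate_sigma) for _ in range(trials)]
    return statistics.fmean(samples)


if __name__ == "__main__":
    for width in (1, 128):
        print(width, round(mean_system_delay(width), 3))
```

The gap between the two printed means corresponds to the performance drop annotated on the slide, although the magnitude here depends entirely on the assumed sigma.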
Variation Effects on 128-wide SIMD Architecture
- Structural Duplication
- Voltage margining
- Frequency margining
Structural Duplication
[Figure: SIMD function units #0 to #9 connected through a crossbar to datapaths #0 to #7]
8-wide+2-spare system
Increase number of processing resources
Use the spares if required.
Structural Duplication – 90nm GP
6 spares are required to match the chip delay of baseline architecture.
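A minimal sketch of how spares reduce chip delay: if a chip is built with extra lanes and only the fastest ones are enabled, the chip delay becomes an order statistic of the lane delays rather than their maximum. The lane delay distribution below (normal, mean 1.0, sigma 0.05) and the function names are assumptions for illustration; the sketch does not reproduce the 90nm result quoted above.

```python
import random
import statistics


def delay_with_spares(lane_delays, width):
    """Chip delay when only the `width` fastest physical lanes are enabled."""
    if len(lane_delays) < width:
        raise ValueError("fewer physical lanes than the required width")
    return sorted(lane_delays)[width - 1]


def mean_chip_delay(width, spares, lane_sigma=0.05, trials=200, seed=0):
    """Average chip delay over many simulated chips.

    Assumption: lane delays are independent normal(1.0, lane_sigma).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        lanes = [rng.gauss(1.0, lane_sigma) for _ in range(width + spares)]
        samples.append(delay_with_spares(lanes, width))
    return statistics.fmean(samples)


if __name__ == "__main__":
    for spares in (0, 2, 6):
        print(spares, round(mean_chip_delay(128, spares), 4))
```

Sweeping `spares` until the mean delay falls to a target value is one way to estimate how many spares a given width needs under a chosen variation model.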
Frequency Margining
Increase clock period
Applicable for applications with relaxed time constraints
For advanced technology nodes, this is impractical
Caveat: consider its impact on the whole system
SIMD subsystem clock period (Tclk@NTV)
memory subsystem clock period (Tclk@FV)
Combination of two schemes – 45nm GP
128-wide system @ 0.6V
26 spares
17mV boost
5mV + 8 spares
10mV + 2 spares
Conclusions
Near-threshold operation of wide SIMD systems can have timing problems due to process variations.
Variation effects on a 128-wide SIMD architecture are marginal for the 90nm technology node, but could be non-negligible for current/future technology nodes.
A combination of structural duplication and voltage margining provides a minimal power overhead solution to mitigate variation-induced timing problems in wide SIMD architectures.
Local Spares vs. Global Spares
Local Sparing, 1 out of 4 (2 spares)
+ small overhead
- burst errors
Global Sparing (2 spares)
+ burst errors
- large overhead
Local Spares vs. Global Spares
Global sparing is better than local sparing.
XRAM crossbar supports global sparing.
128 + 8 global spares
128 + 32 local spares (1 out of 4)
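The difference between the two sparing schemes can be sketched as follows. Global sparing is modeled as enabling the fastest `width` lanes out of all physical lanes; local sparing is modeled as groups of 4 lanes plus 1 spare, where each group can drop only its own slowest lane. The lane delay distribution (independent normal, mean 1.0, sigma 0.05) and all function names are assumptions for illustration, not the authors' evaluation setup.

```python
import random
import statistics


def global_delay(lanes, width):
    """Global sparing: any spare can replace any lane, keep the fastest `width`."""
    return sorted(lanes)[width - 1]


def local_delay(lanes, group_size=4):
    """Local sparing: consecutive groups of group_size lanes + 1 spare.

    Each group keeps its fastest group_size lanes; the chip runs at the
    speed of the slowest group.
    """
    step = group_size + 1
    groups = [lanes[i:i + step] for i in range(0, len(lanes), step)]
    return max(sorted(g)[group_size - 1] for g in groups)


def compare(width=128, global_spares=8, group_size=4,
            lane_sigma=0.05, trials=400, seed=0):
    """Mean chip delay for (local, global) sparing under iid normal lanes."""
    rng = random.Random(seed)
    n_local = width + width // group_size
    n_global = width + global_spares
    local, glob = [], []
    for _ in range(trials):
        local_lanes = [rng.gauss(1.0, lane_sigma) for _ in range(n_local)]
        global_lanes = [rng.gauss(1.0, lane_sigma) for _ in range(n_global)]
        local.append(local_delay(local_lanes, group_size))
        glob.append(global_delay(global_lanes, width))
    return statistics.fmean(local), statistics.fmean(glob)


if __name__ == "__main__":
    # Burst case: two slow lanes land in the same local group.
    burst = [9, 9, 1, 1, 1, 1, 1, 1, 1, 1]
    print(local_delay(burst), global_delay(burst, 8))
    print(compare())
```

In the burst case, local sparing can drop only one of the two slow lanes in the first group, so the chip delay stays at 9, while global sparing drops both and the delay is 1. This is the burst-error weakness listed for local sparing.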