Y. Shakhsheer, S. Khanna, K. Craig,
S. Arrabi, J. Lach, and B. H. Calhoun
University of Virginia, Charlottesville, VA
Custom Integrated Circuits Conference
September 21, 2011
Motivation
Portable applications require extended battery life & small form factor
Electronics need high performance for a fraction of their life
High performance struggles with power density
Classic Dynamic Voltage Scaling (DVS)
Adjust VDD to adjust speed to match workload
Usually chip or core level
Uses DC-DC converters to adjust VDD, so VDD transitions are slow and infrequent
Constrained by block with highest workload
Need for finer granularity in space and time
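The cost of a chip-level VDD that is set by the busiest block can be illustrated with a small sketch. It assumes dynamic energy per operation scales with VDD squared and ignores leakage; the rail table, the workload values, and all function names are hypothetical and are not taken from the chip.

```python
# Hypothetical rails: (VDD in volts, normalized peak throughput at that VDD).
# These numbers are illustrative only and are not taken from the chip.
RAILS = [(0.6, 0.25), (0.9, 0.6), (1.2, 1.0)]
VDD_MAX = 1.2


def lowest_rail(workload):
    """Return the lowest VDD whose normalized throughput covers the workload."""
    for vdd, rate in RAILS:  # RAILS is sorted by ascending VDD
        if rate >= workload:
            return vdd
    raise ValueError("workload exceeds peak throughput")


def energy_per_op(vdd):
    """Dynamic energy per operation relative to VDD_MAX (assumes E ~ C * VDD^2)."""
    return (vdd / VDD_MAX) ** 2


def chip_level_energy(workloads):
    """One shared VDD, set by the block with the highest workload."""
    vdd = lowest_rail(max(workloads))
    return sum(w * energy_per_op(vdd) for w in workloads) / sum(workloads)


def per_block_energy(workloads):
    """Each block runs at the lowest rail that meets its own workload."""
    total = sum(w * energy_per_op(lowest_rail(w)) for w in workloads)
    return total / sum(workloads)


workloads = [1.0, 0.2, 0.5]  # hypothetical normalized workloads of three blocks
print(chip_level_energy(workloads))
print(per_block_energy(workloads))
```

With these assumed numbers, the chip-level scheme spends full energy on every operation because one block needs the top rail, while per-block selection lets the lightly loaded blocks run at lower rails.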
Panoptic DVS (PDVS) Structure
Single-VDD: All components share one VDD
Multi-VDD: Each component statically tied to a VDD
PDVS: PMOS header switches used to select a specific VDD
Small set of voltage rails (2-4)
Uses common components
Fine temporal granularity
Fine spatial granularity
[1] Putic, M. et al., ICCD, pp. 491-497, 01/10/2009.
[2] Di, L. et al., ICCD, pp. 605-611, 08/2008.
Contributions
This work:
Demonstrates DSP Processor using PDVS
Demonstrates single clock cycle VDD-switching & VDD dithering for near optimal energy scalability
Demonstrates efficient switching between high performance DVS and subthreshold modes.
Demonstrates energy savings compared to single-VDD and multi-VDD alternatives.
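VDD dithering, as named in the contributions above, can be sketched as splitting a block's operations between two rails so that the average rate just meets the workload. This is a minimal illustration under assumed numbers: the rates, the per-operation energies, and the function names are hypothetical, and switching overhead is ignored.

```python
def dither_fraction(rate_needed, rate_lo, rate_hi):
    """Fraction of operations to run on the high rail.

    Average time per operation must equal 1 / rate_needed:
        x / rate_hi + (1 - x) / rate_lo = 1 / rate_needed
    """
    if not rate_lo <= rate_needed <= rate_hi:
        raise ValueError("rate_needed must lie between rate_lo and rate_hi")
    if rate_lo == rate_hi:
        return 0.0
    return (1 / rate_lo - 1 / rate_needed) / (1 / rate_lo - 1 / rate_hi)


def dithered_energy(x, e_lo, e_hi):
    """Average energy per operation when a fraction x runs on the high rail."""
    return x * e_hi + (1 - x) * e_lo


# Hypothetical operating points (normalized): low rail runs at half speed
# for a quarter of the energy per operation.
x = dither_fraction(rate_needed=0.75, rate_lo=0.5, rate_hi=1.0)
print(x, dithered_energy(x, e_lo=0.25, e_hi=1.0))
```

Without dithering, a workload of 0.75 in this example would have to run entirely on the high rail at a normalized energy of 1.0 per operation; mixing rails lowers the average.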
Outline
Chip Architecture
Measured Results
Subthreshold Mode
Conclusion
Chip Architecture
32b Data Flow Processor
• 4 Kogge-Stone Adders
• 4 Baugh-Wooley Multipliers
• PMOS Header
• Level Converters
Execute arbitrary data flow graphs (DFGs)
32 Kb Data Memory
40 Kb Instruction Memory
Register File
[Block diagram: control, 32 kb data memory and 40 kb instruction memory; 32b register bank (x8 general purpose, x15 coefficients) and crossbar feeding four data paths (PDVS, Multi-VDD, Single-VDD, Sub-VT PDVS); adders (x4) and multipliers (x4) with VDDH/VDDM/VDDL headers and level converters (LC).]
Fig. 2. Block diagram of the PDVS data flow processor. SRAMs and control
serve four data paths for direct comparison of PDVS with SVDD & MVDD.
2-stage Pipelined SRAM
During the first phase of a conventional SRAM read access when row decode occurs, the SA lies idle
In our scheme, row decode and BL droop development are completed in the first cycle
At the beginning of the second cycle, the SA enable signal is asserted
Col Mux acts as pipeline register
Helps lower the cycle time by TSA
23% reduction of the cycle time
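The cycle-time argument above can be written out as a short calculation. The sketch assumes TDEC is the row-decode time, TBC the bit-line development time, and TSA the sense-amplifier time, and that pipeline overhead is negligible; the normalized delay values are hypothetical and were chosen only so that the result matches the 23% figure on the slide.

```python
def conventional_cycle(t_dec, t_bc, t_sa):
    """Decode, bit-line development and sensing all fit in one clock cycle."""
    return t_dec + t_bc + t_sa


def pipelined_cycle(t_dec, t_bc, t_sa):
    """Stage 1: decode + bit-line development. Stage 2: sensing.

    The clock period is set by the slower of the two stages.
    """
    return max(t_dec + t_bc, t_sa)


def cycle_reduction(t_dec, t_bc, t_sa):
    """Fractional cycle-time reduction from pipelining the read."""
    conv = conventional_cycle(t_dec, t_bc, t_sa)
    return 1 - pipelined_cycle(t_dec, t_bc, t_sa) / conv


# Hypothetical normalized delays (not measured values).
print(cycle_reduction(t_dec=0.40, t_bc=0.37, t_sa=0.23))
```

As long as sensing is the shorter stage, the cycle shrinks by exactly TSA, which is the statement made in the bullet list.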
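[Timing diagrams: conventional and 2-stage pipelined SRAM read, showing CLK, WL, BL, SAE and SA output with the TDEC, TBC and TSA intervals across cycles C1 and C2; array diagram with WL drivers, column mux and SAs.]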
Chip Vitals
[Die photo, 4.3mm x 3.3mm: PDVS, MVDD, Sub-VT and SVDD data paths; instruction memory; data memory; VCO & instruction block; multiplier and adder with their headers.]
Feature            This Chip
Process            90nm CMOS Bulk w/ Dual VT
Area               4.3mm x 3.3mm
Transistor Count   ~2 million
VDD                250mV – 1.2V
SRAMs              40kb & 32kb
Dithering
System Results
[Plot: normalized energy vs. workload for SVDD, MVDD and PDVS, annotated with measured energy savings.]
System Results Cont’d
[Plot: normalized energy vs. workload for SVDD, MVDD and PDVS, annotated with measured energy savings.]
System Results Cont’d
Introducing slack increases savings
[Plot: normalized energy vs. workload for SVDD, MVDD and PDVS, annotated with measured energy savings and area savings.]
Area Savings vs MVDD
PDVS area savings are a result of reducing the number of copies
[Plot: normalized energy vs. workload for SVDD, MVDD and PDVS, annotated with measured energy savings and area savings.]
Measured Dithering
[Figure: measured dithering waveform; horizontal axis: time.]
Overheads
                         32b Adder   32b Mult.
Header Area              2.4%        1.7%
Level Conv. Delay        32.0%       2.0%
Level Conv. Energy       8.0%        0.3%
Level Conv. Area         11.4%       2.1%
Sw. Delay                10.4%       12.0%
Sw. Energy               215.3%      35.0%
Breakeven Cycles (NBE)   < 4         < 1
$N_{BE} = E_{switch} / (E_{High} - E_{Low})$
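The breakeven count can be checked numerically, reading NBE as the switching energy divided by the energy saved per cycle at the lower rail, N_BE = E_switch / (E_High - E_Low). The sketch below assumes the Sw. Energy percentages in the table are normalized to the energy of one operation at VDDH; the low-rail energy of 0.4 is a hypothetical value used only for illustration.

```python
def breakeven_cycles(e_switch, e_high, e_low):
    """Cycles at the low rail needed to recover the energy spent on switching."""
    if e_low >= e_high:
        raise ValueError("the low rail must use less energy per cycle")
    return e_switch / (e_high - e_low)


E_HIGH = 1.0  # one operation at VDDH (normalization assumed here)
E_LOW = 0.4   # hypothetical per-operation energy at the lower rail

# Switching energies from the overhead table: 215.3% (adder), 35.0% (multiplier).
print(breakeven_cycles(2.153, E_HIGH, E_LOW))  # adder
print(breakeven_cycles(0.350, E_HIGH, E_LOW))  # multiplier
```

Under these assumptions the adder breaks even in under four cycles and the multiplier in under one, in line with the bounds listed in the table.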
[Plots: multiplier level converter overhead and multiplier switching overhead, each relative to 1 MULT @ VDDH, vs. VDDL (V).]
Subthreshold Operation
Chip provides subthreshold mode
E.g. VDDH/M/L @ 1V, 0.5V, 0.25V
Minimum energy operating point
[Plot: virtual VDD (V), 0.2 to 1.2, vs. time.]
Subthreshold Circuit Optimizations
Added PMOS headers to:
• Register bank
• Crossbar
Tied circuit & Sub-VT header body connection to Virtual VDD
Optimized level converter for subthreshold
• Converts from 0.25V up to 1.0V
Added bypass for these level converters
[Circuit diagrams: level converter with bypass (In, Out, Bypass; high-VT devices; VSUBVT and VDDH supplies), and Sub-VT data path components: 32b register bank and crossbar with VDDH/VDDM/VSUBVT headers.]
Conclusion
First processor using PDVS
Demonstrates single clock cycle VDD-switching & VDD dithering for near optimal energy scalability
Switches efficiently between high performance DVS and subthreshold modes.
Energy savings up to 50% and 46% compared to single-VDD and multi-VDD alternatives.
Feature               Howard ISSCC10     Truong VLSI08   Nam ISSCC07        This work
VDD granularity       6 cores            1 core          1 core             Add, Mult
Speed of VDD change   >10µs (e.g. [4])   2-5ns           >10µs (e.g. [4])   <2ns
VDD dithering         No                 No              No                 Yes
Subthreshold          No                 No              No                 Yes
[1] J. Howard et al., ISSCC, pp. 22-33, 2010.
[2] D. Truong et al., Symp. VLSI, pp. 22-23, 2008.
[3] B. Nam et al., ISSCC, pp. 278-603, 2007.
[4] C. Zheng and D. Ma, ISSCC, pp. 204-205, 2010.
Thank you!
Any Questions?