Achieving Energy Efficiency
by
HW/SW Co-design
Shekhar Borkar
Intel Corp.
Oct 28, 2013
This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Outline
Compute roadmap & technology outlook
Challenges & solutions for:
– Compute,
– Memory, and
– Interconnect
HW/SW Co-design: not just a buzzword!
Summary
Compute Performance Roadmap
[Chart: peak compute performance (GFLOPS, log scale) vs. year, 1960-2020. The Mega, Giga, Tera, Peta, and Exa milestones arrive 12, 11, then 10 years apart, with client and hand-held platforms tracking the same trend at lower performance.]
From Giga to Exa, via Tera & Peta
[Three charts, 1986-2016, with Giga, Tera, Peta, and Exa marked: relative processor frequency (annotated 30X and 250X); processor performance via concurrency (36X, then 4,000X, heading for ~2.5M X at Exa); and power (80X, then 4,000X, heading for ~1M X at Exa).]
System performance increases faster
Parallelism continues to increase
Power & energy challenge continues
Where is the Energy Consumed?

A teraflop system today consumes ~1KW:
– Compute: 50W (50pJ per FLOP)
– Memory: 150W (0.1B/FLOP @ 1.5nJ per Byte)
– Communication: 100W (100pJ com per FLOP)
– Disk: 100W (10TB disk @ 1TB/disk @ 10W)
– Everything else: 600W (decode and control, address translations, power supply losses; bloated with inefficient architectural features)

Goal: ~20W total (5W, 2W, ~5W, ~3W, and 5W across the same categories)
The UHPC* Challenge

20 pJ/Operation, at every scale:
– 20 µW, Mega
– 20 mW, Giga
– 2 W, 100 Giga
– 20 W, Tera
– 20 KW, Peta
– 20 MW, Exa

*DARPA, Ubiquitous HPC Program
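A fixed energy-per-operation budget translates directly into these power envelopes; a quick arithmetic check:

```python
# Power implied by a fixed 20 pJ/operation budget at each
# performance scale (sanity-checking the slide's numbers).
E_OP = 20e-12  # joules per operation (20 pJ)

scales = {
    "Mega": 1e6, "Giga": 1e9, "100 Giga": 1e11,
    "Tera": 1e12, "Peta": 1e15, "Exa": 1e18,
}

for name, ops_per_sec in scales.items():
    watts = E_OP * ops_per_sec
    print(f"{name:>8}: {watts:.0e} W")
# Mega -> 20 uW, Giga -> 20 mW, 100 Giga -> 2 W,
# Tera -> 20 W, Peta -> 20 KW, Exa -> 20 MW
```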
Technology Scaling Outlook

[Charts, 45nm through 5nm: relative transistor density keeps scaling 1.75 - 2X per generation; relative frequency is almost flat; relative supply voltage is almost flat; relative energy shows some scaling, but falls well short of the ideal curve.]
Energy per Compute Operation

[Chart, 45nm-7nm (Source: Intel): energy (pJ) of a double-precision FP op, its DP register-file operand accesses, DRAM access (pJ/bit), and communication (pJ/bit), with annotated levels of 10, 25, 75, and 100 pJ/bit. Moving the data costs far more than the FP operation itself.]
Voltage Scaling

[Chart: normalized frequency, total power, leakage, and energy efficiency vs. normalized Vdd (0.3-0.9): as Vdd drops, frequency and total power fall, leakage falls more slowly, and energy efficiency rises steeply, when designed to voltage scale.]
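The shape of these curves can be reproduced with a toy alpha-power-law model; all constants below are illustrative assumptions, not measured silicon:

```python
# Toy model of why energy efficiency peaks near threshold voltage:
# dynamic energy/op scales as V^2, while leakage energy/op grows as
# frequency collapses near Vt. Constants are illustrative only.
VT = 0.3      # threshold voltage (V), assumed
ALPHA = 1.5   # velocity-saturation exponent, assumed

def freq(v):                      # alpha-power-law frequency model
    return (v - VT) ** ALPHA / v if v > VT else 0.0

def energy_per_op(v):
    e_dyn = v * v                 # CV^2, with C normalized to 1
    p_leak = 0.05 * v             # leakage power, assumed constant Ioff
    return e_dyn + p_leak / freq(v)   # leakage charged per operation

vs = [0.4 + 0.05 * i for i in range(17)]   # sweep 0.4 V .. 1.2 V
best = min(vs, key=energy_per_op)
print(f"Most efficient Vdd ~ {best:.2f} V")  # lands between Vt and nominal
```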
Near Threshold-Voltage (NTV)

[Charts, 65nm CMOS, 50°C: maximum frequency (MHz) and total power (mW) vs. supply voltage (0.2-1.4V), with the subthreshold region below ~320mV marked. Frequency spans ~4 orders of magnitude while total power spans <3 orders, so energy efficiency (GOPS/Watt) peaks near threshold, 9.6X above nominal, before active leakage power erodes it in subthreshold.]

H. Kaul et al, 16.6: ISSCC08
Experimental NTV Processor

[Die photo: IA-32 core (logic, scan, ROM, L1$-I, L1$-D, level shifters + clock spine), 1.1mm × 1.8mm, on a custom interposer in a 951-pin FCBGA package, on a legacy Socket-7 motherboard.]

Technology: 32nm High-K Metal Gate
Interconnect: 1 Poly, 9 Metal (Cu)
Transistors: 6 Million (Core)
Core Area: 2mm2

S. Jain, et al, "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS", ISSCC 2012
Wide Dynamic Range

[Chart: energy efficiency vs. voltage from zero to max Vdd; efficiency is low in subthreshold, peaks at NTV, and falls through the normal operating range. A ~5x gain was demonstrated.]

              Ultra-low Power   Energy Efficient   High Performance
Voltage       280 mV            0.45 V             1.2 V
Frequency     3 MHz             60 MHz             915 MHz
Power         2 mW              10 mW              737 mW
Efficiency    1500 Mips/W       5830 Mips/W        1240 Mips/W
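The ~5x figure can be recovered from the table itself, treating MHz/W as a proxy for Mips/W (the table's own numbers imply roughly one instruction per cycle):

```python
# Measured operating points of the experimental IA-32 NTV processor:
# name -> (frequency in MHz, power in mW), from the slide's table.
points = {
    "ultra-low power (280 mV)":  (3, 2),
    "energy efficient (0.45 V)": (60, 10),
    "high performance (1.2 V)":  (915, 737),
}

# MHz per watt as a stand-in for Mips/W (IPC ~ 1 per the table).
eff = {k: mhz / (mw / 1000.0) for k, (mhz, mw) in points.items()}
ratio = eff["energy efficient (0.45 V)"] / eff["high performance (1.2 V)"]
print(f"NTV vs. full-Vdd efficiency: {ratio:.1f}x")  # ~4.8x, i.e. ~5x
```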
Observations

[Chart: power breakdown (logic dynamic, logic leakage, memory dynamic, memory leakage) at Sub-Vt, NTV, and full Vdd; the leakage fraction grows as voltage drops.]

Leakage power dominates at low voltage; fine grain leakage power management is required.
Integration of Power Delivery

For efficiency and management: integrate the voltage regulator in standard OLGA packaging technology (converter chip and load chip on top; air-core inductors, input capacitors, and output capacitors on the bottom).

[Chart: integrated voltage regulator testchip, 70-90% efficiency vs. load current (0-20A), 2.4V-to-1.5V and 2.4V-to-1.2V conversion, L = 0.8-1.9nH, switching at 60-100MHz.]

Schrom et al, "A 100MHz 8-Phase Buck Converter Delivering 12A in 25mm2 Using Air-Core Inductors", APEC 2007

Move power delivery closer to the load for:
1. Improved efficiency
2. Fine grain power management
Compare Memory Technologies

[Table comparing memory technologies; Source: Intel.]

DRAM for first level capacity memory; NAND/PCM for next level storage.
Revise DRAM Architecture

Traditional DRAM (RAS/CAS across many pages):
– Activates many pages
– Lots of reads and writes (refresh)
– Small amount of read data is used
– Requires small number of pins

New DRAM architecture (directly addressed pages):
– Activates few pages
– Read and write (refresh) only what is needed
– All read data is used
– Requires large number of IOs (3D)

[Charts, 90nm-7nm: bandwidth demand (GB/sec) must increase exponentially while energy (pJ/bit) must decrease exponentially.]
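The activation argument can be made concrete with a toy energy model; the constants are illustrative assumptions, not vendor data:

```python
# Why activating a full DRAM row wastes energy when only a few bytes
# of it are actually consumed. Constants are illustrative assumptions.
PAGE_BYTES = 8192             # DRAM row size, assumed
E_ACTIVATE_PJ_PER_BYTE = 1.0  # activate + precharge cost, assumed

def pj_per_useful_byte(bytes_used):
    # The whole row is sensed regardless of how much of it is used.
    return E_ACTIVATE_PJ_PER_BYTE * PAGE_BYTES / bytes_used

print(pj_per_useful_byte(64))    # one cache line used: 128.0 pJ/byte
print(pj_per_useful_byte(8192))  # entire row used:       1.0 pJ/byte
```

Activating only what is needed, and using all the data read, closes exactly this gap.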
3D-Integration of DRAM and Logic

[Package cross-section: DRAM stack on top of a logic buffer chip.]

Logic buffer chip, technology optimized for:
– High speed signaling
– Energy efficient logic circuits
– Implementing intelligence

DRAM stack, technology optimized for:
– Memory density
– Lower cost

3D integration provides the best of both worlds.
1Tb/s HMC DRAM Prototype

• 3D integration technology
• 1Gb DRAM array
• 512 MB total DRAM/cube
• 128 GB/s bandwidth
• <10 pJ/bit energy

                     Bandwidth       Energy Efficiency
DDR-3 (Today)        10.66 GB/sec    50-75 pJ/bit
Hybrid Memory Cube   128 GB/sec      8 pJ/bit

10X higher bandwidth, 10X lower energy. Source: Micron
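The energy gap compounds with bandwidth: sustaining HMC-class bandwidth at each technology's pJ/bit cost gives very different interface power. A quick check:

```python
# Memory interface power at a given bandwidth, from the slide's
# pJ/bit figures: power = bytes/s * 8 bits/byte * energy/bit.
def interface_watts(gb_per_sec, pj_per_bit):
    return gb_per_sec * 1e9 * 8 * pj_per_bit * 1e-12

# Sustaining the HMC's 128 GB/s at each technology's energy cost:
print(interface_watts(128, 8))    # HMC, 8 pJ/bit:           ~8 W
print(interface_watts(128, 60))   # DDR-3-class, 60 pJ/bit: ~61 W
```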
Communication Energy

[Chart, log-log: energy/bit (0.01-100 pJ/bit) vs. interconnect distance (0.1-1000 cm), stepping up from on-die wires through chip-to-chip and board-to-board links to links between cabinets.]
On-die Interconnect

[Chart, 90nm-7nm (Source: Intel): compute energy falls steeply with scaling while on-die interconnect energy falls only slightly.]

Interconnect energy (per mm) reduces slower than compute energy, so on-die data movement energy will start to dominate.
Network On Chip (NoC)

80 Core TFLOP Chip (2006):
[Die photo: 21.72mm × 12.64mm, 2.0mm × 1.5mm tiles, I/O areas, PLL, TAP.]
– Tile power breakdown: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%
– 8 × 10 mesh, 32 bit links
– 320 GB/sec bisection BW @ 5 GHz

48 Core Single Chip Cloud (2009):
[Die photo: 26.5mm × 21.4mm, 2-core tiles, four DDR3-800 MCs, VRC, system interface + I/O, PLL, JTAG.]
– Chip power breakdown: cores 70%, MC & DDR3-800 19%, routers & 2D-mesh 10%, global clocking 1%
– 2-core clusters in a 6 × 4 mesh (why not 6 × 8?)
– 128 bit links
– 256 GB/sec bisection BW @ 2 GHz
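The quoted bisection bandwidths follow directly from the mesh dimensions, link widths, and clock rates, assuming bidirectional links and a cut across each mesh's longer dimension:

```python
# Recomputing the bisection bandwidth quoted for the two NoC chips.
# Cutting a mesh in half across its longer dimension severs
# min(rows, cols) links; each link carries traffic both ways.
def bisection_gb_per_sec(rows, cols, link_bits, ghz):
    links_cut = min(rows, cols)
    return links_cut * 2 * (link_bits / 8) * ghz

print(bisection_gb_per_sec(8, 10, 32, 5))   # 80-core TFLOP chip: 320.0
print(bisection_gb_per_sec(6, 4, 128, 2))   # 48-core SCC:        256.0
```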
On-chip Interconnect Analysis
Interconnect Structures

Shared bus (buses over short distances): 1 to 10 fJ/bit, 0 to 5mm, limited scalability
Shared memory (multi-ported): 10 to 100 fJ/bit, 1 to 5mm, limited scalability
Cross-bar switch: 0.1 to 1 pJ/bit, 2 to 10mm, moderate scalability
Packet-switched network: 1 to 3 pJ/bit, >5mm, scalable

[Diagram: hierarchical packet-switched system: first-level switches within a board, second-level switches within a cabinet, cluster switches joining cabinets into the full system.]
Hierarchical & Heterogeneous

Use a bus to connect cores over short distances (e.g., 4-core clusters); connect clusters with a 2nd-level bus; continue upward as a hierarchy of busses, or as hierarchical circuit- and packet-switched networks (routers) at the upper levels.
Electrical Interconnect < 1 Meter

[Chart, 1.2µ through 32nm (Source: ISSCC papers): data rate (Gb/sec) rises and energy (pJ/bit) falls across process generations.]

BW and energy efficiency improve, but not enough.
Electrical Interconnect Advances

Employ new, low-loss, non-traditional interconnects: low-loss flex connector, low-loss twinax, top-of-the-package connector. Co-optimize interconnects and circuits for energy efficiency.

[Chart: energy (pJ/bit) vs. channel length (cm) for HDI, flex, and twinax vs. the state of the art; plus measured eye diagrams.]

O'Mahony et al, "A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS", ISSCC 2010; and J. Jaussi, RESS004, IDF 2010
Optical Interconnect > 1 Meter

[Charts (Source: PETE Study group): energy (pJ/bit) vs. laser efficiency (1-20%) at 100% link utilization, and vs. link utilization (100% down to 10%) for laser efficiencies of 1%, 10%, and 20%.]

Energy in the supporting electronics is very low; link energy is dominated by the laser (efficiency). Sustained, high link utilization is required.
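A toy model shows why utilization matters so much when the laser dominates; all constants below are illustrative assumptions:

```python
# Toy optical-link energy model: the laser runs continuously, so its
# energy is amortized only over bits actually sent. Supporting
# electronics are a small fixed per-bit cost. Constants are assumed.
E_ELECTRONICS_PJ = 0.5   # per bit, assumed
E_PHOTONS_PJ = 0.2       # optical energy required per bit, assumed

def pj_per_bit(laser_efficiency, link_utilization):
    laser = E_PHOTONS_PJ / laser_efficiency / link_utilization
    return laser + E_ELECTRONICS_PJ

print(pj_per_bit(0.10, 1.0))   # efficient laser, fully utilized: ~2.5
print(pj_per_bit(0.10, 0.1))   # same laser at 10% utilization:  ~20.5
```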
Straw-man Exa Interconnect

[Topology: ~0.8 PF nodes grouped into clusters of 35 with 1,000 fibers each; 35 such clusters with 8,000 fibers each; 35 clusters of clusters form the system.]

Assume: 40 Gbps links, 10 pJ/b, $0.6/Gbps, 8B/FLOP, naïve tapering. Result: $35M and 217 MW.
Bandwidth Tapering

[Chart: Byte/FLOP at each level under severe tapering: Core 24, L1 6.65, L2 1.13, Chip 0.19, Board 0.03, Cab 0.0045, L1Sys 0.0005, L2Sys 0.00005, vs. a naïve 4X taper from 8 Byte/FLOP total.]

[Charts: data movement power by level. Naïve tapering: total DM power = 217 MW. Severe tapering: total DM power = 3 MW.]

Intelligent BW tapering is necessary.
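The tapering argument can be sketched numerically. The Byte/FLOP values below are the slide's severe taper; the per-level pJ/bit figures are illustrative assumptions (the Core/register level is excluded, as that traffic never leaves the datapath):

```python
# Sketch of the data-movement power argument at exascale.
# BW taper values are from the slide; pJ/bit values are assumed.
FLOPS = 1e18

taper = {"L1": 6.65, "L2": 1.13, "Chip": 0.19,
         "Board": 0.03, "Cab": 0.0045, "Sys": 0.0005}  # bytes/FLOP

pj_per_bit = {"L1": 0.05, "L2": 0.1, "Chip": 1,
              "Board": 2, "Cab": 5, "Sys": 10}  # assumed, grows w/ distance

total_w = sum(FLOPS * bpf * 8 * pj_per_bit[lvl] * 1e-12
              for lvl, bpf in taper.items())
print(f"Total data-movement power ~ {total_w / 1e6:.1f} MW")
# Single-digit MW: the same order as the slide's 3 MW severe-taper
# total, vs. 217 MW without aggressive tapering.
```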
HW-SW Co-design

[Layer stack: Applications / System SW / Programming Sys / Architecture / Circuits & Design. The applications and SW stack provide guidance downward for efficient system design; limitations, issues, and opportunities to exploit flow upward from circuits and design.]
Bottom-up Guidance

1. NTV reduces energy but exacerbates variations: small & fast cores, random distribution, temperature dependent.

2. Limited NTV for arrays (memory) due to stability issues: [chart: performance vs. voltage falls off disproportionately for memory vs. compute.] Memory arrays can be made larger to compensate.

3. On-die interconnect energy (per mm) does not reduce as much as compute: [chart, 45nm-7nm: ~6X reduction in compute energy vs. ~1.6X in interconnect energy.]

4. At NTV, leakage power is a substantial portion of the total power: [chart: SD leakage power vs. node (45nm-5nm) at 100%, 75%, 50%, and 40% Vdd, rising with scaling and with increasing variations.] Expect ~50% leakage; idle hardware consumes energy.

5. DRAM energy scales, but not enough: 50 pJ/b today, 8 pJ/b demonstrated (3D Hybrid Memory Cube), need < 2 pJ/b.

6. System interconnect limited by laser energy and cost: [chart: data movement power (MW) across die, boards, cabinet, islands, clusters, and system, with 40 Gbps photonic links @ 10 pJ/b.] BW tapering and locality awareness are necessary.
Straw-man Architecture at NTV

                Full Vdd       50% Vdd
Technology      7nm, 2018
Die area        500 mm2
Cores           2048
Frequency       4.2 GHz        600 MHz
TFLOPs          17.2           2.5
Power           600 Watts      37 Watts
E Efficiency    34 pJ/Flop     15 pJ/Flop

[Hierarchy diagram: the simplest core (600K transistors: execution unit, RF, logic) is grouped into processing elements; eight PEs plus a service core share a 1MB L2 at the first level; many such groups share a next-level cache over an interconnect; the processor ties the clusters together with a last-level cache and interconnect.]

Reduced frequency and flops, but reduced power and improved energy efficiency: compute energy efficiency close to the Exascale goal.
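The table's numbers are self-consistent; a quick check, assuming 2 flops per core per cycle (e.g., one fused multiply-add), which is what the TFLOPs figures imply:

```python
# Checking the straw-man NTV table's arithmetic.
CORES = 2048
FLOPS_PER_CYCLE = 2  # assumed (e.g., one FMA per cycle)

def tflops(ghz):
    return CORES * ghz * FLOPS_PER_CYCLE / 1000

def pj_per_flop(watts, tf):
    return watts / tf  # W per TFLOP/s is numerically pJ per FLOP

print(tflops(4.2), pj_per_flop(600, tflops(4.2)))  # ~17.2 TF, ~35 pJ/Flop
print(tflops(0.6), pj_per_flop(37, tflops(0.6)))   # ~2.5 TF,  ~15 pJ/Flop
```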
SW Challenges

1. Extreme parallelism (1000X due to Exa, additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move thread to data if necessary
4. Fine grain resource management (objective function)
5. Applications and algorithms incorporate the paradigm change

These challenges span both the programming model and the execution model.
Programming & Execution Model

Event driven tasks (EDT):
– Dataflow inspired, tiny codelets (self contained)
– Non blocking, no preemption

Programming model:
– Separation of concerns: domain specification & HW mapping
– Express data locality with hierarchical tiling
– Global, shared, non-coherent address space
– Optimization and auto-generation of EDTs (HW specific)

Execution model:
– Dynamic, event-driven scheduling, non-blocking
– Dynamic decision to move computation to data
– Observation based adaptation (self-awareness)
– Implemented in the runtime environment
– Separation of concerns: user application, control, and resource management
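The EDT style can be sketched as a tiny scheduler. This is a hypothetical illustration of the concept only (the `Runtime`/`satisfy` names are invented here, not any specific runtime's API): codelets become runnable when all their input events are satisfied, then run to completion without blocking.

```python
# Minimal sketch of event-driven tasks (EDTs): self-contained
# codelets enabled by data availability; no blocking, no preemption.
from collections import deque

class EDT:
    def __init__(self, fn, deps):
        self.fn, self.pending = fn, set(deps)

class Runtime:
    def __init__(self):
        self.waiting, self.ready = [], deque()

    def create(self, fn, deps=()):
        t = EDT(fn, deps)
        (self.ready.append if not t.pending else self.waiting.append)(t)
        return t

    def satisfy(self, event):
        # An event fires: move any now-enabled codelets to the ready queue.
        for t in list(self.waiting):
            t.pending.discard(event)
            if not t.pending:
                self.waiting.remove(t)
                self.ready.append(t)

    def run(self):
        while self.ready:
            self.ready.popleft().fn(self)   # each codelet runs to completion

rt = Runtime()
rt.create(lambda r: (print("produce"), r.satisfy("data")))
rt.create(lambda r: print("consume"), deps=["data"])
rt.run()   # prints "produce" then "consume"
```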
Over-provisioned, Introspectively Resource Managed System

Addressing variations: at NTV, nominally identical cores end up Fast, Medium, or Slow.
1. Provide more compute HW
2. Law of large numbers
3. Static profile

System SW implements the introspective execution model:
1. Schedule threads based on objectives and resources
2. Dynamically control and manage resources
3. Identify sensors and functions in HW for implementation

Dynamic reconfiguration for:
1. Energy efficiency
2. Latency
3. Dynamic resource management

[Diagram: fine grain resource management across the processor chip (16 clusters): groups of eight PEs with a service core and 1MB L2, an 8MB shared LLC per cluster, and a 64MB shared LLC and interconnect chip-wide.]

Sensors for introspection:
1. Energy consumption
2. Instantaneous power
3. Computations
4. Data movement
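The objective-driven scheduling idea can be sketched as follows; the core grades, relative speeds, and energies below are hypothetical:

```python
# Hypothetical sketch of objective-driven scheduling over
# variation-affected cores: the same Fast/Medium/Slow pool is
# assigned differently depending on the current objective.
CORES = [("F", 1.0, 3.0), ("F", 1.0, 3.0), ("M", 0.7, 1.5),
         ("M", 0.7, 1.5), ("S", 0.4, 1.0), ("S", 0.4, 1.0)]
# (speed grade, relative speed, relative energy per unit work), assumed

def schedule(n_threads, objective):
    if objective == "latency":
        ranked = sorted(CORES, key=lambda c: -c[1])  # fastest cores first
    else:  # "energy"
        ranked = sorted(CORES, key=lambda c: c[2])   # cheapest cores first
    return [c[0] for c in ranked[:n_threads]]

print(schedule(2, "latency"))  # ['F', 'F']
print(schedule(2, "energy"))   # ['S', 'S']
```

A real runtime would drive this choice from the introspection sensors (energy, power, computation, data movement) rather than a static table.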
Summary

Power & energy challenge continues
Opportunistically employ NTV operation
3D integration for DRAM
Communication energy will far exceed computation energy
Data locality will be paramount
Revolutionary software stack needed
Take HW/SW co-design beyond just a buzzword!