Achieving Energy Efficiency
by
HW/SW Co-design
Shekhar Borkar
Intel Corp.
Oct 28, 2013
This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Outline
Compute roadmap & technology outlook
Challenges & solutions for:
– Compute,
– Memory, and
– Interconnect
HW/SW Co-design: not just a buzzword!
Summary
Compute Performance Roadmap
[Chart: peak compute performance (GFLOPS, log scale) vs. year, 1960-2020. The Mega, Giga, Tera, Peta, and Exa milestones arrive 12, 11, then 10 years apart, with client and hand-held platforms tracking the same trend at lower performance.]
From Giga to Exa, via Tera & Peta
[Three charts, 1986-2016, with Giga, Tera, Peta, and Exa marked: relative processor frequency (annotated 30X and 250X); processor performance via concurrency (36X, then 4,000X, heading for ~2.5M X at Exa); and power (80X, then 4,000X, heading for ~1M X at Exa).]
System performance increases faster
Parallelism continues to increase
Power & energy challenge continues
Where is the Energy Consumed?

A teraflop system today consumes ~1KW:
– Compute: 50W (50pJ per FLOP)
– Memory: 150W (0.1B/FLOP @ 1.5nJ per Byte)
– Communication: 100W (100pJ com per FLOP)
– Disk: 100W (10TB disk @ 1TB/disk @ 10W)
– Everything else: 600W (decode and control, address translations, power supply losses; bloated with inefficient architectural features)

Goal: ~20W total (5W, 2W, ~5W, ~3W, and 5W across the same categories)
The UHPC* Challenge

20 pJ/Operation, at every scale:
– 20 µW, Mega
– 20 mW, Giga
– 2 W, 100 Giga
– 20 W, Tera
– 20 KW, Peta
– 20 MW, Exa

*DARPA, Ubiquitous HPC Program
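A fixed energy-per-operation budget translates directly into these power envelopes; a quick arithmetic check:

```python
# Power implied by a fixed 20 pJ/operation budget at each
# performance scale (sanity-checking the slide's numbers).
E_OP = 20e-12  # joules per operation (20 pJ)

scales = {
    "Mega": 1e6, "Giga": 1e9, "100 Giga": 1e11,
    "Tera": 1e12, "Peta": 1e15, "Exa": 1e18,
}

for name, ops_per_sec in scales.items():
    watts = E_OP * ops_per_sec
    print(f"{name:>8}: {watts:.0e} W")
# Mega -> 20 uW, Giga -> 20 mW, 100 Giga -> 2 W,
# Tera -> 20 W, Peta -> 20 KW, Exa -> 20 MW
```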
Technology Scaling Outlook

[Charts, 45nm through 5nm: relative transistor density keeps scaling 1.75 - 2X per generation; relative frequency is almost flat; relative supply voltage is almost flat; relative energy shows some scaling, but falls well short of the ideal curve.]
Energy per Compute Operation

[Chart, 45nm-7nm (Source: Intel): energy (pJ) of a double-precision FP op, its DP register-file operand accesses, DRAM access (pJ/bit), and communication (pJ/bit), with annotated levels of 10, 25, 75, and 100 pJ/bit. Moving the data costs far more than the FP operation itself.]
Voltage Scaling

[Chart: normalized frequency, total power, leakage, and energy efficiency vs. normalized Vdd (0.3-0.9): as Vdd drops, frequency and total power fall, leakage falls more slowly, and energy efficiency rises steeply, when designed to voltage scale.]
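The shape of these curves can be reproduced with a toy alpha-power-law model; all constants below are illustrative assumptions, not measured silicon:

```python
# Toy model of why energy efficiency peaks near threshold voltage:
# dynamic energy/op scales as V^2, while leakage energy/op grows as
# frequency collapses near Vt. Constants are illustrative only.
VT = 0.3      # threshold voltage (V), assumed
ALPHA = 1.5   # velocity-saturation exponent, assumed

def freq(v):                      # alpha-power-law frequency model
    return (v - VT) ** ALPHA / v if v > VT else 0.0

def energy_per_op(v):
    e_dyn = v * v                 # CV^2, with C normalized to 1
    p_leak = 0.05 * v             # leakage power, assumed constant Ioff
    return e_dyn + p_leak / freq(v)   # leakage charged per operation

vs = [0.4 + 0.05 * i for i in range(17)]   # sweep 0.4 V .. 1.2 V
best = min(vs, key=energy_per_op)
print(f"Most efficient Vdd ~ {best:.2f} V")  # lands between Vt and nominal
```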
Near Threshold-Voltage (NTV)

[Charts, 65nm CMOS, 50°C: maximum frequency (MHz) and total power (mW) vs. supply voltage (0.2-1.4V), with the subthreshold region below ~320mV marked. Frequency spans ~4 orders of magnitude while total power spans <3 orders, so energy efficiency (GOPS/Watt) peaks near threshold, 9.6X above nominal, before active leakage power erodes it in subthreshold.]

H. Kaul et al, 16.6: ISSCC08
Experimental NTV Processor

[Die photo: IA-32 core (logic, scan, ROM, L1$-I, L1$-D, level shifters + clock spine), 1.1mm × 1.8mm, on a custom interposer in a 951-pin FCBGA package, on a legacy Socket-7 motherboard.]

Technology: 32nm High-K Metal Gate
Interconnect: 1 Poly, 9 Metal (Cu)
Transistors: 6 Million (Core)
Core Area: 2mm2

S. Jain, et al, "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS", ISSCC 2012
Wide Dynamic Range

[Chart: energy efficiency vs. voltage from zero to max Vdd; efficiency is low in subthreshold, peaks at NTV, and falls through the normal operating range. A ~5x gain was demonstrated.]

              Ultra-low Power   Energy Efficient   High Performance
Voltage       280 mV            0.45 V             1.2 V
Frequency     3 MHz             60 MHz             915 MHz
Power         2 mW              10 mW              737 mW
Efficiency    1500 Mips/W       5830 Mips/W        1240 Mips/W
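The ~5x figure can be recovered from the table itself, treating MHz/W as a proxy for Mips/W (the table's own numbers imply roughly one instruction per cycle):

```python
# Measured operating points of the experimental IA-32 NTV processor:
# name -> (frequency in MHz, power in mW), from the slide's table.
points = {
    "ultra-low power (280 mV)":  (3, 2),
    "energy efficient (0.45 V)": (60, 10),
    "high performance (1.2 V)":  (915, 737),
}

# MHz per watt as a stand-in for Mips/W (IPC ~ 1 per the table).
eff = {k: mhz / (mw / 1000.0) for k, (mhz, mw) in points.items()}
ratio = eff["energy efficient (0.45 V)"] / eff["high performance (1.2 V)"]
print(f"NTV vs. full-Vdd efficiency: {ratio:.1f}x")  # ~4.8x, i.e. ~5x
```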
Observations

[Chart: power breakdown (logic dynamic, logic leakage, memory dynamic, memory leakage) at Sub-Vt, NTV, and full Vdd; the leakage fraction grows as voltage drops.]

Leakage power dominates at low voltage; fine grain leakage power management is required.
Integration of Power Delivery

For efficiency and management: integrate the voltage regulator in standard OLGA packaging technology (converter chip and load chip on top; air-core inductors, input capacitors, and output capacitors on the bottom).

[Chart: integrated voltage regulator testchip, 70-90% efficiency vs. load current (0-20A), 2.4V-to-1.5V and 2.4V-to-1.2V conversion, L = 0.8-1.9nH, switching at 60-100MHz.]

Schrom et al, "A 100MHz 8-Phase Buck Converter Delivering 12A in 25mm2 Using Air-Core Inductors", APEC 2007

Move power delivery closer to the load for:
1. Improved efficiency
2. Fine grain power management
Compare Memory Technologies

[Table comparing memory technologies; Source: Intel.]

DRAM for first level capacity memory; NAND/PCM for next level storage.
Revise DRAM Architecture

Traditional DRAM (RAS/CAS across many pages):
– Activates many pages
– Lots of reads and writes (refresh)
– Small amount of read data is used
– Requires small number of pins

New DRAM architecture (directly addressed pages):
– Activates few pages
– Read and write (refresh) only what is needed
– All read data is used
– Requires large number of IOs (3D)

[Charts, 90nm-7nm: bandwidth demand (GB/sec) must increase exponentially while energy (pJ/bit) must decrease exponentially.]
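The activation argument can be made concrete with a toy energy model; the constants are illustrative assumptions, not vendor data:

```python
# Why activating a full DRAM row wastes energy when only a few bytes
# of it are actually consumed. Constants are illustrative assumptions.
PAGE_BYTES = 8192             # DRAM row size, assumed
E_ACTIVATE_PJ_PER_BYTE = 1.0  # activate + precharge cost, assumed

def pj_per_useful_byte(bytes_used):
    # The whole row is sensed regardless of how much of it is used.
    return E_ACTIVATE_PJ_PER_BYTE * PAGE_BYTES / bytes_used

print(pj_per_useful_byte(64))    # one cache line used: 128.0 pJ/byte
print(pj_per_useful_byte(8192))  # entire row used:       1.0 pJ/byte
```

Activating only what is needed, and using all the data read, closes exactly this gap.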
3D-Integration of DRAM and Logic

[Package cross-section: DRAM stack on top of a logic buffer chip.]

Logic buffer chip, technology optimized for:
– High speed signaling
– Energy efficient logic circuits
– Implementing intelligence

DRAM stack, technology optimized for:
– Memory density
– Lower cost

3D integration provides the best of both worlds.
1Tb/s HMC DRAM Prototype

• 3D integration technology
• 1Gb DRAM array
• 512 MB total DRAM/cube
• 128 GB/s bandwidth
• <10 pJ/bit energy

                     Bandwidth       Energy Efficiency
DDR-3 (Today)        10.66 GB/sec    50-75 pJ/bit
Hybrid Memory Cube   128 GB/sec      8 pJ/bit

10X higher bandwidth, 10X lower energy. Source: Micron
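The energy gap compounds with bandwidth: sustaining HMC-class bandwidth at each technology's pJ/bit cost gives very different interface power. A quick check:

```python
# Memory interface power at a given bandwidth, from the slide's
# pJ/bit figures: power = bytes/s * 8 bits/byte * energy/bit.
def interface_watts(gb_per_sec, pj_per_bit):
    return gb_per_sec * 1e9 * 8 * pj_per_bit * 1e-12

# Sustaining the HMC's 128 GB/s at each technology's energy cost:
print(interface_watts(128, 8))    # HMC, 8 pJ/bit:           ~8 W
print(interface_watts(128, 60))   # DDR-3-class, 60 pJ/bit: ~61 W
```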
Communication Energy

[Chart, log-log: energy/bit (0.01-100 pJ/bit) vs. interconnect distance (0.1-1000 cm), stepping up from on-die wires through chip-to-chip and board-to-board links to links between cabinets.]
On-die Interconnect

[Chart, 90nm-7nm (Source: Intel): compute energy falls steeply with scaling while on-die interconnect energy falls only slightly.]

Interconnect energy (per mm) reduces slower than compute energy, so on-die data movement energy will start to dominate.
Network On Chip (NoC)

80 Core TFLOP Chip (2006):
[Die photo: 21.72mm × 12.64mm, 2.0mm × 1.5mm tiles, I/O areas, PLL, TAP.]
– Tile power breakdown: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%
– 8 × 10 mesh, 32 bit links
– 320 GB/sec bisection BW @ 5 GHz

48 Core Single Chip Cloud (2009):
[Die photo: 26.5mm × 21.4mm, 2-core tiles, four DDR3-800 MCs, VRC, system interface + I/O, PLL, JTAG.]
– Chip power breakdown: cores 70%, MC & DDR3-800 19%, routers & 2D-mesh 10%, global clocking 1%
– 2-core clusters in a 6 × 4 mesh (why not 6 × 8?)
– 128 bit links
– 256 GB/sec bisection BW @ 2 GHz
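The quoted bisection bandwidths follow directly from the mesh dimensions, link widths, and clock rates, assuming bidirectional links and a cut across each mesh's longer dimension:

```python
# Recomputing the bisection bandwidth quoted for the two NoC chips.
# Cutting a mesh in half across its longer dimension severs
# min(rows, cols) links; each link carries traffic both ways.
def bisection_gb_per_sec(rows, cols, link_bits, ghz):
    links_cut = min(rows, cols)
    return links_cut * 2 * (link_bits / 8) * ghz

print(bisection_gb_per_sec(8, 10, 32, 5))   # 80-core TFLOP chip: 320.0
print(bisection_gb_per_sec(6, 4, 128, 2))   # 48-core SCC:        256.0
```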
On-chip Interconnect Analysis
Interconnect Structures

Shared bus (buses over short distances): 1 to 10 fJ/bit, 0 to 5mm, limited scalability
Shared memory (multi-ported): 10 to 100 fJ/bit, 1 to 5mm, limited scalability
Cross-bar switch: 0.1 to 1 pJ/bit, 2 to 10mm, moderate scalability
Packet-switched network: 1 to 3 pJ/bit, >5mm, scalable

[Diagram: hierarchical packet-switched system: first-level switches within a board, second-level switches within a cabinet, cluster switches joining cabinets into the full system.]
Hierarchical & Heterogeneous

Use a bus to connect cores over short distances (e.g., 4-core clusters); connect clusters with a 2nd-level bus; continue upward as a hierarchy of busses, or as hierarchical circuit- and packet-switched networks (routers) at the upper levels.
Electrical Interconnect < 1 Meter

[Chart, 1.2µ through 32nm (Source: ISSCC papers): data rate (Gb/sec) rises and energy (pJ/bit) falls across process generations.]

BW and energy efficiency improve, but not enough.
Electrical Interconnect Advances

Employ new, low-loss, non-traditional interconnects: low-loss flex connector, low-loss twinax, top-of-the-package connector. Co-optimize interconnects and circuits for energy efficiency.

[Chart: energy (pJ/bit) vs. channel length (cm) for HDI, flex, and twinax vs. the state of the art; plus measured eye diagrams.]

O'Mahony et al, "A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS", ISSCC 2010; and J. Jaussi, RESS004, IDF 2010
Optical Interconnect > 1 Meter

[Charts (Source: PETE Study group): energy (pJ/bit) vs. laser efficiency (1-20%) at 100% link utilization, and vs. link utilization (100% down to 10%) for laser efficiencies of 1%, 10%, and 20%.]

Energy in the supporting electronics is very low; link energy is dominated by the laser (efficiency). Sustained, high link utilization is required.
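A toy model shows why utilization matters so much when the laser dominates; all constants below are illustrative assumptions:

```python
# Toy optical-link energy model: the laser runs continuously, so its
# energy is amortized only over bits actually sent. Supporting
# electronics are a small fixed per-bit cost. Constants are assumed.
E_ELECTRONICS_PJ = 0.5   # per bit, assumed
E_PHOTONS_PJ = 0.2       # optical energy required per bit, assumed

def pj_per_bit(laser_efficiency, link_utilization):
    laser = E_PHOTONS_PJ / laser_efficiency / link_utilization
    return laser + E_ELECTRONICS_PJ

print(pj_per_bit(0.10, 1.0))   # efficient laser, fully utilized: ~2.5
print(pj_per_bit(0.10, 0.1))   # same laser at 10% utilization:  ~20.5
```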
Straw-man Exa Interconnect

[Topology: ~0.8 PF nodes grouped into clusters of 35 with 1,000 fibers each; 35 such clusters with 8,000 fibers each; 35 clusters of clusters form the system.]

Assume: 40 Gbps links, 10 pJ/b, $0.6/Gbps, 8B/FLOP, naïve tapering. Result: $35M and 217 MW.
Bandwidth Tapering

[Chart: Byte/FLOP at each level under severe tapering: Core 24, L1 6.65, L2 1.13, Chip 0.19, Board 0.03, Cab 0.0045, L1Sys 0.0005, L2Sys 0.00005, vs. a naïve 4X taper from 8 Byte/FLOP total.]

[Charts: data movement power by level. Naïve tapering: total DM power = 217 MW. Severe tapering: total DM power = 3 MW.]

Intelligent BW tapering is necessary.
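The tapering argument can be sketched numerically. The Byte/FLOP values below are the slide's severe taper; the per-level pJ/bit figures are illustrative assumptions (the Core/register level is excluded, as that traffic never leaves the datapath):

```python
# Sketch of the data-movement power argument at exascale.
# BW taper values are from the slide; pJ/bit values are assumed.
FLOPS = 1e18

taper = {"L1": 6.65, "L2": 1.13, "Chip": 0.19,
         "Board": 0.03, "Cab": 0.0045, "Sys": 0.0005}  # bytes/FLOP

pj_per_bit = {"L1": 0.05, "L2": 0.1, "Chip": 1,
              "Board": 2, "Cab": 5, "Sys": 10}  # assumed, grows w/ distance

total_w = sum(FLOPS * bpf * 8 * pj_per_bit[lvl] * 1e-12
              for lvl, bpf in taper.items())
print(f"Total data-movement power ~ {total_w / 1e6:.1f} MW")
# Single-digit MW: the same order as the slide's 3 MW severe-taper
# total, vs. 217 MW without aggressive tapering.
```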
HW-SW Co-design

[Layer stack: Applications / System SW / Programming Sys / Architecture / Circuits & Design. The applications and SW stack provide guidance downward for efficient system design; limitations, issues, and opportunities to exploit flow upward from circuits and design.]
Bottom-up Guidance

1. NTV reduces energy but exacerbates variations: small & fast cores, random distribution, temperature dependent.

2. Limited NTV for arrays (memory) due to stability issues: [chart: performance vs. voltage falls off disproportionately for memory vs. compute.] Memory arrays can be made larger to compensate.

3. On-die interconnect energy (per mm) does not reduce as much as compute: [chart, 45nm-7nm: ~6X reduction in compute energy vs. ~1.6X in interconnect energy.]

4. At NTV, leakage power is a substantial portion of the total power: [chart: SD leakage power vs. node (45nm-5nm) at 100%, 75%, 50%, and 40% Vdd, rising with scaling and with increasing variations.] Expect ~50% leakage; idle hardware consumes energy.

5. DRAM energy scales, but not enough: 50 pJ/b today, 8 pJ/b demonstrated (3D Hybrid Memory Cube), need < 2 pJ/b.

6. System interconnect limited by laser energy and cost: [chart: data movement power (MW) across die, boards, cabinet, islands, clusters, and system, with 40 Gbps photonic links @ 10 pJ/b.] BW tapering and locality awareness are necessary.
Straw-man Architecture at NTV

                Full Vdd       50% Vdd
Technology      7nm, 2018
Die area        500 mm2
Cores           2048
Frequency       4.2 GHz        600 MHz
TFLOPs          17.2           2.5
Power           600 Watts      37 Watts
E Efficiency    34 pJ/Flop     15 pJ/Flop

[Hierarchy diagram: the simplest core (600K transistors: execution unit, RF, logic) is grouped into processing elements; eight PEs plus a service core share a 1MB L2 at the first level; many such groups share a next-level cache over an interconnect; the processor ties the clusters together with a last-level cache and interconnect.]

Reduced frequency and flops, but reduced power and improved energy efficiency: compute energy efficiency close to the Exascale goal.
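The table's numbers are self-consistent; a quick check, assuming 2 flops per core per cycle (e.g., one fused multiply-add), which is what the TFLOPs figures imply:

```python
# Checking the straw-man NTV table's arithmetic.
CORES = 2048
FLOPS_PER_CYCLE = 2  # assumed (e.g., one FMA per cycle)

def tflops(ghz):
    return CORES * ghz * FLOPS_PER_CYCLE / 1000

def pj_per_flop(watts, tf):
    return watts / tf  # W per TFLOP/s is numerically pJ per FLOP

print(tflops(4.2), pj_per_flop(600, tflops(4.2)))  # ~17.2 TF, ~35 pJ/Flop
print(tflops(0.6), pj_per_flop(37, tflops(0.6)))   # ~2.5 TF,  ~15 pJ/Flop
```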
SW Challenges

1. Extreme parallelism (1000X due to Exa, additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move thread to data if necessary
4. Fine grain resource management (objective function)
5. Applications and algorithms incorporate the paradigm change

These challenges span both the programming model and the execution model.
Programming & Execution Model

Event driven tasks (EDT):
– Dataflow inspired, tiny codelets (self contained)
– Non blocking, no preemption

Programming model:
– Separation of concerns: domain specification & HW mapping
– Express data locality with hierarchical tiling
– Global, shared, non-coherent address space
– Optimization and auto-generation of EDTs (HW specific)

Execution model:
– Dynamic, event-driven scheduling, non-blocking
– Dynamic decision to move computation to data
– Observation based adaptation (self-awareness)
– Implemented in the runtime environment
– Separation of concerns: user application, control, and resource management
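The EDT style can be sketched as a tiny scheduler. This is a hypothetical illustration of the concept only (the `Runtime`/`satisfy` names are invented here, not any specific runtime's API): codelets become runnable when all their input events are satisfied, then run to completion without blocking.

```python
# Minimal sketch of event-driven tasks (EDTs): self-contained
# codelets enabled by data availability; no blocking, no preemption.
from collections import deque

class EDT:
    def __init__(self, fn, deps):
        self.fn, self.pending = fn, set(deps)

class Runtime:
    def __init__(self):
        self.waiting, self.ready = [], deque()

    def create(self, fn, deps=()):
        t = EDT(fn, deps)
        (self.ready.append if not t.pending else self.waiting.append)(t)
        return t

    def satisfy(self, event):
        # An event fires: move any now-enabled codelets to the ready queue.
        for t in list(self.waiting):
            t.pending.discard(event)
            if not t.pending:
                self.waiting.remove(t)
                self.ready.append(t)

    def run(self):
        while self.ready:
            self.ready.popleft().fn(self)   # each codelet runs to completion

rt = Runtime()
rt.create(lambda r: (print("produce"), r.satisfy("data")))
rt.create(lambda r: print("consume"), deps=["data"])
rt.run()   # prints "produce" then "consume"
```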
Over-provisioned, Introspectively Resource Managed System

Addressing variations: at NTV, nominally identical cores end up Fast, Medium, or Slow.
1. Provide more compute HW
2. Law of large numbers
3. Static profile

System SW implements the introspective execution model:
1. Schedule threads based on objectives and resources
2. Dynamically control and manage resources
3. Identify sensors and functions in HW for implementation

Dynamic reconfiguration for:
1. Energy efficiency
2. Latency
3. Dynamic resource management

[Diagram: fine grain resource management across the processor chip (16 clusters): groups of eight PEs with a service core and 1MB L2, an 8MB shared LLC per cluster, and a 64MB shared LLC and interconnect chip-wide.]

Sensors for introspection:
1. Energy consumption
2. Instantaneous power
3. Computations
4. Data movement
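The objective-driven scheduling idea can be sketched as follows; the core grades, relative speeds, and energies below are hypothetical:

```python
# Hypothetical sketch of objective-driven scheduling over
# variation-affected cores: the same Fast/Medium/Slow pool is
# assigned differently depending on the current objective.
CORES = [("F", 1.0, 3.0), ("F", 1.0, 3.0), ("M", 0.7, 1.5),
         ("M", 0.7, 1.5), ("S", 0.4, 1.0), ("S", 0.4, 1.0)]
# (speed grade, relative speed, relative energy per unit work), assumed

def schedule(n_threads, objective):
    if objective == "latency":
        ranked = sorted(CORES, key=lambda c: -c[1])  # fastest cores first
    else:  # "energy"
        ranked = sorted(CORES, key=lambda c: c[2])   # cheapest cores first
    return [c[0] for c in ranked[:n_threads]]

print(schedule(2, "latency"))  # ['F', 'F']
print(schedule(2, "energy"))   # ['S', 'S']
```

A real runtime would drive this choice from the introspection sensors (energy, power, computation, data movement) rather than a static table.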
Summary

Power & energy challenge continues
Opportunistically employ NTV operation
3D integration for DRAM
Communication energy will far exceed computation energy
Data locality will be paramount
Revolutionary software stack needed
Take HW/SW co-design beyond just a buzzword!