Page 1: Centip3De: A 64-Core, 3D Stacked, Near-Threshold Systemtnm.engin.umich.edu/wp-content/uploads/sites/353/... · University of Michigan 11 1 1 Centip3De: A 64-Core, 3D Stacked, Near-Threshold

Centip3De: A 64-Core, 3D Stacked, Near-Threshold System

Ronald G. Dreslinski

David Fick, Bharan Giridhar,

Gyouho Kim, Sangwon Seo, Matthew Fojtik,

Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim,

Nurrachman Liu, Michael Wieckowski, Gregory Chen,

Trevor Mudge, Dennis Sylvester, David Blaauw

University of Michigan

The Problem of Power

The emerging dilemma: more and more gates can fit on a die, but cooling constraints are restricting their use

Circuit supply voltages are no longer scaling…

$\text{Power density} \sim \frac{C V_{dd}^{2} f}{A} + \frac{I_{leak} V_{dd}}{A}$

Dynamic dominates

A = gate area, scaling as $1/s^2$

C = capacitance, scaling as $< 1/s$

Power does not decrease at the same rate that transistor count increases, resulting in increased energy density
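The scaling argument on this slide can be checked numerically. The following is a minimal sketch rather than the authors' model: it assumes power density is the sum of a dynamic term (C·Vdd²·f) and a leakage term (Ileak·Vdd) divided by gate area A, holds Vdd and f fixed, scales A by 1/s², and takes C scaling as exactly 1/s as a bounding case. The function names and example values are hypothetical.

```python
def power_density(c, vdd, f, i_leak, area):
    """Power per unit area: (dynamic + leakage) / area."""
    dynamic = c * vdd ** 2 * f
    leakage = i_leak * vdd
    return (dynamic + leakage) / area


def scaled_density_ratio(s, c=1.0, vdd=1.0, f=1.0, i_leak=0.0, area=1.0):
    """Ratio of power density after one scaling step to the value before it.

    Illustrative assumptions: area scales as 1/s^2, capacitance as 1/s,
    supply voltage and frequency stay fixed, leakage current is unchanged.
    """
    before = power_density(c, vdd, f, i_leak, area)
    after = power_density(c / s, vdd, f, i_leak, area / s ** 2)
    return after / before


if __name__ == "__main__":
    for s in (1.4, 2.0):
        ratio = scaled_density_ratio(s)
        print(f"s = {s}: power density grows by {ratio:.2f}x")
```

Under these assumptions the dynamic power density grows by a factor of s per generation, which is the "increased energy density" the slide refers to. Since the slide gives C as scaling more slowly than 1/s, the real growth would be at least this large.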

Today: Super-Vth, High Performance, Power Constrained

[Figure: Normalized CPU Metrics. Energy/Operation and Log(Delay) versus Supply Voltage (axis marks: 0, Vth, Vnom), Super-Vth region]

Large gate overdrive favors performance with unsustainable power density

Must design within fixed TDP

Goal: maintain performance, improve Energy/Operation


Subthreshold Design

[Figure: Energy/Operation and Log(Delay) versus Supply Voltage (axis marks: 0, Vth, Vnom), Super-Vth and Sub-Vth regions; annotations: 500 – 1000X, 12-16X]

Operating in sub-threshold yields large power gains at the expense of performance.

Applications: sensors, medical

Phoenix 2 Processor, ISSCC’10

Near-Threshold Computing (NTC)

[Figure: Energy/Operation and Log(Delay) versus Supply Voltage (axis marks: 0, Vth, Vnom), NTC, Super-Vth, and Sub-Vth regions; annotations: ~10X, ~50-100X, ~2X, ~6-8X]

Near-Threshold Computing (NTC):

• >60X power reduction

• 6-8X energy reduction

• Enables 3D integration
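The power and energy figures above are related by simple arithmetic: power is energy per operation times operations per second, so the power reduction is the energy reduction multiplied by the slowdown. The sketch below assumes that the ~10X annotation in the figure is the delay increase at the NTC operating point; the transcript does not label it, so treat that pairing as an assumption. The function name is hypothetical.

```python
def power_reduction(energy_reduction, slowdown):
    """Power = energy/op * ops/s, so power falls by energy gain times slowdown."""
    return energy_reduction * slowdown


if __name__ == "__main__":
    assumed_slowdown = 10  # assumption: the ~10X annotation is the delay increase
    for energy_gain in (6, 8):
        factor = power_reduction(energy_gain, assumed_slowdown)
        print(f"{energy_gain}X energy at {assumed_slowdown}X delay: {factor}X lower power")
```

With a 6-8X energy reduction and a roughly 10X slowdown, this gives a 60-80X power reduction, in line with the ">60X" bullet.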

Architectural Impact of NTC

Caches have higher Vopt and operating frequency

Smaller activity rate when compared to core logic

Leakage is a larger proportion of total power in caches

New Architectures Possible

[Figure: Core, L1, and L2 blocks, labeled Vt]

Proposed NTC Architecture

SRAM is run at a higher VDD

Caches operate faster than core

Can introduce clustered architecture

Multiple cores share L1

Cores see private L1

L1 still provides single-cycle latency

Advantages:

Less coherence/snoop traffic

Larger cache for processes that need it

Drawbacks:

Core conflicts evicting L1 data

Not dominant in simulation

Longer interconnect

3D addressable

[Figure: two organizations side by side. In one, each Core has its own L1 and connects over a BUS / Switched Network to Next Level Memory. In the other, Clusters of four Cores share an L1 Shared Cache and connect over a BUS / Switched Network to Next Level Memory]

Proposed Boosting Approach

Measured results for 130nm LP design

10MHz becomes ~110MHz in 32nm simulation

140 FO4 delay core

Baseline

Cache runs 4x core frequency

Pipelined cache

Better Single Thread Performance

Turn some cores off, speed up the rest

Cache de-pipelined

Faster response time, same throughput

Core sees larger cache

Faster cores need larger caches

[Figure: one cluster (four Cores sharing an L1) in three configurations]

Baseline: 4 Cores @ 10MHz (650mV), Cache @ 40MHz (800mV)

4x: 1 Core @ 40MHz (850mV), Cache @ 80MHz (1.15V)

8x: 1 Core @ 80MHz (1.15V), Cache @ 160MHz (1.65V)
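The three cluster configurations on this slide can be compared with a few lines of arithmetic. This is a rough sketch that uses clock frequency as a stand-in for performance, which is an assumption; the class and function names are hypothetical, and only the core counts and frequencies come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    name: str
    active_cores: int
    core_mhz: float
    cache_mhz: float

    @property
    def aggregate_core_mhz(self):
        """Sum of core clock rates, a crude proxy for cluster throughput."""
        return self.active_cores * self.core_mhz

    @property
    def cache_to_core_ratio(self):
        """How many cache cycles fit in one core cycle."""
        return self.cache_mhz / self.core_mhz


CONFIGS = [
    ClusterConfig("baseline", 4, 10, 40),
    ClusterConfig("boost 4x", 1, 40, 80),
    ClusterConfig("boost 8x", 1, 80, 160),
]


def single_thread_speedup(cfg, baseline=CONFIGS[0]):
    """Core clock relative to the baseline core clock."""
    return cfg.core_mhz / baseline.core_mhz


if __name__ == "__main__":
    for cfg in CONFIGS:
        print(cfg.name, single_thread_speedup(cfg),
              cfg.aggregate_core_mhz, cfg.cache_to_core_ratio)
```

The 4x configuration keeps the same aggregate core rate as the baseline (one core at 40 MHz versus four at 10 MHz), matching "faster response time, same throughput", while the cache-to-core ratio drops from 4 to 2.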

Cache Timing

[Figure: Tag Array and Data Array lookups with tag comparison (=), shown for both modes]

NTC Mode (3/4 Cores)

Low power

Tag arrays read first

0-1 data arrays accessed

Boost Mode (1/2 Cores)

Low latency

Data and tags read in parallel

4 data arrays accessed
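The difference between the two modes can be illustrated with a toy model of a single cache set. This is a sketch under stated assumptions, not the actual cache controller: it assumes a 4-way set (inferred from "4 data arrays accessed"), counts data-array reads per lookup, and counts sequential array-access steps as a stand-in for latency. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LookupResult:
    hit: bool
    data_arrays_read: int   # how many data arrays were activated
    sequential_steps: int   # array accesses that must happen one after another


def lookup(set_tags: Sequence[Optional[int]], tag: int, mode: str) -> LookupResult:
    """Model one lookup in a set whose ways hold the tags in set_tags."""
    hit = tag in set_tags
    ways = len(set_tags)
    if mode == "ntc":
        # Tags are read first; a data array is read only for the matching way.
        return LookupResult(hit, 1 if hit else 0, 2 if hit else 1)
    if mode == "boost":
        # Tags and every way's data array are read together in one step.
        return LookupResult(hit, ways, 1)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    tags = [5, 9, 2, 7]
    print(lookup(tags, 9, "ntc"))
    print(lookup(tags, 3, "ntc"))
    print(lookup(tags, 9, "boost"))
```

In this model NTC mode reads at most one data array but needs two sequential steps on a hit, while boost mode finishes in one step at the cost of reading all four data arrays.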

Cache Timing

NTC Mode (3/4 Cores)

Low power

Tag arrays read first

0-1 data arrays accessed

[Figure: NTC-mode timing diagram across clock Edges A, B, C, D. Core pipeline: IF/DE Stage, EX Stage, Cache Access, MEM Stage. Cache activity: Other Access, Other Access, Tag Read, Tag Comp, Data Read, Idle]

Cache Timing

Boost Mode (1/2 Cores)

Low latency

Data and tags read in parallel

4 data arrays accessed

[Figure: Boost-mode timing diagram across clock Edges A, B. Core pipeline: IF/DE Stage, EX Stage, Cache Access, MEM Stage. Cache activity: Other Access, Other Access, Tag & Data Read, Tag Compare & Mem Access]


Centip3De System Overview

7-Layer NTC system

2-Layer system completed fabrication with measured results

Full 7-layer system expected End of 2012

[Figure: layer stack with a "Measured" label]

Centip3De System Overview

Cluster architecture

4 Cores/cluster

1kB I$, 8kB D$

Local clock controller operates cores 90˚ out-of-phase

1591 F2F connections per cluster

Organized into layer pairs (cache-core)

Minimizes routing

Up to two pairs

16 clusters per pair

Cores have only vertical interconnections

[Figure: Cluster, x32]
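The cluster figures on this slide multiply out to the core counts quoted elsewhere in the talk. A minimal sketch of that arithmetic follows; the constants come from the slide, while the function name and the dictionary layout are hypothetical.

```python
CORES_PER_CLUSTER = 4
CLUSTERS_PER_PAIR = 16
F2F_PER_CLUSTER = 1591


def system_totals(layer_pairs):
    """Totals for a stack built from the given number of cache-core layer pairs."""
    clusters = CLUSTERS_PER_PAIR * layer_pairs
    return {
        "clusters": clusters,
        "cores": clusters * CORES_PER_CLUSTER,
        "cluster_f2f": clusters * F2F_PER_CLUSTER,
    }


if __name__ == "__main__":
    for pairs in (1, 2):
        print(pairs, system_totals(pairs))
```

One layer pair gives 16 clusters and 64 cores; two pairs give the 32 clusters shown in the figure and 128 cores.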

Centip3De System Overview

Bus interconnect architecture

Up to 500 MHz

9-11 cycle latency

1-3 core cycles

8 lanes, each 128b

One per DRAM interface

Each cluster connects to all eight

1024b total

Vertically connected through all four layers

Flipping interface enables 128-core system

[Figure: Bus System]
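The bus width and an upper bound on its bandwidth follow directly from the lane count, lane width, and clock rate. The sketch below assumes one transfer per lane per bus cycle, which the slide does not state, so the bandwidth figure is only a rough peak estimate. Function names are hypothetical.

```python
LANES = 8
LANE_WIDTH_BITS = 128


def bus_width_bits(lanes=LANES, lane_width=LANE_WIDTH_BITS):
    """Total bus width across all lanes."""
    return lanes * lane_width


def peak_bandwidth_gbytes_per_s(bus_mhz, transfers_per_cycle=1):
    """Peak bandwidth in GB/s (10^9 bytes/s).

    Assumption: every lane moves one full word per transfer, with
    transfers_per_cycle transfers in each bus clock cycle.
    """
    bits_per_second = bus_width_bits() * bus_mhz * 1e6 * transfers_per_cycle
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    print(bus_width_bits(), "bits")
    print(peak_bandwidth_gbytes_per_s(500), "GB/s at 500 MHz")
```

Eight 128b lanes give the 1024b total on the slide; at the 500 MHz maximum this model yields a peak of 64 GB/s.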

Centip3De System Overview

3D-Stacked DRAM

Tezzaron Octopus

1 control layer

130nm CMOS

1 Gb bitcell layers

Up to two layers

DRAM process

8x 128b DDR2 interfaces

Operated at bus frequency (up to 500 MHz)

[Figure: DRAM System]

Centip3De System Overview

[Figure: layer stack annotated with inter-layer connection counts: 28485 F2F, 3024 B2B, 28485 F2F, 3624 B2B]


Centip3De System Overview

130nm process

12.66x5mm per layer

28.4M device core layer

18.0M device cache layer

2-Layer Stacking Process Evaluated

For the measured 2-layer system, aluminum wirebond pads were used instead

Aluminum wirebonding pads connected to perimeter TSVs like for 7-layer

[Figure: cross-section of the Core Layer and Cache Layer joined F2F, with Wirebonds; device labels N, P]

Cache 3D Connections

[Figure: cache-layer floorplan showing SRAM blocks, Cache, Bus Interface, and Sea of Gates]

Core 3D Connections

[Figure: core-layer floorplan showing Core 0, Core 1, Core 2, Core 3, and Sea of Gates]

Cluster 3D Connections

1591 F2F Connections

Each saved ~600-1000um in routing

Prevented wiring congestion around SRAMs

Silicon Results

[Figure: floorplan of the full stack. Layer labels: Top Core Layer, Top Cache Layer, Bottom Cache Layer, Bottom Core Layer, DRAM Control Layer, DRAM Bitcell Layer (two), Tezzaron Octopus DRAM. Cache-layer blocks: I$/D$ [00] to [07], [08] to [15], [16] to [23], [24] to [31], Cache Bus Hub (8x8 Crossbar), DRAM Bus Hub (8x4 Crossbar), Flipping Interface, DRAM Interface. Core-layer blocks: Cortex M3 cores, with labels including [000] to [003], [028] to [035], [060] to [067], [092] to [099], and [124] to [127]. Annotation: Disabled Due To Redundancy]

Die Shot

Looking through back of core-layer

[Figure: die photo with labels: Aluminum wirebond pads, DRAM Interface/Bus Hub, 4-Core Cluster]

130nm process

12.66x5mm per layer

28.4M device core layer

18.0M device cache layer

System Configurations

4 Core Mode (0 Cores Boosted, 0 Cores Gated): Cache Bus Hub 160 MHz, 1.15 Volts; I$/D$ (Div 4x) 40 MHz, 0.80 Volts; M3 cores (Div 4x) 10 MHz, 0.65 Volts

3 Core Mode (3 Cores Boosted, 1 Core Gated): Cache Bus Hub 160 MHz, 1.15 Volts; I$/D$ (Div 2x) 80 MHz, 1.15 Volts; M3 cores (Div 4x) 20 MHz, 0.75 Volts

2 Core Mode (2 Cores Boosted, 2 Cores Gated): Cache Bus Hub 160 MHz, 1.15 Volts; I$/D$ (Div 2x) 80 MHz, 1.15 Volts; M3 cores (Div 2x) 40 MHz, 0.85 Volts

1 Core Mode (1 Core Boosted, 3 Cores Gated): Cache Bus Hub 320 MHz, 1.6 Volts; I$/D$ (Div 2x) 160 MHz, 1.65 Volts; M3 cores (Div 2x) 80 MHz, 1.15 Volts
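The frequencies in each mode are consistent with a chain of clock dividers: the cache bus hub clock divided once for the I$/D$ and divided again for the M3 cores. That reading is inferred from the numbers on the slide rather than stated by it, so the sketch below is an illustration of the arithmetic only; the function and table names are hypothetical.

```python
def clock_tree(hub_mhz, cache_div, core_div):
    """Return (cache MHz, core MHz) for a hub clock and two successive dividers."""
    cache_mhz = hub_mhz / cache_div
    core_mhz = cache_mhz / core_div
    return cache_mhz, core_mhz


# (hub MHz, hub-to-cache divider, cache-to-core divider) as listed per mode
MODES = {
    "4 Core Mode": (160, 4, 4),
    "3 Core Mode": (160, 2, 4),
    "2 Core Mode": (160, 2, 2),
    "1 Core Mode": (320, 2, 2),
}

if __name__ == "__main__":
    for mode, (hub, cache_div, core_div) in MODES.items():
        cache_mhz, core_mhz = clock_tree(hub, cache_div, core_div)
        print(f"{mode}: cache {cache_mhz:g} MHz, core {core_mhz:g} MHz")
```

Each mode's cache and core frequencies on the slide fall out of the hub frequency and the two listed dividers.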

Measured Results

[Figure: Power (mW) by System Configuration, stacked as Core Power, Cache Power, and Memory System Power]

4-Core: core 25.9, cache 65.0, memory system 113, total 203

3-Core: core 51.2, cache 175, memory system 113, total 339

2-Core: core 84.0, cache 266, memory system 113, total 463

1-Core: core 155, cache 1225, memory system 471, total 1851

Boosting a single cluster to 1-core mode requires disabling, or down-boosting, other clusters

1-core cluster:

= 15x 4-core clusters

= 6x 3-core clusters

= 4.5x 2-core clusters

Baseline configuration depends on TDP and processing needs

Measured Results

[Figure: Single-Threaded Performance (DMIPS) by System Configuration]

4-Core: 12.5

3-Core: 25

2-Core: 50

1-Core: 100

[Figure: Power (mW) by System Configuration, stacked as Core Power, Cache Power, and Memory System Power; totals of 203, 339, 463, and 1851 for the 4-Core, 3-Core, 2-Core, and 1-Core configurations]

Measured Results

[Figure: Efficiency (DMIPS/Watt) by System Configuration]

4-Core: 3930

3-Core: 3540

2-Core: 3460

1-Core: 860

Measured Results:

Centip3De – 3,930 (130nm)

Industry Comparison:

ARM A9 – 8,000 (40nm) [1]

Estimated Results:

Centip3De – 18,500 (45nm)

[1] http://arm.com/products/processors/cortex-a/cortex-a9.php, ARM Ltd, 2011.
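The efficiency figures can be cross-checked against the performance and power charts in the Measured Results slides. The sketch below assumes the measured system has 16 clusters (64 cores), that every cluster runs in the same mode, and that the chart values pair with configurations as 12.5/25/50/100 DMIPS per core and 203/339/463/1851 mW total for the 4-, 3-, 2-, and 1-Core modes. Those pairings are read off the charts, and the names in the code are hypothetical.

```python
CLUSTERS = 16  # assumption: one cache-core layer pair, 64 cores in total

# mode: (active cores per cluster, DMIPS per core, total system power in mW)
CONFIGS = {
    "4-Core": (4, 12.5, 203),
    "3-Core": (3, 25.0, 339),
    "2-Core": (2, 50.0, 463),
    "1-Core": (1, 100.0, 1851),
}


def dmips_per_watt(active_cores, dmips_per_core, total_mw, clusters=CLUSTERS):
    """System DMIPS divided by total system power in watts."""
    total_dmips = clusters * active_cores * dmips_per_core
    return total_dmips / (total_mw / 1000.0)


if __name__ == "__main__":
    for mode, values in CONFIGS.items():
        print(f"{mode}: {dmips_per_watt(*values):.0f} DMIPS/W")
```

Under these assumptions the computed values land within about 1% of the 3930, 3540, 3460, and 860 DMIPS/Watt shown in the efficiency chart.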

Conclusion

Near threshold computing (NTC)

Need low power solutions to maintain TDP

Achieves 10x energy efficiency => 10x more computation for a given TDP

Offers optimum balance between performance and energy

Allows boosting for single threaded performance (Amdahl's law)

Large scale 3D CMP demonstrated

64 cores currently

128 cores + DRAM in the future

3D design shown to be feasible

This work was funded and organized with the help of DARPA, Tezzaron, ARM, and the National Science Foundation

