Date post: 30-Jun-2020
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi. University of Utah & HP Labs
Transcript
Page 1

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Naveen Muralimanohar

Rajeev Balasubramonian

Norman P. Jouppi

University of Utah & HP Labs

Page 2

Large Caches

Cache hierarchies will dominate chip area

3D stacked processors with an entire die for on-chip cache could be common

Intel Montecito has two private 12 MB L3 caches (27 MB including L2)

Long global wires are required to transmit data/address

Page 3

Wire Delay/Power

Wire delays are costly for performance and power

- Latencies of 60 cycles to reach ends of a chip at 32nm (@ 5 GHz)

- 50% of dynamic power is in interconnect switching (Magen et al. SLIP 04)

CACTI* access time for 24 MB cache is 90 cycles @ 5 GHz, 65nm Tech

*version 4
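As a quick sanity check on these figures, cycle counts can be converted to wall-clock time at the stated clock frequency. This is only a unit conversion; the helper name `cycles_to_ns` is hypothetical, and the inputs are the slide's own numbers.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at a clock given in GHz."""
    # At f GHz one cycle lasts 1/f ns, so N cycles last N/f ns.
    return cycles / freq_ghz


# Figures quoted on the slide.
cross_chip_ns = cycles_to_ns(60, 5.0)    # reaching the ends of a chip at 32nm
cache_access_ns = cycles_to_ns(90, 5.0)  # CACTI 4 access time, 24 MB cache, 65nm

print(f"cross-chip wire latency: {cross_chip_ns} ns")
print(f"24 MB cache access time: {cache_access_ns} ns")
```

At 5 GHz, 60 cycles is 12 ns and 90 cycles is 18 ns.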

Page 4

Contribution

Support for various interconnect models

Improved design space exploration

Support for modeling Non-Uniform Cache Access (NUCA)

Page 5

Cache Design Basics

[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]

Page 6

Existing Model - CACTI

[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]

Decoder delay = H-tree delay + logic delay

Page 7

Power/Delay Overhead of Wires

[Figure: H-tree delay percentage and H-tree power percentage (0% to 70%) versus cache size (2, 4, 8, 16, 32 MB)]

H-tree delay increases with cache size

H-tree power continues to dominate

Bitlines are other major contributors to total power

Page 8

Motivation

The dominant role of interconnect is clear

Lack of tool to model interconnect in detail can impede progress

Current solutions have limited wire options

Orion, CACTI

- Weak wire model

- No support for modeling multi-megabyte caches

Page 9

CACTI 6.0 Enhancements

Incorporation of
- Different wire models
- Different router models
- Grid topology for NUCA
- Shared bus for UCA
- Contention values for various cache configurations

Methodology to compute optimal NUCA organization

Improved interface that enables trade-off analysis

Validation analysis

Page 10

Full-swing Wires

[Figure: full-swing wire; labels X, Y, Z]

Page 11

Full-swing Wires II

[Figure: three different design points (10% delay penalty, 20% delay penalty, 30% delay penalty); axis: repeater size]

Caveat: Repeater sizing and spacing cannot be controlled precisely all the time
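One way to read the three design points is as the lowest-energy repeater configuration whose delay stays within a given penalty of the fastest one. The sketch below assumes that reading. The `WireConfig` type, the `pick_design_point` helper, and every number in `candidates` are hypothetical illustrations, not CACTI output.

```python
from typing import NamedTuple


class WireConfig(NamedTuple):
    repeater_size: float  # relative to a minimum-sized inverter (illustrative)
    delay: float          # arbitrary units
    energy: float         # arbitrary units


def pick_design_point(configs, penalty):
    """Return the lowest-energy config whose delay is within `penalty` of the best delay."""
    best_delay = min(c.delay for c in configs)
    allowed = [c for c in configs if c.delay <= best_delay * (1.0 + penalty)]
    return min(allowed, key=lambda c: c.energy)


# Made-up candidates: smaller repeaters are slower but cheaper in energy.
candidates = [
    WireConfig(80, 1.00, 10.0),
    WireConfig(60, 1.08, 7.5),
    WireConfig(40, 1.18, 5.5),
    WireConfig(25, 1.29, 4.0),
    WireConfig(15, 1.45, 3.2),
]

for penalty in (0.10, 0.20, 0.30):
    print(f"{penalty:.0%} delay penalty -> {pick_design_point(candidates, penalty)}")
```

Accepting a larger delay penalty admits smaller repeaters, which is where the energy saving comes from in this toy data.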

Page 12

Full-Swing Wires

+ Fast and simple

+ Delay proportional to sqrt(RC) as against RC

+ High bandwidth

+ Can be pipelined

- Requires silicon area

- High energy

- Quadratic dependence on voltage
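The contrast between RC and sqrt(RC) delay can be illustrated with a first-order model. This is a sketch under simplifying assumptions, not CACTI's wire model: the unrepeated wire is treated as a distributed RC line with delay 0.38·r·c·L², and the repeated wire is split into k equal segments, each followed by a repeater with a fixed delay `t_rep`. The function names and all parameter values are hypothetical.

```python
import math


def unrepeated_delay(length_mm, r, c):
    """Distributed RC line: delay grows with the square of the length."""
    return 0.38 * r * c * length_mm ** 2


def repeated_delay(length_mm, r, c, t_rep):
    """Wire cut into k equal segments, each followed by a repeater of delay t_rep.

    k is an integer chosen next to the analytical optimum
    k_opt = L * sqrt(0.38 * r * c / t_rep).
    """
    k_opt = length_mm * math.sqrt(0.38 * r * c / t_rep)
    candidates = {max(1, math.floor(k_opt)), max(1, math.ceil(k_opt))}
    return min(
        k * (0.38 * r * c * (length_mm / k) ** 2 + t_rep) for k in candidates
    )


# Illustrative values: r in ohm/mm, c in pF/mm, delays in ps (ohm * pF = ps).
R_PER_MM, C_PER_MM, T_REP = 300.0, 0.2, 20.0

for length in (10, 20):
    print(
        length, "mm:",
        round(unrepeated_delay(length, R_PER_MM, C_PER_MM), 1), "ps unrepeated,",
        round(repeated_delay(length, R_PER_MM, C_PER_MM, T_REP), 1), "ps repeated",
    )
```

In this model the optimal repeated delay is about 2·L·sqrt(0.38·r·c·t_rep), so it grows linearly with length and with the square root of the wire's RC product, while the unrepeated delay quadruples when the length doubles.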

Page 13

Low-swing wires

[Figure: differential wires at a 400 mV level; one wire shows a 50 mV raise and the other a 50 mV drop]

Page 14

Differential Low-swing

+ Very low-power, can be routed over other modules

- Relatively slow, low bandwidth, high area requirement, requires special transmitter and receiver

Bitlines are a form of low-swing wire

Optimized for speed and area as against power

Driver and pre-charger employ full Vdd voltage
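The energy argument behind low-swing signalling can be sketched with the standard expression for charging a capacitance: the energy drawn from the supply is C · V_supply · V_swing. The capacitance, supply, and swing values below are illustrative assumptions, not values taken from the slides or from CACTI.

```python
def switching_energy(cap_farads, v_supply, v_swing):
    """Energy drawn from the supply to raise a capacitance by v_swing volts."""
    return cap_farads * v_supply * v_swing


CAP = 1e-12   # 1 pF of wire capacitance (illustrative)
VDD = 1.0     # full supply voltage in volts (illustrative)
SWING = 0.1   # reduced signal swing in volts (illustrative)

# Full-swing wire: swing equals the supply, giving the quadratic C * Vdd^2 term.
full_swing = switching_energy(CAP, VDD, VDD)
# Reduced swing but a driver fed from full Vdd, as the slide notes for bitlines.
low_swing_vdd_driver = switching_energy(CAP, VDD, SWING)
# Reduced swing with the driver on a separate low supply equal to the swing.
low_swing_low_supply = switching_energy(CAP, SWING, SWING)

for name, joules in [
    ("full swing", full_swing),
    ("low swing, Vdd driver", low_swing_vdd_driver),
    ("low swing, low supply", low_swing_low_supply),
]:
    print(f"{name}: {joules:.2e} J")
```

With these numbers a tenfold reduction in swing saves a factor of ten when the driver still runs from Vdd, and a factor of one hundred when the driver supply is lowered as well.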

Page 15

Delay Characteristics

Quadratic increase in delay

Page 16

Energy Characteristics

Page 17

Search Space of CACTI-5

Design space with global wires optimized for delay

Page 18

Search Space of CACTI-6

[Figure labels: low-swing, 30% delay penalty, least delay]

Design space with global and low-swing wires

Page 19

CACTI – Another Limitation

Access delay is equal to the delay of slowest sub-array

- Very high hit time for large caches

- Potential solution – NUCA

- Extend CACTI to model NUCA

Employs a separate bus for each cache bank for multi-banked caches

- Not scalable

- Exploit different wire types and network design choices to improve the search space

Page 20

Non-Uniform Cache Access (NUCA)*

Large cache is broken into a number of small banks

Employs on-chip network for communication

Access delay ∝ (distance between bank and cache controller)

[Figure: CPU & L1 attached to a grid of cache banks]

*(Kim et al. ASPLOS 02)
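The distance-dependent access delay can be sketched for a grid of banks. The sketch assumes the controller sits at one corner of the grid, that distance is counted as Manhattan hops, and that each hop costs a fixed number of cycles. The function name and the cycle counts are hypothetical.

```python
def nuca_latencies(rows, cols, bank_cycles, hop_cycles, controller=(0, 0)):
    """Per-bank access latency in a rows x cols grid of banks.

    Latency = bank access time + hop cost * Manhattan distance to the controller.
    """
    ctrl_row, ctrl_col = controller
    return [
        [
            bank_cycles + hop_cycles * (abs(r - ctrl_row) + abs(c - ctrl_col))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Illustrative 4x4 grid: 3-cycle banks, 4 cycles per hop (router plus link).
grid = nuca_latencies(rows=4, cols=4, bank_cycles=3, hop_cycles=4)
for row in grid:
    print(row)

flat = [latency for row in grid for latency in row]
print("nearest:", min(flat), "farthest:", max(flat), "average:", sum(flat) / len(flat))
```

The nearest bank costs only the bank access time, while the farthest bank adds six hops, which is the non-uniformity the name refers to.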

Page 21

Extension to CACTI

On-chip network
- Wire model based on ITRS 2005 parameters
- Grid network
- 3-stage speculative router pipeline

Network latency vs Bank access latency tradeoff
- Iterate over different bank sizes
- Calculate the average network delay based on the number of banks and bank sizes
- Consider contention values for different cache configurations

Similarly we also consider power consumed for each organization
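The iteration described above can be sketched as a sweep over bank counts that sums bank access latency, average network delay, and contention, then keeps the minimum. This is a toy under stated assumptions, not CACTI's implementation: the grid is near-square with a corner controller, and the `bank_latency` and `contention` models passed in below are made-up placeholders.

```python
import math


def avg_hops(num_banks):
    """Average Manhattan distance from a corner controller in a near-square grid."""
    rows = math.isqrt(num_banks)
    while num_banks % rows:
        rows -= 1
    cols = num_banks // rows
    return (rows - 1) / 2 + (cols - 1) / 2


def best_organization(cache_mb, bank_counts, bank_latency, contention, hop_cycles):
    """Return (best bank count, {bank count: total cycles})."""
    totals = {}
    for n in bank_counts:
        bank_mb = cache_mb / n  # more banks means smaller, faster banks
        totals[n] = bank_latency(bank_mb) + hop_cycles * avg_hops(n) + contention(n)
    return min(totals, key=totals.get), totals


# Placeholder cost models (assumptions for illustration only).
best, totals = best_organization(
    cache_mb=32,
    bank_counts=[2, 4, 8, 16, 32, 64],
    bank_latency=lambda mb: 2 + 10 * math.sqrt(mb),
    contention=lambda n: 64 / n,
    hop_cycles=4,
)
for n, cycles in totals.items():
    print(f"{n:2d} banks: {cycles:.1f} cycles")
print("best bank count:", best)
```

With these placeholder models, few banks are penalized by slow large banks and contention, many banks are penalized by network hops, and the minimum falls in between. The same loop could minimize energy instead by swapping in energy models.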

Page 22

Trade-off Analysis (32 MB Cache)

[Figure: latency (cycles, 0 to 400) versus number of banks (2, 4, 8, 16, 32, 64) for a 16-core CMP; series: total number of cycles, network latency, bank access latency, network contention cycles]

Page 23

Effect of Core Count

[Figure: contention cycles (0 to 300) versus bank count (2, 4, 8, 16, 32, 64); series: 16-core, 8-core, 4-core]

Page 24

Power Centric Design (32MB Cache)

[Figure: energy (J, 0 to 1.E-08) versus bank count (2, 4, 8, 16, 32, 64); series: total energy, bank energy, network energy; the power-optimal point is marked]

Page 25

Validation

HSPICE tool

Predictive Technology Model (65nm tech.)

Analytical model that employs PTM parameters compared against HSPICE

Distributed wordlines, bitlines, low-swing transmitters, wires, receivers

Verified to be within 12%

Page 26

Case Study: Heterogeneous D-NUCA

Dynamic-NUCA
- Reduces access time by dynamic data movement
- Near-by banks are accessed more frequently

Heterogeneous Banks
- Near-by banks are made smaller and hence faster
- Access to nearby banks consumes less power
- Other banks can be made larger and more power efficient
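The benefit of heterogeneous banks follows from weighting each bank group's cost by how often it is accessed. The sketch below shows that arithmetic. The hit fractions and latencies are invented for illustration and are not measurements from the study.

```python
def expected_cost(hit_fraction, per_group_cost):
    """Frequency-weighted average cost (latency or energy) over bank groups."""
    if abs(sum(hit_fraction) - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    return sum(f * cost for f, cost in zip(hit_fraction, per_group_cost))


# Illustrative numbers only: near, middle, and far bank groups.
hit_fraction = [0.7, 0.2, 0.1]       # D-NUCA migrates hot data to near banks
homogeneous_cycles = [10, 18, 26]    # equal-sized banks everywhere
heterogeneous_cycles = [6, 18, 30]   # smaller near banks, larger far banks

homogeneous = expected_cost(hit_fraction, homogeneous_cycles)
heterogeneous = expected_cost(hit_fraction, heterogeneous_cycles)
print(f"homogeneous: {homogeneous:.1f} cycles, heterogeneous: {heterogeneous:.1f} cycles")
```

Because most requests hit the near banks, making those banks faster lowers the average even though the far banks become slower.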

Page 27

Access Frequency

[Figure: % request satisfied by x KB of cache (y-axis 0% to 120%)]

Page 28

Few Heterogeneous Organizations Considered by CACTI

[Figure: Model 1 and Model 2]

Page 29

Other Applications

Exposing wire properties

Novel cache pipelining
- Early lookup, Aggressive lookup (ISCA 07)

Flit-reservation flow control (Peh et al., HPCA 00)

Novel topologies
- Hybrid network (ISCA 07)

Page 30

Conclusion

Network parameters and contention play a critical role in deciding NUCA organization

Wire choices have significant impact on cache properties

CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%

http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html

http://www.cs.utah.edu/~rajeev/cacti6/

