
University of Utah

The Effect of Interconnect Design on the Performance of Large L2 Caches

Naveen Muralimanohar Rajeev Balasubramonian


Motivation: Large Caches

Future processors will have large on-chip caches
  - Intel Montecito has 24 MB of on-chip cache

Wire delay dominates in large caches
  - Conventional design can lead to very high hit time (CACTI access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology)

Careful network choices
  - Improve access time
  - Open room for several other optimizations
  - Reduce power significantly


Effect of L2 Hit Time

[Figure: Increase in IPC due to reduction in L2 access time. X-axis: benchmarks (ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise). Y-axis: IPC improvement, 0% to 50%.]

8-issue, out-of-order processor (L2 hit time reduced from 30 to 15 cycles)

Avg = 17%


Cache Design

[Figure: cache organization. Components shown: input address, decoder, wordlines, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, and output drivers. Outputs: "Valid output?" and "Data output".]


Existing Model - CACTI

[Figure: two panels, a cache model with 4 sub-arrays and a cache model with 16 sub-arrays, each annotated with its decoder delay and its wordline & bitline delay.]


Shortcomings

CACTI
  - Suboptimal for large cache sizes
  - Access delay is equal to the delay of the slowest sub-array
  - Very high hit time for large caches
  - Employs a separate bus for each cache bank in multi-banked caches


Non-Uniform Cache Access (NUCA)

Large cache is broken into a number of small banks

Employs an on-chip network for communication

Access delay depends on the distance between the bank and the cache controller

[Figure: CPU & L1 connected to a grid of cache banks]


Shortcomings

NUCA
  - Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS 02)
  - Increased routing complexity
  - Dissipates more power


Extension to CACTI

On-chip network
  - Wire model uses ITRS 2005 parameters

Grid network
  - No. of rows = no. of columns (or ½ the no. of columns)

Network latency vs. bank access latency trade-off
  - Modified the exhaustive search to include the network overhead
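
The grid rule and the extended search can be sketched in a few lines of Python. This is only an illustration, not the modified CACTI code: the function names are hypothetical, the controller position used to count hops is an assumption, and the bank-access and per-hop latency models are left as caller-supplied functions standing in for the detailed CACTI and wire models.

```python
def grid_dims(bank_count):
    """Return (rows, cols) with rows == cols or rows == cols / 2.

    Assumes bank_count is a power of two.
    """
    if bank_count < 1 or bank_count & (bank_count - 1):
        raise ValueError("bank_count must be a power of two")
    exp = bank_count.bit_length() - 1   # log2(bank_count)
    rows = 1 << (exp // 2)              # the smaller grid dimension
    cols = bank_count // rows
    return rows, cols


def avg_hops(rows, cols):
    """Average hop count from the cache controller to a bank.

    Assumption (not from the slides): the controller sits just outside the
    grid next to bank (0, 0), so bank (r, c) is r + c + 1 hops away.
    """
    total = sum(r + c + 1 for r in range(rows) for c in range(cols))
    return total / (rows * cols)


def best_bank_count(candidates, bank_access_cycles, hop_cycles):
    """Exhaustive search: minimize bank access time plus average network delay.

    bank_access_cycles(n) and hop_cycles(n) are hypothetical callbacks that
    return the bank access time and per-hop link latency for n banks.
    """
    def total(n):
        rows, cols = grid_dims(n)
        return bank_access_cycles(n) + avg_hops(rows, cols) * hop_cycles(n)

    return min(candidates, key=total)
```

More banks make each bank faster but lengthen the average route, so the search settles on an intermediate bank count rather than either extreme.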


Effect of Network Delay (32MB cache)

[Figure: cycles (at 5 GHz) vs. bank count (2 to 4096). Curves: bank access time, average network delay, and average cache access latency (global wires); the delay-optimal point is marked. Y-axis: 0 to 140 cycles.]


Outline

Overview
Cache Design
Effect of Network Delay
Wire Design Space
Exploiting Heterogeneous Wires
Results


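Wire Characteristics

Wire resistance and capacitance per unit length:

$$R_{wire} = \frac{\rho}{(\text{thickness} - \text{Barrier})(\text{width} - 2\,\text{Barrier})}$$

$$C_{wire} = \epsilon_0 \left( 2K\epsilon_{horiz}\,\frac{\text{thickness}}{\text{spacing}} + 2\epsilon_{vert}\,\frac{\text{width}}{\text{layerspacing}} \right) + \text{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

[Figure: effect of wire width and spacing on resistance, capacitance, and bandwidth]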
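
To see how width and spacing move resistance and capacitance, the per-unit-length model can be evaluated numerically. The sketch below assumes the usual form of that model: resistance is resistivity divided by the conductor cross-section left after the barrier layer, and capacitance is a sidewall term plus a vertical term plus fringe. The geometry and material values are illustrative placeholders, not ITRS 2005 numbers, and `k` is taken to be a coupling factor.

```python
EPS0 = 8.854e-12  # vacuum permittivity, F/m


def wire_resistance(rho, width, thickness, barrier):
    """Resistance per unit length (ohm/m); the barrier shrinks the conductor."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance(width, spacing, thickness, layer_spacing,
                     eps_horiz, eps_vert, k=1.0, fringe=0.0):
    """Capacitance per unit length (F/m): sidewall + vertical + fringe."""
    sidewall = 2 * k * eps_horiz * thickness / spacing
    vertical = 2 * eps_vert * width / layer_spacing
    return EPS0 * (sidewall + vertical) + fringe


# Illustrative geometry in metres (assumed values, not from the slides).
base = dict(width=200e-9, spacing=200e-9, thickness=400e-9,
            layer_spacing=400e-9, eps_horiz=2.7, eps_vert=2.7)
wide = dict(base, width=400e-9, spacing=400e-9)  # doubled width and spacing

RHO = 2.2e-8      # ohm*m, assumed resistivity
BARRIER = 10e-9   # m, assumed barrier thickness

r_base = wire_resistance(RHO, base["width"], base["thickness"], BARRIER)
r_wide = wire_resistance(RHO, wide["width"], wide["thickness"], BARRIER)
c_base = wire_capacitance(**base)
c_wide = wire_capacitance(**wide)

print(f"R ratio (wide/base): {r_wide / r_base:.2f}")
print(f"C ratio (wide/base): {c_wide / c_base:.2f}")
```

With these placeholder values, doubling both width and spacing lowers resistance and capacitance per unit length, at the cost of fitting fewer wires in the same metal area.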


Design Space Exploration

Tuning wire width and spacing
  - Base case: B-wires
  - L-wires (increased width & spacing): fast but low bandwidth

[Figure: wire cross-sections for B-wires and L-wires, with delay and bandwidth indicators]


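Design Space Exploration

Tuning repeater size and spacing
  - Traditional wires: large repeaters, optimum spacing
  - Power-optimal wires: smaller repeaters, increased spacing

[Figure: repeater placement for traditional and power-optimal wires, with delay and power indicators]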


Design Space Exploration

Wire type                       Plane   Latency   Power   Area
B-wires (base case)             8x      1x        1x      1x
W-wires (base case)             4x      1.6x      0.9x    0.5x
PW-wires (power optimized)      4x      3.2x      0.3x    0.5x
L-wires (fast, low bandwidth)   8x      0.5x      0.5x    5x
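
The relative figures above translate directly into a bandwidth cost. The sketch below is a simplified reading, not part of the original study: it treats "area" as metal area per wire relative to a B-wire, ignores the fact that the wire types sit on different metal planes, and uses a hypothetical helper name.

```python
# Relative latency, power, and area per wire, normalized to B-wires.
WIRES = {
    "B":  {"latency": 1.0, "power": 1.0, "area": 1.0},
    "W":  {"latency": 1.6, "power": 0.9, "area": 0.5},
    "PW": {"latency": 3.2, "power": 0.3, "area": 0.5},
    "L":  {"latency": 0.5, "power": 0.5, "area": 5.0},
}


def wires_in_area(kind, area_in_b_wires):
    """How many wires of this kind fit in the metal area of N B-wires."""
    return int(area_in_b_wires // WIRES[kind]["area"])


for kind in WIRES:
    print(kind, wires_in_area(kind, 64))
```

Under these assumptions, the area of 64 B-wires holds only 12 L-wires, which is why L-wires suit short, latency-critical messages such as partial addresses rather than full data transfers.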


Access time for different link types

                               Avg access time
Bank Count   Bank Access Time   8x-wires   4x-wires   L-wires
16           17                 46         75         21
32           9                  40         71         15
64           6                  38         63         14
128          5                  44         68         17
256          4                  51         83         20
512          3                  82         113        27
1024         3                  100        133        35
2048         3                  99         162        51
4096         3                  131        196        67
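
The delay-optimal bank count for each link type can be read off the table mechanically. The snippet below only re-encodes the table rows; the units are assumed to be cycles, and the helper name is hypothetical.

```python
# (bank count, bank access time, avg access time on 8x-, 4x-, and L-wires)
# Values copied from the table; units assumed to be cycles.
TABLE = [
    (16, 17, 46, 75, 21),
    (32, 9, 40, 71, 15),
    (64, 6, 38, 63, 14),
    (128, 5, 44, 68, 17),
    (256, 4, 51, 83, 20),
    (512, 3, 82, 113, 27),
    (1024, 3, 100, 133, 35),
    (2048, 3, 99, 162, 51),
    (4096, 3, 131, 196, 67),
]
LINK_COLUMN = {"8x": 2, "4x": 3, "L": 4}


def delay_optimal(link):
    """Return (bank count, avg access time) at the minimum for this link type."""
    col = LINK_COLUMN[link]
    row = min(TABLE, key=lambda r: r[col])
    return row[0], row[col]


for link in LINK_COLUMN:
    print(link, delay_optimal(link))
```

All three link types reach their minimum at 64 banks, where the bank access time is 6.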


Outline

Overview
Cache Design
Effect of Network Delay
Wire Design Space
Exploiting Heterogeneous Wires
Results


Cache Look-Up

Total cache access time = network delay + bank access
  - Network delay (requires 6-8 bits to identify the cache bank)
  - Bank access, part 1: decoder, wordline, and bitline delay (requires 10-15 bits of address)
  - Bank access, part 2: comparator and output driver delay (requires the remaining address bits for the tag match)

The entire access happens in a sequential manner


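Early Look-Up

  - Send the partial address on L-wires
  - Initiate the bank lookup
  - Wait for the complete address
  - Complete the access

[Figure: access timeline. The early lookup (requires 10-15 bits of address) starts when the L-wire message arrives; the tag match follows once the complete address is available.]

We can hide 60-70% of the bank access delay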
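
A rough timing model shows where the saving comes from. This is a sketch under stated assumptions, not the simulator's model: the function names and all cycle counts in the example call are hypothetical, and `index_fraction` is read as the share of the bank access (decoder, wordline, bitline) that needs only the index bits, which the slide puts at 60-70%.

```python
def sequential_access(net_cycles, bank_cycles):
    """Baseline: the full address crosses the network, then the bank is accessed."""
    return net_cycles + bank_cycles


def early_lookup_access(net_b_cycles, net_l_cycles, bank_cycles, index_fraction):
    """Early look-up: index bits travel on L-wires and start the bank access early."""
    index_part = bank_cycles * index_fraction   # decoder, wordline, bitline
    tag_part = bank_cycles - index_part         # comparator, output driver
    # The tag match starts only when both the array read-out and the
    # complete address (arriving on the slower B-wires) are available.
    ready = max(net_l_cycles + index_part, net_b_cycles)
    return ready + tag_part


# Hypothetical numbers: 32-cycle B-wire network, 16-cycle L-wire network,
# 6-cycle bank access, 65% of which depends only on the index bits.
baseline = sequential_access(32, 6)
early = early_lookup_access(32, 16, 6, 0.65)
print(baseline, early)
```

When the L-wire message arrives early enough, the index-dependent part of the bank access is fully overlapped with the transfer of the complete address, and only the tag-match portion remains on the critical path.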


Aggressive Look-Up

  - Send partial address bits on L-wires
  - Do the early look-up and a partial tag match
  - Send all the matched blocks aggressively

[Figure: access timeline. The aggressive lookup (requires an additional 8 bits of address for the partial tag match) starts when the L-wire message arrives; the full tag match happens at the cache controller, so the network delay is reduced.]


Aggressive Look-Up

Significant reduction in network delay (for address transfer)

Increase in traffic due to false matches < 1%

Marginal increase in link overhead
  - Additional 8 bits of L-wires compared to early lookup

- Adds complexity to the cache controller
- Needs logic to do the tag match
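
The split between the bank and the controller can be sketched as follows. This is a minimal illustration, not the simulated hardware: the function names are hypothetical, a cache set is modeled as a list of (tag, data) pairs, and using the low-order 8 tag bits for the partial match is an assumption.

```python
PARTIAL_BITS = 8  # extra address bits sent on L-wires for the partial tag match


def bank_partial_match(cache_set, tag):
    """At the bank: return every way whose low PARTIAL_BITS tag bits match."""
    mask = (1 << PARTIAL_BITS) - 1
    return [(t, data) for t, data in cache_set if (t & mask) == (tag & mask)]


def controller_full_match(candidates, tag):
    """At the cache controller: keep the block whose full tag matches, if any."""
    for t, data in candidates:
        if t == tag:
            return data
    return None


# Two ways share the low 8 tag bits (0xAB), so both are sent to the controller.
cache_set = [(0x1AB, "block a"), (0x2AB, "block b"), (0x3CD, "block c")]
sent = bank_partial_match(cache_set, 0x2AB)
hit = controller_full_match(sent, 0x2AB)
print(len(sent), hit)
```

A block sent because of a partial match but rejected by the full match is a false match; it costs extra network traffic, not correctness.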


Outline

Overview
Cache Design
Effect of Network Delay
Wire Design Space
Exploiting Heterogeneous Wires
Results


Experimental Setup

Simplescalar with contention modeled in detail

Single core, 8-issue out-of-order processor

32 MB, 8-way set-associative, on-chip L2 cache (SNUCA organization)

32 KB I-cache and 32 KB D-cache with a hit latency of 3 cycles

Main memory latency 300 cycles


Cache Models

Model   Bank Access (cycles)   Bank Count   Network Link    Description
1       3                      512          B-wires         Based on prior work
2       6                      64           B-wires         CACTI-L2
3       6                      64           B & L-wires     Early Lookup
4       6                      64           B & L-wires     Agg. Lookup
5       6                      64           B & L-wires     Upper bound


Performance Results (Global Wires)

Model 2 (CACTI-L2): average performance improvement 11%
  - Performance improvement for L2-latency-sensitive benchmarks: 16.3%

Model 3 (Early Lookup): average performance improvement 14.4%

Model 4 (Aggressive Lookup): average performance improvement 17.6%

Model 6 (L-Network): average performance improvement 11.4%

[Figure: bar chart for Models 1 to 6. Series: All Benchmarks, Latency Sensitive Benchmarks. Y-axis: 0 to 1.6.]


Performance Results (4X – Wires)

Wire-delay-constrained model

Performance improvements are better
  - Early lookup performs 5% better
  - Aggressive model performs 28% better

[Figure: IPC (normalized to Model 1) for Models 1 to 5 (different cache configurations). Series: All Benchmarks, Latency Sensitive Benchmarks. Y-axis: 0 to 2.]


Future Work

Heterogeneous network in a CMP environment

Hybrid network
  - Employs a combination of point-to-point and bus for L-messages
  - Effective use of L-wires
  - Latency/bandwidth trade-off

Use of heterogeneous wires in a DNUCA environment

Cache design focusing on power
  - Pre-fetching (power-optimized wires)
  - Writeback (power-optimized wires)


Conclusion

Traditional design approaches for large caches are sub-optimal

Network parameters play a significant role in the performance of large caches

The modified CACTI model, which includes network overhead, performs 16.3% better than previous models

A heterogeneous network has the potential to further improve performance
  - Early lookup: 21.6%
  - Aggressive lookup: 26.6%