Light NUCA: a proposal for bridging the inter-cache latency gap

Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals 1

1 Universidad de Zaragoza and 2 Universidad de Cantabria

Spain

Load-to-Use cache latency trend

Facing the inter-cache latency gap

• Reconfigurable L1/L2 (Balasubramonian et al., MICRO’00)

– single-ported memory cells: low bandwidth

• NUCA (Kim et al., ASPLOS’02)

– wire-delay dominated large caches

– routing-cache-routing: network overhead

• L-NUCA: L1 + small cache tiles + specialized networks

– low latency

– high bandwidth

– large associativity

Summary

• Motivation

• Introduction to L-NUCAs

• Networking

–Topologies

–Routing

–Messages

• Global Miss Determination

• Single cycle cache look-up plus one-hop routing

• Experimental Platform

• Results

• Conclusions

L-NUCA introduction

Latency   Tiles   Size (KB)
1         1       8
3         4       32
4         6       48
5         9       72
6         13      104
7         15      120
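
Every row of the table is consistent with 8 KB per tile. A minimal Python sketch that checks this, assuming uniform 8 KB tiles (the names `TILE_KB`, `ROWS`, and `size_kb` are illustrative and do not come from the slides):

```python
# Rows copied from the table above: (latency, tiles, size in KB).
ROWS = [(1, 1, 8), (3, 4, 32), (4, 6, 48), (5, 9, 72), (6, 13, 104), (7, 15, 120)]

TILE_KB = 8  # assumption: every tile holds 8 KB


def size_kb(tiles: int, tile_kb: int = TILE_KB) -> int:
    """Capacity in KB of a group of equally sized tiles."""
    return tiles * tile_kb


for latency, tiles, listed in ROWS:
    # Each listed size should equal tiles * 8 KB.
    assert size_kb(tiles) == listed
    print(f"latency {latency}: {tiles} tiles -> {size_kb(tiles)} KB")
```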

Topologies and Routing

• Independent operations ensure deadlock avoidance

Search

• Broadcast tree

• No flow control

Transport

• 2D mesh

• On/Off flow control

• Dynamic Distributed Routing (DDR)

Replacement

• Blocks ordered by temporal locality

• Latency-driven topology

• On/Off flow control

• DDR

Headerless messages

Operation     Message content             Source       Destination                   Width (bits or wires)
Search        @ + MSHR + st data + ctrl   r-tile       rest of tiles                 41 + 4 + 64 + 2 = 111
Transport     block + MSHR                tile (hit)   r-tile                        256 + 4 = 260
Replacement   block + @                   tile i       tile k, lat(k) = lat(i) + 1   256 + 41 = 297

Assuming 32-byte blocks

• No header overhead

• message = packet = flit = phit

• Implicit destination

• More than 1k m4/m5 wires fit in one side of an 8KB cache (Intel 32nm, Natarajan et al., IEDM’08)

[Figure: 8KB cache with > 1000 wires per side; worst case: 668]
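
The widths in the message table and the 668-wire worst case can be reproduced with simple sums. The sketch below assumes that "@" denotes the block address and that the worst case is one link of each of the three networks crossing the same tile side. Both are readings of the slide rather than statements taken from it, and all identifiers are illustrative:

```python
# Field widths in bits, taken from the message table (32-byte blocks).
ADDRESS = 41      # assumption: "@" in the table is the block address
MSHR_ID = 4
STORE_DATA = 64
CONTROL = 2
BLOCK = 32 * 8    # 32-byte block = 256 bits

WIDTHS = {
    "search": ADDRESS + MSHR_ID + STORE_DATA + CONTROL,
    "transport": BLOCK + MSHR_ID,
    "replacement": BLOCK + ADDRESS,
}

# Assumption: worst case = one link of each network on the same tile side.
worst_case = sum(WIDTHS.values())

for name, width in WIDTHS.items():
    print(f"{name}: {width} wires")
print(f"worst case: {worst_case} wires")
```

Under these assumptions the total stays below the more than 1k wires quoted for one side of an 8KB cache.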

Global Miss Determination Logic

• Tiles stop miss propagation on hits

• L-NUCA miss iff all last-level tiles miss

• Scalable hierarchical organization, taken from SRAM bitlines (Yang and Kim, JSSC’05)

• One cycle after the last-level look-up
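
The rule "L-NUCA miss iff all last-level tiles miss" amounts to an AND over the miss signals of the last-level tiles, computed hierarchically. The following Python sketch models that reduction as a tree of AND stages. The function name, the fan-in `group`, and the convention that a tile which never saw the search reports `False` are assumptions made for illustration, not details from the slides:

```python
from typing import List


def global_miss(last_level_miss: List[bool], group: int = 4) -> bool:
    """Reduce per-tile miss signals with a tree of AND stages.

    Assumption: a last-level tile that hit, or that never received the
    search because an earlier tile hit and stopped propagation, reports False.
    """
    if group < 2:
        raise ValueError("group must be at least 2")
    signals = list(last_level_miss)
    while len(signals) > 1:
        # Each stage ANDs `group` neighbouring signals (hypothetical fan-in).
        signals = [all(signals[i:i + group]) for i in range(0, len(signals), group)]
    # With no tiles at all there is nothing to report as a miss.
    return bool(signals) and signals[0]


print(global_miss([True] * 6))                             # every tile missed
print(global_miss([True, True, False, True, True, True]))  # one tile did not miss
```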

Single-cycle tiles

Three networks

Headerless messages

No DC, RT, and VA

No virtual channels

•Avoidance of multiple routing stages

•Parallel data array access and switch allocation

XBar: 3 inputs, 2 outputs

low latency

Simulator

• Enhanced SimpleScalar 3.0d (Alpha) with cycle-accurate memory and network models

• 4-issue processor:

– Speculative wake-up and selective recovery (as in the Intel Pentium 4)

– 128-entry ROB

– 64-entry LSQ

– Load-to-Use L1 miss penalty: 4 + cache latency

• Memory system:

– L1/RT: 32KB-4Way-32B (lat. 2 / init. rate 1) (2 ports)

– L3: 8MB-16Way-128B (lat. 20 / init. rate 15)

– 16-entry L1/RT MSHR

• 32 nm technology and 19 FO4s cycle-time
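
The miss penalty above is a sum of a fixed overhead and the latency of the serving cache. A small Python sketch of that arithmetic, assuming "cache latency" means the latency of the cache that serves the L1 miss (the function and constant names are illustrative):

```python
L1_MISS_OVERHEAD = 4  # cycles added on an L1 miss, as stated on the slide


def load_to_use_miss_penalty(cache_latency: int) -> int:
    """Load-to-use penalty in cycles for a load that misses in L1.

    Assumption: `cache_latency` is the latency of the cache serving the miss.
    """
    return L1_MISS_OVERHEAD + cache_latency


L3_LATENCY = 20  # from the memory system parameters on the slide
print(load_to_use_miss_penalty(L3_LATENCY))
```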

Workload and Delay, Power, and Area Models

• Workloads

– All SPEC CPU 2006 benchmarks but one (unable to run 483.xalancbmk on Alpha)

• Delay, Power, and Area modelling

– CACTI 5.3 and improved Orion for caches and routers

Tested Scenarios

•3-level conventional cache vs. L-NUCA and L3

•D-NUCA vs. L-NUCA and D-NUCA

Average IPC, 3-level vs. L-NUCA

[Figure: average IPC of the 3-level hierarchy vs. L-NUCA; annotated gains: +6.1% and +15%]

Hierarchy energy, 3-level vs. L-NUCA

[Figure: hierarchy energy; annotated change: -14.2%]

IPC and Area Comparison

[Figure: IPC and area of L2-256KB, L2-512KB, LN2-72KB, LN3-144KB, and LN4-248KB]

Area labels: 0.91 mm², 1.29 mm², 0.86 mm², 1.59 mm², 0.46 mm²

• Small L-NUCA network overhead (14 to 19%)

• The low density of L-NUCAs discourages the use of large sizes

Average IPC, L-NUCA with D-NUCA

[Figure: average IPC of D-NUCA vs. L-NUCA with D-NUCA; annotated gains: +4.2% and +6.8%]

Hierarchy Energy, L-NUCA with D-NUCA

[Figure: hierarchy energy; annotated value: 4.25%]

L-NUCA load-to-use latency

Configuration   IPC
L2-256KB        1.46
LN3-144KB       1.66

In 10 benchmarks, Le2 captures more than 75% of L2 read hits
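
The two IPC values listed above can be turned into a relative gain. A minimal Python sketch of that calculation (the helper `relative_gain` is illustrative; only the two IPC figures come from the slide):

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# IPC values as listed on the slide.
ipc = {"L2-256KB": 1.46, "LN3-144KB": 1.66}

gain = relative_gain(ipc["LN3-144KB"], ipc["L2-256KB"])
print(f"LN3-144KB vs. L2-256KB: {gain:.1f}% higher IPC")
```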

Conclusions & Future Work

• L-NUCAs leverage the advantages of NoChips for NoCaches, low latency and high bandwidth, and reduce the inter-cache latency gap

• Design based on 3 specialized networks conveying headerless messages

• Performance and Energy gains with conventional and D-NUCA LLCs

• Future Work:

• Integrate L-NUCAs in CMP and SMT environments

• Study the effect of prefetching for increasing spatial locality

L-NUCA summary

Out-of-Order processor pipeline

Tile internals

Search, Transport, Replacement

• MA: Miss Address Register (Search)

• U bf: Upstream buffer (replacement)

• D bf: Downstream buffer (transport)

Every D and U buffer has 2 entries (2-cycle round-trip delay)