Light NUCA: a proposal for bridging the inter-cache latency gap
Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals 1
1 Universidad de Zaragoza and 2 Universidad de Cantabria
Spain
2
Load-to-Use cache latency trend
3
Facing the inter-cache latency gap
• Reconfigurable L1/L2 (Balasubramonian et al., MICRO’00)
– single-ported memory cells, low bandwidth
• NUCA (Kim et al., ASPLOS’02)
– wire-delay dominated large caches
– routing-cache-routing network overhead
• L-NUCA: L1 + small cache tiles + specialized networks
– low latency
– high bandwidth
– large associativity
4
Summary
• Motivation
• Introduction to L-NUCAs
• Networking
–Topologies
–Routing
–Messages
• Global Miss Determination
• Single cycle cache look-up plus one-hop routing
• Experimental Platform
• Results
• Conclusions
5
L-NUCA introduction
Latency  Tiles  Size (KB)
1        1      8
3        4      32
4        6      48
5        9      72
6        13     104
7        15     120
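The Size column in the table is consistent with every tile holding 8 KB. The short check below is only a sketch: the 8 KB tile size is taken from the 8KB cache mentioned on the headerless-messages slide, reading the Tiles column as a running total is an assumption, and the names `ROWS`, `size_kb`, and `tiles_added` are hypothetical.

```python
# Rows copied from the slide: (latency, tiles, size in KB).
ROWS = [(1, 1, 8), (3, 4, 32), (4, 6, 48), (5, 9, 72), (6, 13, 104), (7, 15, 120)]

TILE_KB = 8  # assumption: every tile is an 8 KB cache


def size_kb(tiles: int, tile_kb: int = TILE_KB) -> int:
    """Capacity in KB of `tiles` tiles of `tile_kb` KB each."""
    return tiles * tile_kb


def tiles_added(rows):
    """Tiles gained at each latency step, assuming the Tiles column is cumulative."""
    added, previous = [], 0
    for _, tiles, _ in rows:
        added.append(tiles - previous)
        previous = tiles
    return added


if __name__ == "__main__":
    for latency, tiles, size in ROWS:
        # The last field is True when tiles * 8 KB matches the Size column.
        print(latency, tiles, size_kb(tiles), size_kb(tiles) == size)
    print(tiles_added(ROWS))
```

Under these assumptions, each latency step adds between one and four tiles.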
6
Topologies and Routing
• Independent operations, which ensures deadlock avoidance
• Search
– Broadcast tree
– No flow control
• Transport
– 2D mesh
– On/Off flow control
– Dynamic Distributed Routing (DDR)
• Replacement
– Blocks ordered by temporal locality
– Latency-driven topology
– On/Off flow control
– DDR
7
Headerless messages
Operation    Message content            Source      Destination                  Width (bits or wires)
Search       @ + MSHR + st data + ctrl  r-tile      rest tiles                   41 + 4 + 64 + 2 = 111
Transport    block + MSHR               tile (hit)  r-tile                       256 + 4 = 260
Replacement  block + @                  tile i      tile k, lat(k) = lat(i) + 1  256 + 41 = 297
Assuming 32-byte blocks
• No header overhead
• Message = packet = flit = phit
• Implicit destination
• More than 1k M4/M5 wires fit on one side of an 8 KB cache (Intel 32 nm, Natarajan et al., IEDM’08)
(Figure: one side of an 8 KB cache fits > 1000 wires; worst case: 668)
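The message widths can be recomputed from the field sizes in the table. The sketch below uses hypothetical constant names. It also relies on one observation that the slide does not state outright: the worst case of 668 wires equals the sum of the three message widths, which is read here as one link of each network sharing the same tile side (an assumption).

```python
# Field sizes in bits, copied from the table (32-byte blocks assumed, as on the slide).
ADDRESS_BITS = 41
MSHR_BITS = 4
STORE_DATA_BITS = 64
CONTROL_BITS = 2
BLOCK_BITS = 32 * 8  # 32-byte block

# Headerless messages: the width is just the sum of the payload fields.
WIDTHS = {
    "search": ADDRESS_BITS + MSHR_BITS + STORE_DATA_BITS + CONTROL_BITS,
    "transport": BLOCK_BITS + MSHR_BITS,
    "replacement": BLOCK_BITS + ADDRESS_BITS,
}

# Assumption: the worst case is one link of each network on the same tile side.
WORST_CASE_WIRES = sum(WIDTHS.values())

# Lower bound on wires per side of an 8 KB cache quoted on the slide (> 1k).
WIRE_BUDGET = 1000

if __name__ == "__main__":
    for operation, width in WIDTHS.items():
        print(operation, width)
    print("worst case:", WORST_CASE_WIRES, "fits:", WORST_CASE_WIRES < WIRE_BUDGET)
```

Because a message is a single phit, each link needs as many wires as the message has bits, which is why the wire budget per tile side matters.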
8
Global Miss Determination Logic
• Tiles stop miss propagation on hits
• L-NUCA miss iff all last-level tiles miss
• Scalable hierarchical organization, taken from SRAM bitlines (Yang and Kim, JSSC’05)
• One cycle after the last-level look-up
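The rule "miss iff all last-level tiles miss" can be modelled as a hierarchical AND over the miss signals of the last-level tiles. The sketch below is only a logical model, not the circuit itself. The function names and the fan-in of 4 are hypothetical. A last-level tile that never sees the search, because an inner tile hit and stopped propagation, is assumed to leave its signal at False.

```python
from typing import List


def hierarchical_and(signals: List[bool], fan_in: int = 4) -> bool:
    """AND-reduce the signals in groups of `fan_in`, one tree level at a time."""
    if not signals:
        raise ValueError("at least one miss signal is required")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(signals)
    while len(level) > 1:
        # Each group of up to `fan_in` signals collapses into one signal.
        level = [all(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def global_miss(last_level_miss: List[bool], fan_in: int = 4) -> bool:
    """True only when every last-level tile reports a miss."""
    return hierarchical_and(last_level_miss, fan_in)


if __name__ == "__main__":
    print(global_miss([True] * 6))                             # every tile missed
    print(global_miss([True, True, False, True, True, True]))  # one tile did not miss
```

Grouping the reduction keeps the fan-in of each stage bounded as the number of last-level tiles grows, which is the property the slide attributes to the hierarchical organization.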
9
Single-cycle tiles
• Three networks
• Headerless messages
• No DC, RT, and VA
• No virtual channels
• Avoidance of multiple routing stages
• Parallel data array access and switch allocation
• XBar: 3 inputs, 2 outputs
• Low latency
11
Simulator
• Enhanced SimpleScalar 3.0d (Alpha) with:
– Cycle-accurate memory and network models
• 4-issue processor:
– Speculative wake-up and selective recovery (similar to the Intel Pentium 4)
– 128-entry ROB
– 64-entry LSQ
– Load-to-Use L1 miss penalty: 4 + cache latency
• Memory system:
– L1/RT: 32KB-4Way-32B (lat. 2 / init. rate 1) (2 ports)
– L3: 8MB-16Way-128B (lat. 20 / init. rate 15)
– 16-entry L1/RT MSHR
• 32 nm technology and 19 FO4s cycle-time
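As a worked example of the load-to-use rule above, the sketch below applies "4 + cache latency" to the L3 latency listed in the memory system. The function and constant names are hypothetical, and treating the latencies as processor cycles is an assumption.

```python
L1_MISS_OVERHEAD = 4  # fixed term from "Load-to-Use L1 miss penalty: 4 + cache latency"
L3_LATENCY = 20       # "lat. 20" from the memory system list (assumed to be cycles)


def load_to_use_miss_penalty(cache_latency: int) -> int:
    """Load-to-use penalty of an L1 miss served by a cache with the given latency."""
    if cache_latency < 0:
        raise ValueError("latency cannot be negative")
    return L1_MISS_OVERHEAD + cache_latency


if __name__ == "__main__":
    # An L1 miss that hits in the L3 pays 4 + 20.
    print(load_to_use_miss_penalty(L3_LATENCY))
```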
12
Workload and Delay, Power, and Area Models
• Workloads
– All SPEC CPU 2006 benchmarks but one (unable to run 483.xalancbmk on Alpha)
• Delay, Power, and Area modelling
– Cacti 5.3 and improved Orion for caches and routers
13
Summary
• Motivation
• Introduction to L-NUCAs
• Experimental Platform
• Results
– 3-level conventional cache vs. L-NUCA and L3
– D-NUCA vs. L-NUCA and D-NUCA
• Conclusions
14
Tested Scenarios
•3-level conventional cache vs. L-NUCA and L3
•D-NUCA vs. L-NUCA and D-NUCA
15
Average IPC, 3-level vs. L-NUCA
(Figure: average IPC; annotated gains of +6.1% and +15%)
16
Hierarchy energy, 3-level vs. L-NUCA
(Figure: hierarchy energy; annotated change of -14.2%)
17
IPC and Area Comparison
(Figure: IPC and area for L2-256KB, L2-512KB, LN2-72KB, LN3-144KB, and LN4-248KB; area labels as extracted: 0.91 mm², 1.29 mm², 0.86 mm², 1.59 mm², 0.46 mm²)
• Small L-NUCA network overhead (14 to 19%)
• The low density of L-NUCAs discourages the use of large sizes
19
Average IPC, L-NUCA with D-NUCA
(Figure: average IPC; annotated gains of +4.2% and +6.8%)
20
Hierarchy Energy, L-NUCA with D-NUCA
(Figure: hierarchy energy; annotated value of 4.25%)
21
L-NUCA load-to-use latency
Configuration  IPC
L2-256KB       1.46
LN3-144KB      1.66
In 10 benchmarks, Le2 captures more than 75% of L2 read hits
23
Conclusions & Future Work
• L-NUCAs leverage the advantages of NoChips for NoCaches (low latency and high bandwidth) and reduce the inter-cache latency gap
• Design based on 3 specialized networks conveying headerless messages
• Performance and energy gains with conventional and D-NUCA LLCs
• Future Work:
– Integrate L-NUCAs in CMP and SMT environments
– Study the effect of prefetching for increasing spatial locality
25
L-NUCA summary
26
Out-of-Order processor pipeline
28
Tile internals
(Figure: tile internals for the Search, Transport, and Replacement networks)
MA: Miss Address Register (Search)
U bf: Upstream buffer (Replacement)
D bf: Downstream buffer (Transport)
Every D and U buffer has 2 entries (2-cycle round-trip delay)