University of Utah 1
The Effect of Interconnect Design on the Performance of Large L2
Caches
Naveen Muralimanohar Rajeev Balasubramonian
Motivation: Large Caches
Future processors will have large on-chip caches
  Intel Montecito has 24 MB of on-chip cache
Wire delay dominates in large caches
  Conventional design can lead to very high hit time
  (CACTI access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology)
Careful network choices
  Improve access time
  Open room for several other optimizations
  Reduce power significantly
Effect of L2 Hit Time
[Figure: bar chart of IPC improvement (0% to 50%) per benchmark: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise]
Increase in IPC due to reduction in L2 access time
8-issue, out-of-order processor (L2 hit time reduced from 30 to 15 cycles)
Avg = 17%
Cache Design
[Figure: cache block diagram. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output drivers, data output, valid output?]
Existing Model - CACTI
[Figure: two panels, a cache model with 4 sub-arrays and a cache model with 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]
Shortcomings
CACTI
  Suboptimal for large cache sizes
  Access delay is equal to the delay of the slowest sub-array
    Very high hit time for large caches
  Employs a separate bus for each cache bank in multi-banked caches
Non-Uniform Cache Access (NUCA)
Large cache is broken into a number of small banks
Employs on-chip network for communication
Access delay depends on the distance between bank and cache controller
[Figure: grid of cache banks connected to the CPU & L1]
Shortcomings
NUCA
  Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS 02)
  Increased routing complexity
  Dissipates more power
Extension to CACTI
On-chip network
  Wire model is done using ITRS 2005 parameters
  Grid network
    No. of rows = No. of columns (or ½ the no. of columns)
Network latency vs. bank access latency tradeoff
  Modified the exhaustive search to include the network overhead
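
The grid-shape rule above (rows equal to columns, or half the columns) can be sketched as follows. This is a minimal illustration, assuming bank counts are powers of two as in the rest of the deck; the function name `grid_dims` is hypothetical and not part of CACTI.

```python
def grid_dims(bank_count):
    """Return (rows, cols) with rows == cols or rows == cols / 2.

    Assumes bank_count is a power of two (an assumption of this sketch).
    """
    if bank_count < 1 or bank_count & (bank_count - 1):
        raise ValueError("bank_count must be a power of two")
    log2 = bank_count.bit_length() - 1
    rows = 1 << (log2 // 2)      # floor(log2 / 2) bits for the row dimension
    cols = bank_count // rows    # remaining banks go to the column dimension
    return rows, cols


for banks in (32, 64, 512):
    print(banks, grid_dims(banks))
```

An even power of two gives a square grid (64 banks as 8 x 8); an odd power gives half as many rows as columns (32 banks as 4 x 8).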
Effect of Network Delay (32MB cache)
[Figure: latency in cycles (at 5 GHz, 0 to 140) versus bank count (2 to 4096). Series: bank access time, average cache access latency (global wires), average network delay. The delay-optimal point is marked.]
Outline
Overview
Cache Design
Effect of Network Delay
Wire Design Space
Exploiting Heterogeneous Wires
Results
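Wire Characteristics
Wire resistance and capacitance per unit length

$$R_{wire} = \frac{\rho}{(\text{thickness} - \text{barrier})(\text{width} - 2\,\text{barrier})}$$

$$C_{wire} = \epsilon_0 \left( 2K\epsilon_{horiz}\frac{\text{thickness}}{\text{spacing}} + 2\epsilon_{vert}\frac{\text{width}}{\text{layerspacing}} \right) + \text{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

[Table: effect of wire width and spacing on resistance, capacitance, and bandwidth]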
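
To make the dependence on width and spacing concrete, the per-unit-length expressions on this slide can be evaluated numerically. The sketch below is illustrative only: every numeric value is hypothetical (not taken from the slides or from ITRS), and the fringe term is passed in as a precomputed constant, which is a simplifying assumption.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m


def wire_resistance(rho, width, thickness, barrier):
    """R per unit length: rho / ((thickness - barrier) * (width - 2*barrier))."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance(width, spacing, thickness, layer_spacing,
                     eps_horiz, eps_vert, k, fringe):
    """C per unit length; `fringe` is supplied as a constant (assumption)."""
    sidewall = 2 * k * eps_horiz * thickness / spacing    # coupling to neighbours
    vertical = 2 * eps_vert * width / layer_spacing       # coupling to adjacent layers
    return EPS0 * (sidewall + vertical) + fringe


# Hypothetical geometry and material values, lengths in metres.
base = dict(width=200e-9, spacing=200e-9, thickness=400e-9, layer_spacing=400e-9,
            eps_horiz=2.7, eps_vert=2.7, k=1.5, fringe=4e-11)
RHO, BARRIER = 2.2e-8, 10e-9

r_base = wire_resistance(RHO, base["width"], base["thickness"], BARRIER)
r_wide = wire_resistance(RHO, 2 * base["width"], base["thickness"], BARRIER)
c_base = wire_capacitance(**base)
c_spaced = wire_capacitance(**{**base, "spacing": 2 * base["spacing"]})

print(r_wide < r_base)    # wider wire: lower resistance
print(c_spaced < c_base)  # larger spacing: lower capacitance
```

Under these assumptions a wider wire has lower resistance and a larger spacing lowers capacitance, while both changes reduce how many wires fit in a given area, which is the bandwidth cost.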
Design Space Exploration
Tuning wire width and spacing
  Base case: B-wires
  Fast but low bandwidth: L-wires (width & spacing)
[Figure: B-wires versus L-wires, annotated with delay and bandwidth]
Design Space Exploration
Tuning repeater size and spacing
  Traditional wires: large repeaters, optimum spacing
  Power-optimal wires: smaller repeaters, increased spacing
[Figure: traditional versus power-optimal wires, annotated with delay and power]
Design Space Exploration
Wire type                        Plane  Latency  Power  Area
B-wires (base case)              8x     1x       1x     1x
W-wires (base case)              4x     1.6x     0.9x   0.5x
PW-wires (power optimized)       4x     3.2x     0.3x   0.5x
L-wires (fast, low bandwidth)    8x     0.5x     0.5x   5x
Access time for different link types
Bank count  Bank access time  Average access time (cycles)
            (cycles)          8x-wires  4x-wires  L-wires
16          17                46        75        21
32          9                 40        71        15
64          6                 38        63        14
128         5                 44        68        17
256         4                 51        83        20
512         3                 82        113       27
1024        3                 100       133       35
2048        3                 99        162       51
4096        3                 131       196       67
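
The delay-optimal bank count for each link type can be read off the table by taking the minimum of each average-access-time column. The sketch below simply transcribes the table; the names `ACCESS_TABLE` and `delay_optimal_bank_count` are illustrative and not part of CACTI.

```python
# bank count -> (bank access, avg access 8x-wires, 4x-wires, L-wires), all in cycles
ACCESS_TABLE = {
    16:   (17, 46, 75, 21),
    32:   (9, 40, 71, 15),
    64:   (6, 38, 63, 14),
    128:  (5, 44, 68, 17),
    256:  (4, 51, 83, 20),
    512:  (3, 82, 113, 27),
    1024: (3, 100, 133, 35),
    2048: (3, 99, 162, 51),
    4096: (3, 131, 196, 67),
}
LINK_COLUMN = {"8x-wires": 1, "4x-wires": 2, "L-wires": 3}


def delay_optimal_bank_count(link_type):
    """Return (bank_count, avg_access_cycles) with the lowest average access time."""
    col = LINK_COLUMN[link_type]
    best = min(ACCESS_TABLE, key=lambda banks: ACCESS_TABLE[banks][col])
    return best, ACCESS_TABLE[best][col]


for link in LINK_COLUMN:
    print(link, delay_optimal_bank_count(link))
# 8x-wires (64, 38)
# 4x-wires (64, 63)
# L-wires (64, 14)
```

For all three link types the minimum falls at 64 banks, where the bank access time is 6 cycles. Fewer banks make each bank slower, and more banks let the network delay dominate.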
Cache Look-Up
Total cache access time
  Network delay (requires 6-8 bits to identify the cache bank)
  Bank access
    Decoder, wordline, bitline delay (requires 10-15 bits of address)
    Comparator, output driver delay (requires remaining address for tag match)
The entire access happens in a sequential manner
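
The three address portions used during a look-up can be illustrated by splitting an address into fields. This is only a sketch: the field widths (6 offset bits, 6 bank bits for 64 banks, 12 index bits) and their ordering within the address are assumptions chosen to fall inside the ranges quoted on the slide, not values stated in the deck.

```python
from collections import namedtuple

Fields = namedtuple("Fields", "tag index bank offset")

# Hypothetical field widths, for illustration only.
OFFSET_BITS = 6    # 64-byte block (assumed)
BANK_BITS = 6      # 64 banks; the slide quotes 6-8 bits to identify the bank
INDEX_BITS = 12    # the slide quotes 10-15 bits for decoder/wordline/bitline


def split_address(addr):
    """Split an address into (tag, index, bank, offset), low bits first."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    addr >>= OFFSET_BITS
    bank = addr & ((1 << BANK_BITS) - 1)      # used by the network to route
    addr >>= BANK_BITS
    index = addr & ((1 << INDEX_BITS) - 1)    # used to start the bank look-up
    tag = addr >> INDEX_BITS                  # remaining bits, used for tag match
    return Fields(tag, index, bank, offset)


example = (0xABC << 24) | (0x123 << 12) | (5 << 6) | 7
print(split_address(example))
```

Only the bank and index fields are needed to route the request and begin the array access; the tag is needed only at the comparator stage.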
Early Look-Up
  Send partial address on L-wires
  Initiate the bank lookup
  Wait for the complete address
  Complete the access
[Figure: timeline with early lookup over L-wires (requires 10-15 bits of address), followed by tag match]
We can hide 60-70% of the bank access delay
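
The overlap that early look-up exploits can be sketched with a simple timing model. This is an illustration under assumptions, not the authors' simulator: it covers only the request path up to the tag match, and all cycle counts in the example are hypothetical.

```python
def sequential_latency(full_addr_arrival, lookup, tag_match):
    """Baseline: the bank starts only after the complete address arrives."""
    return full_addr_arrival + lookup + tag_match


def early_lookup_latency(partial_addr_arrival, full_addr_arrival, lookup, tag_match):
    """Early look-up: decode/wordline/bitline work starts on the partial address.

    The tag match still has to wait for the complete address.
    """
    lookup_done = partial_addr_arrival + lookup
    return max(lookup_done, full_addr_arrival) + tag_match


# Hypothetical cycle counts: the partial address on L-wires arrives in half
# the time of the full address on B-wires.
seq = sequential_latency(full_addr_arrival=10, lookup=4, tag_match=2)
early = early_lookup_latency(partial_addr_arrival=5, full_addr_arrival=10,
                             lookup=4, tag_match=2)
print(seq, early)  # 16 12
```

In this model the saving is bounded by the look-up portion of the bank access, since the tag match cannot start before the full address is available.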
Aggressive Look-Up
  Send partial address bits on L-wires
  Do early look-up and partial tag match
  Send all the matched blocks aggressively
[Figure: timeline with aggressive lookup over L-wires (requires an additional 8 bits of address for partial tag match), tag match at the cache controller, and reduced network delay]
Aggressive Look-Up
  Significant reduction in network delay (for address transfer)
  Increase in traffic due to false matches < 1%
  Marginal increase in link overhead
    Additional 8 bits of L-wires compared to early lookup
  - Adds complexity to the cache controller
    - Needs logic to do tag match
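
The split between partial tag match at the bank and full tag match at the cache controller can be sketched as below. This is a minimal illustration: it assumes the 8 partial-tag bits are the low-order bits of the tag, and the function names are hypothetical.

```python
PARTIAL_BITS = 8  # the slide quotes an additional 8 bits for partial tag match
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1


def bank_partial_match(set_tags, partial_tag):
    """At the bank: return (way, stored_tag) for every way whose low bits match."""
    return [(way, tag) for way, tag in enumerate(set_tags)
            if tag & PARTIAL_MASK == partial_tag]


def controller_full_match(candidates, full_tag):
    """At the cache controller: pick the candidate whose full tag matches."""
    for way, tag in candidates:
        if tag == full_tag:
            return way
    return None  # every candidate was a false match: cache miss


# 8-way set; ways 0 and 1 share the same low 8 tag bits (0x2B).
set_tags = [0x1A2B, 0x3C2B, 0x0011, 0x7F00, 0x1234, 0x00FF, 0x4321, 0x0A0B]
full_tag = 0x3C2B
candidates = bank_partial_match(set_tags, full_tag & PARTIAL_MASK)
print(candidates)                                   # two candidates, one is a false match
print(controller_full_match(candidates, full_tag))  # 1
```

A false match costs extra traffic because the bank sends a block the controller then discards, which is the traffic increase the slide reports as below 1%.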
Experimental Setup
Simplescalar with contention modeled in detail
Single core, 8-issue out-of-order processor
32 MB, 8-way set-associative, on-chip L2 cache (SNUCA organization)
32 KB I-cache and 32 KB D-cache with hit latency of 3 cycles
Main memory latency 300 cycles
Cache Models
Model  Bank access (cycles)  Bank count  Network link   Description
1      3                     512         B-wires        Based on prior work
2      6                     64          B-wires        CACTI-L2
3      6                     64          B & L-wires    Early lookup
4      6                     64          B & L-wires    Aggressive lookup
5      6                     64          B & L-wires    Upper bound
Performance Results (Global Wires)
Model 2 (CACTI-L2)
  Average performance improvement: 11%
  Performance improvement for L2-latency-sensitive benchmarks: 16.3%
Model 3 (Early Lookup)
  Average performance improvement: 14.4%
  Performance improvement for L2-latency-sensitive benchmarks: 21.6%
Model 4 (Aggressive Lookup)
  Average performance improvement: 17.6%
  Performance improvement for L2-latency-sensitive benchmarks: 26.6%
Model 6 (L-Network)
  Average performance improvement: 11.4%
  Performance improvement for L2-latency-sensitive benchmarks: 16.2%
[Figure: bar chart (0 to 1.6) for Models 1 to 6. Series: all benchmarks, latency-sensitive benchmarks]
Performance Results (4X – Wires)
Wire-delay-constrained model
  Performance improvements are better
  Early lookup performs 5% better
  Aggressive model performs 28% better
[Figure: IPC normalized to Model 1 (0 to 2) for Models 1 to 5 (different cache configurations). Series: all benchmarks, latency-sensitive benchmarks]
Future Work
  Heterogeneous network in a CMP environment
  Hybrid network
    Employs a combination of point-to-point and bus for L-messages
    Effective use of L-wires
    Latency/bandwidth trade-off
  Use of heterogeneous wires in DNUCA environment
  Cache design focusing on power
    Pre-fetching (power-optimized wires)
    Writeback (power-optimized wires)
Conclusion
Traditional design approaches for large caches are sub-optimal
Network parameters play a significant role in the performance of large caches
The modified CACTI model, which includes network overhead, performs 16.3% better than previous models (L2-latency-sensitive benchmarks)
A heterogeneous network has the potential to further improve performance
  Early lookup: 21.6%
  Aggressive lookup: 26.6%