CloudCache: Expanding and Shrinking Private Caches
Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers
L2 cache design challenges
• Heterogeneous workloads
  – Multiple VMs/apps in a single chip
  – Average data center utilization: 15–30%
• Tiled many-core CMPs
  – Intel 48-core SCC, Tilera 100-core CMP
  – Many L2 banks: tens to hundreds
• L2 cache management
  – Capacity allocation
  – Remote L2 access latency
  – Distributed on-chip directory
Question: How to manage the L2 cache resources of many-core CMPs?
CloudCache approach
Design philosophy: aggressively and flexibly allocate capacity based on workloads' demand
Key techniques:
• Global capacity partitioning
• Cache chain links with nearby L2 banks
• Limited target broadcast
Beneficiaries: gain performance
Benefactors: sacrifice performance
CloudCache example
[Figure: threads with differing capacity demands mapped onto a tiled 64-core CMP]
Remote L2 access still slow
[Figure: tiled 64-core CMP; a remote directory lookup precedes the data transfer]
• Remote directory access is on the critical path
CloudCache solution
• Remote directory access is on the critical path
• Limited target broadcast (LTB)
[Figure: tiled 64-core CMP; LTB probes a limited set of nearby L2 banks]
Purposes of proposed techniques
I. Cloud (partitioned capacity) formation
  • Global capacity partitioning
  • Distance-aware cache chain links
II. Directory access latency minimization
  • Limited target broadcast
I. Cloud formation
A four-step process forms clouds based on workload demand:
Step 1: Monitoring
Step 2: Capacity determination
Step 3: L2 bank/token allocation
Step 4: Chain link formation
Step 1: Monitoring
• GCA (global capacity allocator)
• Hit count per LRU position
• "Allocated capacity" + "monitoring capacity" (of 32 ways)
• Partitioning: utility-based [Qureshi & Patt, MICRO '06] and QoS
[Figure: each L2 bank sends its hit counts over the network to the GCA, which returns cache allocation info]
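The monitoring hardware can be pictured with a small software model. Below is a minimal sketch in the spirit of utility monitoring [Qureshi & Patt, MICRO '06], not the authors' exact hardware (all names are illustrative): an LRU stack records, on each hit, the stack position where the block was found, and the resulting per-position histogram tells the GCA how many hits each additional way would have captured.

```python
from collections import deque

class UtilityMonitor:
    """Tracks hit counts per LRU position for one thread (illustrative)."""
    def __init__(self, num_ways=32):
        self.num_ways = num_ways
        self.hits_per_position = [0] * num_ways   # hit count per LRU position
        self.stack = deque()                      # index 0 = MRU

    def access(self, tag):
        if tag in self.stack:
            pos = list(self.stack).index(tag)     # LRU position of the hit
            self.hits_per_position[pos] += 1
            self.stack.remove(tag)
        elif len(self.stack) == self.num_ways:
            self.stack.pop()                      # evict the LRU tag
        self.stack.appendleft(tag)                # (re)insert at MRU

    def utility(self, ways):
        # Hits this thread would see if it were given `ways` ways.
        return sum(self.hits_per_position[:ways])
```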
Step 2: Capacity determination
• The GCA's allocation engine combines hit counts from the allocated capacity and from the monitoring capacity
• Output: per-thread capacity that minimizes overall misses
[Figure: per-thread hit curves vs. capacity; allocations may be fractional, e.g., 0.75 of a bank]
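As a sketch of what the allocation engine computes, the following greedy loop (a simplification of the lookahead algorithm in Qureshi & Patt's utility-based partitioning; `monitors` are the UtilityMonitor objects above, and the slide's QoS constraints are omitted) hands each unit of capacity to whichever thread gains the most hits from it:

```python
def partition_capacity(monitors, total_ways):
    """Greedy marginal-utility partitioning (illustrative simplification)."""
    alloc = [0] * len(monitors)
    for _ in range(total_ways):
        # Marginal gain of one more way for each thread.
        gains = [m.utility(alloc[i] + 1) - m.utility(alloc[i])
                 if alloc[i] < m.num_ways else -1
                 for i, m in enumerate(monitors)]
        winner = max(range(len(monitors)), key=lambda i: gains[i])
        if gains[winner] < 0:
            break               # every thread already holds all its ways
        alloc[winner] += 1
    return alloc                # ways per thread, minimizing misses greedily
```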
Step 3: Bank/token allocation
1. Local L2 cache first
2. Threads with larger capacity demand first
3. Closer L2 banks first
[Figure: worked example. A thread demanding 2.75 banks receives bank 0 / token 1, bank 2 / token 1, and bank 3 / token 0.75; a thread demanding 1.25 banks receives bank 1 / token 1 and bank 3 / token 0.25. The allocation repeats until every thread's capacity is placed.]
Output: a bank/token list for each thread
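A minimal sketch of these three rules, assuming one token equals one full bank and that `distance(a, b)` gives the hop distance between tiles (both assumptions are illustrative):

```python
def allocate_banks(demands, home_tile, distance, num_banks):
    """demands: thread -> capacity in bank units (possibly fractional);
    home_tile: thread -> index of its local bank."""
    free = [1.0] * num_banks            # remaining token fraction per bank
    plan = {t: [] for t in demands}     # thread -> [(bank, token), ...]
    # Rule 2: serve threads with larger capacity demand first.
    for t in sorted(demands, key=demands.get, reverse=True):
        need = demands[t]
        # Rules 1 and 3: the local bank has distance 0, so it sorts first;
        # the remaining banks are tried closest-first.
        for b in sorted(range(num_banks), key=lambda b: distance(home_tile[t], b)):
            if need <= 0:
                break
            take = min(free[b], need)
            if take > 0:
                plan[t].append((b, take))
                free[b] -= take
                need -= take
    return plan                         # the slide's bank/token output
```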
Step 4: Building cache chain links
[Figure: the thread with total capacity 2.75 chains banks 0, 2, and 3 (hop distances 0, 1, 2) into one virtual L2 cache ordered MRU to LRU, per its bank/token allocation 0 / 1, 2 / 1, 3 / 0.75]
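A minimal sketch of how a chain behaves, assuming block-granularity capacities and list-based LRU (both illustrative, not the paper's hardware): the allocated banks, ordered by hop distance, act as one virtual L2 in which blocks evicted from a near link spill into the next one.

```python
class CacheChain:
    """One thread's virtual L2: chain links ordered MRU -> LRU by hop distance."""
    BLOCKS_PER_BANK = 1024              # illustrative bank size in blocks

    def __init__(self, plan_for_thread):
        # plan_for_thread: [(bank, token), ...] already sorted by hop distance
        self.links = [{"bank": b, "cap": int(tok * self.BLOCKS_PER_BANK), "blocks": []}
                      for b, tok in plan_for_thread]

    def lookup(self, addr):
        for link in self.links:         # probe near links first
            if addr in link["blocks"]:
                link["blocks"].remove(addr)
                self._insert(0, addr)   # promote toward the MRU link
                return True
        return False                    # miss in every link -> memory

    def _insert(self, i, addr):
        if i == len(self.links):
            return                      # fell off the LRU end of the chain
        link = self.links[i]
        link["blocks"].insert(0, addr)
        if len(link["blocks"]) > link["cap"]:
            self._insert(i + 1, link["blocks"].pop())   # spill to the next link
```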
Purposes of proposed techniques
I. Cloud formation
  • Global capacity partitioning
  • Distance-aware cache chain links
II. Directory access latency minimization
  • Limited target broadcast
II. Limited target broadcast (LTB)
• Private data: LTB without a directory lookup; the directory is updated afterward
• Shared data: directory-based coherence
• Private accesses far outnumber shared accesses (private >> shared)
Limited target broadcast protocol
• Data is fetched with LTB
• If another core accesses the directory for the fetched data before the directory is updated, the directory entry is stale
• The full protocol is detailed in the paper
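A minimal sketch of the fast path, assuming a per-line private/shared marker and dict-based banks (illustrative names and structures; the race above, where a directory read lands before the lazy update, is exactly what the paper's full protocol resolves and is omitted here):

```python
def fetch_on_l2_miss(addr, chain_banks, directory, memory):
    """Private lines: limited target broadcast, no up-front directory lookup."""
    if directory.get(addr, {}).get("shared", False):
        # Shared data: take the conventional directory-based coherence path.
        return memory[addr]
    # Private data: broadcast only to this thread's chain-link banks.
    for bank in chain_banks:
        if addr in bank:
            data = bank[addr]
            directory[addr] = {"shared": False}   # lazy update, off critical path
            return data
    return memory[addr]                           # not in any link: fetch from DRAM
```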
Experimental setup
• TPTS simulator [Lee et al., SPE '10; Cho et al., ICPP '08]
  – 64-core CMP with an 8×8 2D mesh, 4 cycles/hop
  – Core: Intel Atom-like two-issue in-order pipeline
  – Directory-based MESI protocol
  – Four independent DRAM controllers, four ports per controller
  – DRAM with Samsung DDR3-1600 timing
• Workloads
  – SPEC CPU2006 (10B cycles), classified high/medium/low by MPKI across varying cache capacities
  – PARSEC (simlarge input set), 16 threads per application
Evaluated schemes
• Shared
• Private
• DSR [HPCA 2009]: spillers and receivers
• ECC [ISCA 2010]: local partitioning/monitoring
• CloudCache
Impact of global partitioning
[Figure: MPKI of Shared, Private, DSR, ECC, and CloudCache for high-, medium-, and low-demand workloads]
CloudCache aggressively allocates capacity with global information
L2 cache access latency
[Figure: L2 access latency histograms (access count vs. latency) for 401.bzip2]
• Shared: widely spread latency
• Private: fast local access + off-chip access
• DSR & ECC: fast local access + widely spread latency
• CloudCache: fast local access + fast remote access
16 threads, throughput
[Figure: relative speed over Private for Shared, DSR, ECC, and CloudCache across the Comb, Light, Medium, Heavy, and AVG workload mixes (0% to 140%)]
16 threads, beneficiaries
[Figure: number of beneficiaries (0 to 12), average speedup (0% to 30%), and maximum speedup (0% to 80%) for Shared, DSR, ECC, and CloudCache]
• Benefactors' performance: <1% degradation (see the paper for graphs)
32 / 64 threads, throughput
[Figure: relative speed over Private for Shared, DSR, ECC, and CloudCache on the Comb, Light, and AVE mixes, at 32 threads and at 64 threads]
Multithreaded workloads (PARSEC)
[Figure: speedup over Private for Shared, DSR, ECC, and CloudCache on Comb1 through Comb5]
Conclusion
• Unbounded shared capacity is EVIL
• CloudCache: private caches for threads
  – Capacity allocation with global partitioning
  – Cache chain links with nearby L2 banks
  – Limited target broadcast
• HW overhead is very small (~5KB).
Use CloudCache!
CloudCache: Expanding and Shrinking Private Caches
Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers