Page 1: Bank-aware Dynamic Cache Partitioning for Multicore Architectures

ICPP-38 2009

Laboratory for Computer Architecture 9/23/2009

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Dimitris Kaseridis (1), Jeff Stuecheli (1,2) & Lizy K. John (1)

(1) University of Texas – Austin  (2) IBM – Austin

Page 2: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Outline

Motivation/background

Cache partitioning/profiling

Proposed system

Results

Conclusion/future work

Page 3: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Motivation

Shared resources in CMPs

– Last Level Cache

– Memory bandwidth

Opportunity and Pitfalls

– Constructive
  • Mixing low and high cache requirements in a shared pool

– Destructive
  • Thrashing workloads (SPEC CPU 2000 art + mcf)

– Cache partitioning required

– Primary opportunity requires heterogeneous workload mixes
  • Typical in consolidation + virtualization

Page 4: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Monolithic vs NUCA vs Industry architectures

Monolithic: One large shared uniform latency cache bank on a CMP

– Does not exploit physical locality for private data

– Slow for all

CMP-NUCA: Typical proposal has a very large number of autonomous cache banks

– Very flexible (256 banks)

– Non-optimal configuration
  • Inefficient bank size (bank overhead)

Real implementations

– Fewer banks in industry

– NUCA with discrete cache levels

– Key difference is the wire assumptions made in the original NUCA analysis

[Figure: example floorplans of IBM POWER7 and Intel Nehalem EX — cores paired with a small number of cache banks]

Page 5: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Baseline System

8 cores

16 MB total capacity

– 16 x 1 MB banks

– 8 way associative

Local Banks

– Low latency to the physically close core

Center Banks

– Shared capacity

Page 6: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cache Partitioning/Profiling

Page 7: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cache Sharing/Partitioning

Last-level cache of a CMP

– Once-isolated resources are now shared
  • Drove the need for isolation

– Design space
  • Non-configurable: shared vs. private caches
  • Static partitioning/policy: long-term policy choice
  • Dynamic: real-time profiling-directed partitions
    – Trial and error (experiment to find the ideal configuration)
    – Predictive profilers
      > Non-invasive state-space exploration (our system)

Page 8: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware cache partitions

System components

– Non-invasive profiling using MSA (Mattson Stack Algorithm)

– Cache allocation using marginal utility

– Bank-aware LLC partitions

Page 9: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


MSA Based Cache Profiling

Mattson stack algorithm

– Originally proposed to concurrently simulate many cache sizes

– Structure is a true LRU cache

– Stack distance from MRU of each reference is recorded

– Misses can be calculated for any fraction of the ways
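The one-pass profiling idea above can be sketched in software (a minimal illustrative model of the algorithm itself, not of the hardware structure; set sampling and partial tags are omitted):

```python
def msa_profile(addresses, max_ways):
    """Mattson stack algorithm over one cache set: keep a true-LRU
    stack, record each reference's stack distance from the MRU, and
    derive miss counts for every associativity in a single pass."""
    stack = []                       # index 0 = MRU position
    hist = [0] * (max_ways + 1)      # hist[d]: refs at distance d; last slot = cold/deep misses
    for addr in addresses:
        if addr in stack:
            d = stack.index(addr)    # stack distance from MRU
            hist[min(d, max_ways)] += 1
            stack.remove(addr)
        else:
            hist[max_ways] += 1      # first touch: infinite stack distance
        stack.insert(0, addr)        # reference becomes the new MRU
    return hist

def misses_for_ways(hist, ways):
    # A w-way true-LRU cache hits exactly the references whose stack
    # distance is less than w, so everything at distance >= w misses.
    return sum(hist[ways:])
```

For the trace a, b, a, b a 2-way cache misses only on the two first touches, while a 1-way cache misses every time; both numbers fall out of the same histogram.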

Page 10: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Hardware MSA implementation

Naïve algorithm is prohibitive
– Fully associative
– Complete cache directory of the maximum cache size for every core on the CMP (total size)

Reduction techniques
– Set sampling
– Partial tags
– Maximal capacity

Configuration in the paper
– 12-bit tags
– 1/32 set sampling
– 9/16 bank per core
– 0.4% overhead relative to on-chip cache

[Figure: MSA profiler structure — an array of sets × ways]

Page 11: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Marginal Utility

Miss rate relative to capacity is non-linear and heavily workload-dependent

Dramatic miss rate reduction as data structures become cache contained

In practice,

– Iteratively assign cache to cores that produce the most hits per capacity
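The iterative assignment can be sketched as a greedy loop over per-core MSA miss curves (illustrative Python; the input shape `miss_curves[core][ways]`, giving that core's misses when granted `ways` ways, is an assumption):

```python
def allocate_by_marginal_utility(miss_curves, total_ways):
    """Greedily hand out capacity one way at a time to the core whose
    miss curve predicts the largest drop (most extra hits gained)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        best, best_gain = None, -1
        for core, curve in enumerate(miss_curves):
            if alloc[core] + 1 >= len(curve):
                continue             # this core's curve is exhausted
            gain = curve[alloc[core]] - curve[alloc[core] + 1]
            if gain > best_gain:
                best, best_gain = core, gain
        if best is None:
            break                    # no core can use more capacity
        alloc[best] += 1
    return alloc
```

With curves [100, 50, 45, 44] and [100, 90, 20, 19] and four ways to give, the loop alternates between the two knees and ends at two ways each: each core keeps capacity only while its data structures are still becoming cache-contained.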

Page 12: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware LLC partitions

(a) Ideal MSA model

(b) Banked true LRU
– Cascaded banks
– Power inefficient

(c) Realistic banking
– Allocation policy
  • Hash allocation
  • Random allocation
– Bank granularity
  • Uniform requirement

Page 13: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware allocation heuristics

General idea

– As capacity grows, coarser assignment is good enough

Only share portions of Local cache banks between neighbors

Central banks are assigned to a specific core

– Any core to receive central banks is also assigned full local capacity

Page 14: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cache allocation flowchart

Assign full cache banks first (steps 1-3)

– All cores that have multiple banks are complete

Partition remaining local banks (steps 4-7)

– Fine tune assignment

– Sharing pairs
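A toy version of this two-phase split might look as follows (a hypothetical simplification: center-bank placement, the full-local-capacity rule, and the `ways_per_bank` value are assumptions, and neighbour pairing itself is left to the caller):

```python
def bank_aware_assign(ideal_ways, ways_per_bank=8):
    """Split an ideal per-core allocation into (1) whole banks for
    cores spanning at least one bank, where coarse rounding is good
    enough, and (2) leftover way-granular shares that neighbouring
    cores can split inside one shared local bank."""
    whole_banks, partial_ways = {}, {}
    for core, ways in enumerate(ideal_ways):
        if ways >= ways_per_bank:
            whole_banks[core] = round(ways / ways_per_bank)
        else:
            partial_ways[core] = ways   # to be paired with a neighbour
    return whole_banks, partial_ways
```

For ideal shares of 21, 9, 5, and 3 ways with 8-way banks, the first two cores round to 3 and 1 whole banks, while the last two together fit one shared local bank (5 + 3 = 8 ways).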

Page 15: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Evaluation

Page 16: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Methodology

Workloads

– 8 cores running mixes drawn from 26 SPEC CPU 2000 workloads

– What benchmark mix?

– Typical approach is to classify with limited experiments
  • We wanted to cover a larger state space

Monte Carlo

– Compare bank-aware miss rate to ideal assignment
  • Show the algorithm works for many cases

Detailed simulation

– Cycle accurate

– Full system
  • Simics+GEMS+CourseBanks+CachePartitions

Page 17: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Monte Carlo

How close is bank-aware assignment to the ideal monolithic one?

Graph shows miss rate reduction

– 1000 random SPEC CPU 2000 benchmark mixes

97% correlation in miss rates
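The shape of this methodology can be illustrated end to end with synthetic inputs (everything below — the curve shapes, the step sizes, the Pearson helper — is an assumption for illustration; the real study profiles SPEC CPU 2000 mixes, and the 97% figure comes from those profiles, not from this toy):

```python
import random

def synthetic_curve(rng, max_ways):
    # Monotone non-increasing miss curve standing in for an MSA profile.
    misses = [rng.uniform(50.0, 100.0)]
    for _ in range(max_ways):
        misses.append(misses[-1] * rng.uniform(0.5, 1.0))
    return misses

def greedy_misses(curves, total_ways, step):
    # Marginal-utility allocation at `step`-way granularity; step=1
    # approximates the ideal monolithic cache, a larger step mimics
    # coarse, bank-granular assignment. Returns total misses.
    alloc = [0] * len(curves)
    budget = total_ways
    while budget >= step:
        gains = [c[a] - c[min(a + step, len(c) - 1)]
                 for c, a in zip(curves, alloc)]
        i = gains.index(max(gains))
        alloc[i] = min(alloc[i] + step, len(curves[i]) - 1)
        budget -= step
    return sum(c[a] for c, a in zip(curves, alloc))

def pearson(xs, ys):
    # Plain Pearson correlation coefficient over two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
ideal, banked = [], []
for _ in range(1000):                                      # random workload mixes
    curves = [synthetic_curve(rng, 16) for _ in range(8)]  # 8 cores
    ideal.append(greedy_misses(curves, 16, step=1))        # fine-grained ideal
    banked.append(greedy_misses(curves, 16, step=4))       # bank-granular
r = pearson(ideal, banked)
```

The point is only that the comparison is cheap to run at scale: each random mix costs two greedy passes, so covering a large state space is tractable.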

Page 18: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Workload sets for detailed simulation

Page 19: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cycle accurate simulation

Overall

– Miss ratio

• 70% reduction over shared

• 25% reduction over equal

– Throughput

• 43% increase over shared

• 11% increase over equal

[Figure: Relative Miss Rate per workload set (Set1–Set8, GM) — No-Partitions vs. Equal-Partitions vs. Bank-aware]

[Figure: Relative CPI per workload set (Set1–Set8, GM) — No-Partitions vs. Equal-Partitions vs. Bank-aware]

Page 20: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Conclusion/future work

Significant miss rate reduction/throughput improvement possible

– Partitions are very important

– Marginal utility can work with realistic banked CMP caches

Heterogeneous benchmark suites needed

– Can't evaluate all combinations

– Hand-chosen combinations are hard to compare across proposals

Page 21: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Thank You, Questions?

Laboratory for Computer Architecture
University of Texas Austin
& IBM Austin

