Page 1: Bank-aware Dynamic Cache Partitioning for Multicore Architectures

ICPP-38 2009

Laboratory for Computer Architecture 9/23/2009

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Dimitris Kaseridis (1), Jeff Stuecheli (1,2) & Lizy K. John (1)

(1) University of Texas – Austin  (2) IBM – Austin

Page 2: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Outline

Motivation/background

Cache partitioning/profiling

Proposed system

Results

Conclusion/future work

Page 3: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Motivation

Shared resources in CMPs

– Last Level Cache

– Memory bandwidth

Opportunity and Pitfalls

– Constructive
  • Mixing low and high cache requirements in a shared pool

– Destructive
  • Thrashing workloads (SPEC CPU 2000 art + mcf)

– Cache partitioning required

– Primary opportunity requires heterogeneous workload mixes
  • Typical in consolidation + virtualization

Page 4: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Monolithic vs NUCA vs Industry architectures

Monolithic: One large shared uniform latency cache bank on a CMP

– Does not exploit physical locality for private data

– Slow for all

CMP-NUCA: Typical proposal has a very large number of autonomous cache banks

– Very flexible (256 banks)

– Non-optimal configuration
  • Inefficient bank size (bank overhead)

Real implementations

– Fewer banks in industry

– NUCA with discrete cache levels

– Key difference is the wire assumptions made in the original NUCA analysis

[Figure: example floorplans of IBM POWER7 and Intel Nehalem EX — cores paired with a small number of cache banks]

Page 5: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Baseline System

8 cores

16 MB total capacity

– 16 x 1 MB banks

– 8 way associative

Local Banks

– Low latency to the physically close core

Center Banks

– Shared capacity

Page 6: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cache Partitioning/Profiling

Page 7: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cache Sharing/Partitioning

Last-level cache of a CMP

– Once-isolated resources are now shared
  • Drove the need for isolation

– Design space
  • Non-configurable: shared vs. private caches
  • Static partitioning/policy: long-term policy choice
  • Dynamic: real-time profiling-directed partitions
    – Trial and error (experiment to find the ideal configuration)
    – Predictive profilers
      > Non-invasive state-space exploration (our system)

Page 8: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware cache partitions

System components

– Non-invasive profiling using MSA (Mattson Stack Algorithm)

– Cache allocation using marginal utility

– Bank-aware LLC partitions

Page 9: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


MSA Based Cache Profiling

Mattson stack algorithm

– Originally proposed to concurrently simulate many cache sizes

– Structure is a true LRU cache

– Stack distance from MRU of each reference is recorded

– Misses can be calculated for any fraction of the ways
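The one-pass profiling idea above can be sketched in software (a minimal illustrative model of the algorithm itself, not of the hardware structure; set sampling and partial tags are omitted):

```python
def msa_profile(addresses, max_ways):
    """Mattson stack algorithm over one cache set: keep a true-LRU
    stack, record each reference's stack distance from the MRU, and
    derive miss counts for every associativity in a single pass."""
    stack = []                       # index 0 = MRU position
    hist = [0] * (max_ways + 1)      # hist[d]: refs at distance d; last slot = cold/deep misses
    for addr in addresses:
        if addr in stack:
            d = stack.index(addr)    # stack distance from MRU
            hist[min(d, max_ways)] += 1
            stack.remove(addr)
        else:
            hist[max_ways] += 1      # first touch: infinite stack distance
        stack.insert(0, addr)        # reference becomes the new MRU
    return hist

def misses_for_ways(hist, ways):
    # A w-way true-LRU cache hits exactly the references whose stack
    # distance is less than w, so everything at distance >= w misses.
    return sum(hist[ways:])
```

For the trace a, b, a, b a 2-way cache misses only on the two first touches, while a 1-way cache misses every time; both numbers fall out of the same histogram.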

Page 10: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Hardware MSA implementation

Naïve algorithm is prohibitive
– Fully associative
– Complete cache directory of the maximum cache size for every core on the CMP (total size)

Reduction techniques
– Set sampling
– Partial tags
– Maximal capacity

Configuration in the paper
– 12-bit tags
– 1/32 set sampling
– 9/16 bank per core
– 0.4% overhead relative to on-chip cache

[Figure: MSA profiler structure — an array of sets × ways]

Page 11: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Marginal Utility

Miss rate relative to capacity is non-linear and heavily workload-dependent

Dramatic miss rate reduction as data structures become cache contained

In practice,

– Iteratively assign cache to cores that produce the most hits per capacity
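The iterative assignment can be sketched as a greedy loop over per-core MSA miss curves (illustrative Python; the input shape `miss_curves[core][ways]`, giving that core's misses when granted `ways` ways, is an assumption):

```python
def allocate_by_marginal_utility(miss_curves, total_ways):
    """Greedily hand out capacity one way at a time to the core whose
    miss curve predicts the largest drop (most extra hits gained)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        best, best_gain = None, -1
        for core, curve in enumerate(miss_curves):
            if alloc[core] + 1 >= len(curve):
                continue             # this core's curve is exhausted
            gain = curve[alloc[core]] - curve[alloc[core] + 1]
            if gain > best_gain:
                best, best_gain = core, gain
        if best is None:
            break                    # no core can use more capacity
        alloc[best] += 1
    return alloc
```

With curves [100, 50, 45, 44] and [100, 90, 20, 19] and four ways to give, the loop alternates between the two knees and ends at two ways each: each core keeps capacity only while its data structures are still becoming cache-contained.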

Page 12: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware LLC partitions

(a) Ideal MSA model

(b) Banked true LRU
– Cascaded banks
– Power inefficient

(c) Realistic banking
– Allocation policy
  • Hash allocation
  • Random allocation
– Bank granularity
  • Uniform requirement

Page 13: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware allocation heuristics

General idea

– As capacity grows, coarser assignment is good enough

Only share portions of Local cache banks between neighbors

Central banks are assigned to a specific core

– Any core to receive central banks is also assigned full local capacity

Page 14: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cache allocation flowchart

Assign full cache banks first (steps 1-3)

– All cores that have multiple banks are complete

Partition remaining local banks (steps 4-7)

– Fine tune assignment

– Sharing pairs
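A toy version of this two-phase split might look as follows (a hypothetical simplification: center-bank placement, the full-local-capacity rule, and the `ways_per_bank` value are assumptions, and neighbour pairing itself is left to the caller):

```python
def bank_aware_assign(ideal_ways, ways_per_bank=8):
    """Split an ideal per-core allocation into (1) whole banks for
    cores spanning at least one bank, where coarse rounding is good
    enough, and (2) leftover way-granular shares that neighbouring
    cores can split inside one shared local bank."""
    whole_banks, partial_ways = {}, {}
    for core, ways in enumerate(ideal_ways):
        if ways >= ways_per_bank:
            whole_banks[core] = round(ways / ways_per_bank)
        else:
            partial_ways[core] = ways   # to be paired with a neighbour
    return whole_banks, partial_ways
```

For ideal shares of 21, 9, 5, and 3 ways with 8-way banks, the first two cores round to 3 and 1 whole banks, while the last two together fit one shared local bank (5 + 3 = 8 ways).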

Page 15: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Evaluation

Page 16: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Methodology

Workloads

– 8 cores running mixes drawn from 26 SPEC CPU 2000 workloads

– What benchmark mix?

– Typical approach is to classify with limited experiments
  • We wanted to cover a larger state space

Monte Carlo

– Compare bank-aware miss rate to ideal assignment
  • Show the algorithm works for many cases

Detailed simulation

– Cycle accurate

– Full system
  • Simics+GEMS+CourseBanks+CachePartitions

Page 17: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Monte Carlo

How close is bank-aware assignment to the ideal monolithic one?

Graph shows miss rate reduction

– 1000 random SPEC CPU 2000 benchmark mixes

97% correlation in miss rates
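The shape of this methodology can be illustrated end to end with synthetic inputs (everything below — the curve shapes, the step sizes, the Pearson helper — is an assumption for illustration; the real study profiles SPEC CPU 2000 mixes, and the 97% figure comes from those profiles, not from this toy):

```python
import random

def synthetic_curve(rng, max_ways):
    # Monotone non-increasing miss curve standing in for an MSA profile.
    misses = [rng.uniform(50.0, 100.0)]
    for _ in range(max_ways):
        misses.append(misses[-1] * rng.uniform(0.5, 1.0))
    return misses

def greedy_misses(curves, total_ways, step):
    # Marginal-utility allocation at `step`-way granularity; step=1
    # approximates the ideal monolithic cache, a larger step mimics
    # coarse, bank-granular assignment. Returns total misses.
    alloc = [0] * len(curves)
    budget = total_ways
    while budget >= step:
        gains = [c[a] - c[min(a + step, len(c) - 1)]
                 for c, a in zip(curves, alloc)]
        i = gains.index(max(gains))
        alloc[i] = min(alloc[i] + step, len(curves[i]) - 1)
        budget -= step
    return sum(c[a] for c, a in zip(curves, alloc))

def pearson(xs, ys):
    # Plain Pearson correlation coefficient over two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
ideal, banked = [], []
for _ in range(1000):                                      # random workload mixes
    curves = [synthetic_curve(rng, 16) for _ in range(8)]  # 8 cores
    ideal.append(greedy_misses(curves, 16, step=1))        # fine-grained ideal
    banked.append(greedy_misses(curves, 16, step=4))       # bank-granular
r = pearson(ideal, banked)
```

The point is only that the comparison is cheap to run at scale: each random mix costs two greedy passes, so covering a large state space is tractable.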

Page 18: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Workload sets for detailed simulation

Page 19: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Cycle accurate simulation

Overall

– Miss ratio

• 70% reduction over shared

• 25% reduction over equal

– Throughput

• 43% increase over shared

• 11% increase over equal

[Figure: Relative Miss Rate per workload set (Set1–Set8, GM) — No-Partitions vs. Equal-Partitions vs. Bank-aware]

[Figure: Relative CPI per workload set (Set1–Set8, GM) — No-Partitions vs. Equal-Partitions vs. Bank-aware]

Page 20: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Conclusion/future work

Significant miss rate reduction/throughput improvement possible

– Partitions are very important

– Marginal utility can work with realistic banked CMP caches

Heterogeneous benchmark suites needed

– Can't evaluate all combinations

– Hand-chosen combinations are hard to compare across proposals

Page 21: Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Thank You, Questions?

Laboratory for Computer Architecture
University of Texas Austin
& IBM Austin

