Page 1: cloudcache.pptx (people.cs.pitt.edu/~cho/cs2410/current/lect-cloudcache_2up.pdf)

CloudCache: Expanding and Shrinking Private Caches

Sangyeun Cho

Computer Science Department, University of Pittsburgh

University of Pittsburgh

Credits

Parts of the work presented in this talk come from results obtained in collaboration with students and faculty at the University of Pittsburgh:
• Mohammad Hammoud

• Lei Jin

• Hyunjin Lee

• Kiyeon Lee

• Prof. Bruce Childers

• Prof. Rami Melhem

Partial support was provided by:
• NSF grant CCF-0702236

• NSF grant CCF-0952273

• A. Richard Newton Graduate Scholarship, ACM DAC 2008

Page 2

University of Pittsburgh

Recent multicore design trends

Modular designs based on relatively simple cores
• Easier to validate (a single core)
• Easier to scale (the same validated design replicated multiple times)
• Due to these reasons, future “many-core” processors may look like this
• Examples: Tilera TILE*, Intel 48-core chip [Howard et al., ISSCC ’10]

University of Pittsburgh

Recent multicore design trends

Modular designs based on relatively simple cores
• Easier to validate (a single core)
• Easier to scale (the same validated design replicated multiple times)
• Due to these reasons, future “many-core” processors will look like this
• Examples: Tilera TILE*, Intel 48-core chip [Howard et al., ISSCC ’10]

Design focus shifts from cores to “uncores”
• Network on-chip (NoC): bus, crossbar, ring, 2D/3D mesh, …
• Memory hierarchy: caches, coherence mechanisms, memory controllers, …
• System-on-chip components like network MAC, cipher, …

Note also:
• In this design, aggregate cache capacity increases as we add more tiles
• Some tiles may be active while others are inactive

Page 3

University of Pittsburgh

L2 cache design issues

L2 caches occupy a major portion of the chip (probably the largest)

Capacity vs. latency
• Off-chip access (still) expensive
• On-chip latency can grow; with switched on-chip networks, we have NUCA (Non-Uniform Cache Architecture)
• Private cache vs. shared cache or a hybrid scheme?

Quality of service
• Interference (capacity/bandwidth)
• Explicit/implicit capacity partitioning

Manageability, fault and yield
• Not all caches have to be active
• Hard (irreversible) faults + process variation + soft faults

University of Pittsburgh

Private cache vs. shared cache

Private cache:
• Short hit latency (always local)
• High on-chip miss rate
• Long miss resolution time (i.e., are there replicas elsewhere?)

Shared cache:
• Low on-chip miss rate
• Straightforward data location
• Long average hit latency
• Uncontrolled interference

Page 4

University of Pittsburgh

Historical perspective

Private designs existed from early on
• Multiprocessors based on microprocessors

Shared cache clustering [Nayfeh and Olukotun, ISCA ‘94]

Multicores with a private design are more straightforward
• Earlier AMD chips and Sun Microsystems’ UltraSPARC processors

Intel, IBM, and Sun pushed for shared caches (e.g., Xeon products, POWER-4/5/6, Niagara*)
• A variety of workloads perform relatively well

More recently, private caches regain popularity
• We have more transistors
• Interference is easily controlled
• Match well with on-/off-chip (shared) L3 $$ [Oh et al., ISVLSI ‘09]

University of Pittsburgh

Shared $$ issue 1: NUCA

A compromise design
• Large monolithic shared cache: conceptually simple but very slow

• Smaller cache slices w/ disparate latencies [Kim et al., ASPLOS ‘02]

Interleave data to all available slices (e.g., IBM POWER-*)

Problems
• Blind data distribution with no provisions for latency hiding

• As a result, simple NUCA performance is not scalable

[Figure: “your program” runs on one tile while “your data” is interleaved across all cache slices]

Page 5

University of Pittsburgh

Improving program-data distance

Key idea: replicate or migrate

Victim replication [Zhang and Asanović, ISCA ’05]

• Victims from private L1 cache are “replicated” to local L2 bank

• Essentially, local L2 banks incorporate large victim caching space

• A natural hybrid scheme that takes advantages of shared & private $$

Adaptive Selective Replication [Beckmann et al., MICRO ‘06]

• Uncontrolled replication can be detrimental (shared cache degenerates to private cache)

• Based on benefit and cost, control replication level

• Hardware support quite complex

Adaptive Controlled Migration [Hammoud, Cho, and Melhem, HiPEAC ‘09]

• Migrate (rather than replicate) a block to minimize average access latency to this block based on Manhattan distance

• Caveat: book-keeping overheads
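To make the distance-minimization idea above concrete, here is a minimal sketch (not the ACM mechanism itself) that picks a host tile for a block by minimizing the access-weighted Manhattan distance; the mesh size and per-requester access counts are assumptions for the example.

```python
# Sketch: choose the tile that minimizes the access-weighted Manhattan
# distance to a block's requesters (illustrative; assumes a 4x4 mesh and
# that per-requester access counts are available).

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_host(requesters, mesh_dim=4):
    """requesters: {(x, y) tile: access count}. Returns the tile that
    minimizes the total access-weighted Manhattan distance."""
    tiles = [(x, y) for x in range(mesh_dim) for y in range(mesh_dim)]
    return min(tiles, key=lambda host: sum(
        cnt * manhattan(host, req) for req, cnt in requesters.items()))

# Two heavy requesters near (0, 0) pull the block toward that corner.
print(best_host({(0, 0): 90, (1, 0): 60, (3, 3): 10}))   # -> (0, 0)
```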

University of Pittsburgh

3-way communication

Once you migrate a cache block away from its home, how do you locate it? [Hammoud, Cho, and Melhem, HiPEAC ‘09]

[Figure: block B, its home tile, its new host, and another requester; the request travels requester → home tile → new host → requester (3-way communication)]

Page 6


University of Pittsburgh

Cache block location strategies

Always-go-home
• “Go to the home tile (directory) to find the tracking information”
• Lots of 3-way cache-to-cache transfers

Broadcasting
• “Don’t remember anything”
• Expensive

Caching of tracking information at potential requesters
• “Remember (locally) where previously accessed blocks are”
• Quickly locate the target block locally (after first miss)
• Need maintenance (coherence) of distributed information
• We proposed and studied one such technique [Hammoud, Cho, and Melhem, HiPEAC ‘09]

Page 7

University of Pittsburgh

Caching of tracking information

[Figure: the home tile of B and a requester tile, each with an L2$$, directory, TrT, and core; the home tile holds the primary tracking information for B, and the requester tile holds replicated tracking information (tag + pointer) in its TrT]

Primary tracking information at home tile
• Shows where the block has been migrated to
• Keeps track of who has copied the tracking information (bit vector)

Tracking information table (TrT)
• A copy of the tracking information (only one pointer)
• Updated as necessary (initiated by the directory of the home tile)
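A minimal sketch of the two structures just described; the field names follow the slide (current host, copier bit vector, tag + pointer), but widths and encoding are illustrative, not the exact design.

```python
# Illustrative model of the tracking structures:
# - the home tile's primary entry records where block B migrated to and
#   which tiles have copied this tracking information (bit vector);
# - a requester's TrT entry holds only a tag and one pointer to the host.

class PrimaryTrackingEntry:
    def __init__(self, current_host, num_tiles):
        self.current_host = current_host        # tile now holding the block
        self.copier_bits = [False] * num_tiles  # tiles with a replicated copy

class TrTEntry:
    def __init__(self, tag, host_ptr):
        self.tag = tag            # block tag
        self.host_ptr = host_ptr  # single pointer to the current host tile

# When the block migrates again, the home tile walks copier_bits and updates
# (or invalidates) every TrT that cached the now-stale pointer.
```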

University of Pittsburgh

A flexible data mapping approach

Key idea: What if we distribute memory pages to cache banks instead of memory blocks? [Cho and Jin, MICRO ‘06]

[Figure: distributing data to cache banks at memory-block granularity vs. memory-page granularity]
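The contrast between block- and page-granularity mapping can be shown with a toy address decomposition; the 64B block, 4KB page, and 16-bank figures below are assumptions for the example, not parameters from the paper.

```python
# Toy bank-selection functions: block-granularity interleaving vs.
# page-granularity mapping (64B blocks, 4KB pages, 16 banks assumed).

BLOCK_BITS, PAGE_BITS, NUM_BANKS = 6, 12, 16

def bank_by_block(addr):
    # Consecutive blocks go to different banks (hardware interleaving).
    return (addr >> BLOCK_BITS) % NUM_BANKS

def bank_by_page(addr):
    # Every block of a page lands in one bank, so the OS can steer data
    # simply by choosing which physical frame backs a virtual page.
    return (addr >> PAGE_BITS) % NUM_BANKS

addr = 0x12345678
print(bank_by_block(addr), bank_by_page(addr))
```

With page granularity the mapping decision moves from hardware interleaving to the OS page allocator, which is the point of this slide.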

Page 8

University of Pittsburgh

Observation

[Figure: OS page allocation maps memory pages from Program 1 and Program 2 to different cache banks]

Software maps data to different $$
Key: OS page allocation policies
Flexible

University of Pittsburgh

Shared $$ issue 2: interference

Co-scheduled programs compete for cache capacity freely (“capitalism”)

Well-behaving programs get hurt if another program vies for more cache capacity with little reuse (e.g., streaming)

Performance loss due to interference is hard to predict

[Figure: measured interference on an 8-core Niagara]

Page 9

University of Pittsburgh

Explicit/implicit partitioning

Strict partitioning based on system directives
• E.g., program A gets 768KB and program B 256KB
• System must allow a user to specify partition sizes [Iyer, ICS ‘04][Guo et al., MICRO ‘07]

Architectural/system mechanisms
• Way partitioning (e.g., 3 vs. 2 ways) [Suh et al., ICS ‘01][Kim et al., PACT ‘04]
• Bin/bank partitioning: maps pages to cache bins (no hardware support needed) [Liu et al., HPCA ‘04][Cho, Jin and Lee, RTCSA ‘07][Lin et al., HPCA ‘08]

[Figure: 3 ways allocated to program A, 2 ways to program B]

University of Pittsburgh

Explicit/implicit partitioning

Strict partitioning based on system directives
• E.g., program A gets 768KB and program B 256KB
• System must allow a user to specify partition sizes [Iyer, ICS ‘04][Guo et al., MICRO ‘07]

Architectural/system mechanisms
• Way partitioning (e.g., 5 vs. 3 ways) [Suh et al., ICS ‘01][Kim et al., PACT ‘04] (see the sketch below)
• Bin/bank partitioning: maps pages to cache bins (no hardware support needed) [Liu et al., HPCA ‘04][Cho, Jin and Lee, RTCSA ‘07][Lin et al., HPCA ‘08]

Implicit partitioning based on utility (“utilitarian”)
• Way partitioning based on marginal gains [Suh et al., ICS ‘01][Qureshi et al., MICRO ‘06]
• Bank level [Lee, Cho and Childers, HPCA ’10]
• Bank + way level [Lee, Cho and Childers, HPCA ’11]
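As a concrete picture of the way-partitioning bullet above, here is a small sketch of victim selection restricted by a per-program way mask; the 8-way set and the 5/3 split mirror the slide's example, while the LRU bookkeeping is an assumption.

```python
# Sketch: way partitioning enforced in the replacement policy. Each program
# owns a subset of the ways, and victims are picked only from the owner's
# ways, so co-runners cannot evict each other's blocks.

NUM_WAYS = 8
WAY_MASK = {"A": {0, 1, 2, 3, 4},   # 5 ways for program A
            "B": {5, 6, 7}}         # 3 ways for program B

def pick_victim(owner, lru_order):
    """lru_order: way indices of one set, least recently used first."""
    for way in lru_order:
        if way in WAY_MASK[owner]:
            return way
    raise ValueError("owner has no ways allocated in this cache")

print(pick_victim("B", lru_order=[2, 6, 0, 5, 1, 7, 3, 4]))   # -> 6
```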

Page 10

University of Pittsburgh

Explicit partitioning

University of Pittsburgh

Implicit partitioning


Page 11

University of Pittsburgh

Private $$ issue: capacity borrowing

The main drawback of private caches is fixed (small) capacity

Assume that we have an underlying mechanism for enforcing correctness of cache access semantics; key idea: borrow (or steal) capacity from others
• How much capacity do I need and opt to borrow?
• Who may be a potential capacity donor?
• Do we allow performance degradation of a capacity donor?

Cooperative Caching (CC) [Chang and Sohi, ISCA ‘06]
• Stealing coordinated by a centralized directory (not scalable)

Dynamic Spill Receive (DSR) [Qureshi, HPCA ‘09]
• Each node is a spiller or a receiver
• On eviction from a spiller, randomly choose a receiver to copy the data into; then, for later data accesses, use broadcasting (see the sketch below)

StimulusCache [Lee, Cho, and Childers, HPCA ‘10]
• It’s likely we’ll have “excess cache” in the future
• Intelligently share the capacity provided by excess caches based on utility
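The DSR spill path summarized above can be sketched as follows; node roles are taken as given inputs here (DSR actually decides them with set dueling, which this sketch omits), and the data structures are purely illustrative.

```python
import random

# Sketch of the Dynamic Spill-Receive idea: a spiller copies its evicted
# block into a randomly chosen receiver instead of dropping it; a later
# local miss broadcasts to find such spilled copies.

def on_eviction(node, victim_block, roles, remote_caches):
    receivers = [n for n, r in roles.items() if r == "receiver" and n != node]
    if roles[node] == "spiller" and receivers:
        target = random.choice(receivers)
        remote_caches[target].add(victim_block)   # spill rather than drop

def on_local_miss(block, remote_caches):
    # Broadcast: ask every other node whether it holds a spilled copy.
    return [n for n, cache in remote_caches.items() if block in cache]

roles = {0: "spiller", 1: "receiver", 2: "receiver", 3: "spiller"}
caches = {n: set() for n in roles}
on_eviction(0, "blk_0xA0", roles, caches)
print(on_local_miss("blk_0xA0", caches))   # the receiver that got the spill
```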

University of Pittsburgh

Core disabling uneconomical

Many unused (excess) L2 caches exist

Problem exacerbated with many cores

[Figures: probability distribution of the number of sound cores/L2 caches for an 8-core chip and a 32-core chip, with separate curves for the L2 cache, the processing logic, and the whole core (L2 + proc. logic)]

Page 12

University of Pittsburgh

Dynamic sharing policy

Flow-in#N: # data blocks flowing to EC#N
Hit#N: # data blocks hit at EC#N

[Figure: Cores 0-3 with private L2 caches share two excess caches (EC#1, EC#2) in front of main memory; Flow-in and Hit counters are maintained for each EC]

University of Pittsburgh

Hit/Flow-in ↑ → more ECs; Hit/Flow-in ↓ → fewer ECs

Dynamic sharing policy

[Figure: the same configuration, with the Hit#N/Flow-in#N ratio of each EC steering the allocation]

Page 13

University of Pittsburgh

Dynamic sharing policy

Core 0: needs at least 1 EC; no harmful effect on EC#2 → allocate 2 ECs

[Figure: the four-core example, driven by Hits#1]

University of Pittsburgh

Dynamic sharing policy

Core 1: needs at least 2 ECs → allocate 2 ECs

[Figure: driven by Hits#2 and Flow-in#2]

Page 14

University of Pittsburgh

Dynamic sharing policy

Core 2: needs at least 1 EC, but has a harmful effect on EC#2 → allocate 1 EC

[Figure: driven by Hits#1 and Flow-in#2]

University of Pittsburgh

Dynamic sharing policy

Core 3: no benefit from ECs → allocate 0 ECs

[Figure: driven by Flow-in#2]

Page 15

University of Pittsburgh

Dynamic sharing policy

Maximized capacity utilization, minimized capacity interference

[Figure: final allocation per core (EC#): Core 0: 2 ECs, Core 1: 2 ECs, Core 2: 1 EC, Core 3: 0 ECs]
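The walk-through above boils down to a per-core decision driven by the Hit and Flow-in counters. The sketch below captures that spirit; the thresholds and the single-pass structure are invented for illustration and are not the published StimulusCache policy.

```python
# Sketch: decide how many excess caches (ECs) a core should get by walking
# its EC chain, nearest first. A level is granted when the hits it supplies
# are a healthy fraction of the blocks flowing into it; a level whose
# flow-in yields almost no hits is treated as pollution and the chain stops.
# The 0.5 / 0.1 thresholds are illustrative only.

def ecs_to_allocate(levels, useful=0.5, harmless=0.1):
    """levels: list of (hits, flow_in) per EC level, nearest EC first."""
    granted = 0
    for hits, flow_in in levels:
        if flow_in == 0:
            break
        ratio = hits / flow_in
        if ratio >= useful:
            granted += 1          # this level clearly helps the core
        elif ratio <= harmless:
            break                 # mostly pollution: stop extending the chain
        else:
            granted += 1          # marginal benefit: take it, but go no further
            break
    return granted

# Mirrors the example: a core like Core 0 gets 2 ECs, one like Core 3 gets 0.
print(ecs_to_allocate([(80, 100), (60, 90)]))   # -> 2
print(ecs_to_allocate([(2, 100), (1, 80)]))     # -> 0
```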

University of Pittsburgh

Summary: priv. vs. shared caching

For a small- to medium-scale multicore processor, shared caching appears good (for its simplicity)
• A variety of workloads run well
• Care must be taken about interference
• Because of NUCA effects, this approach is not scalable!

For larger-scale multicore processors, private cache schemes appear more promising
• Free performance isolation
• Less NUCA effect
• Easier fault isolation
• Capacity stealing is a necessity

Page 16

University of Pittsburgh

Summary: hypothetical caching

[Figure: relative performance to shared caching (w/o profiling) with a 256KB L2 cache slice for SPEC CPU2000 benchmarks (gzip, vpr, gcc, mcf, crafty, parser, eon, gap, vortex, bzip2, twolf, wupwise, swim, mgrid, mesa, art, equake, ammp, g-mean); bars compare private w/ profiling, shared w/ profiling, and static 2D coloring]

Tackle both miss rate and locality through judicious data mapping! [Jin and Cho, ICPP ‘08]

University of Pittsburgh

Summary: hypothetical caching

[Figure: number of pages mapped to each 256KB L2 cache bank for GCC, plotted across the X-Y grid of banks relative to the program's location]

Page 17

University of Pittsburgh

Summary: hypothetical caching

[Figure: for each SPEC CPU2000 benchmark (ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise), behavior across alpha values from 0 to 1 with a 256KB L2 cache bank, with private caching and shared caching as the reference points]

University of Pittsburgh

In the remainder of this talk

CloudCache
• Private cache based scheme for large-scale multicore processors

• Capacity borrowing at way/bank level & distance awareness

Page 18

University of Pittsburgh

A 30,000-foot snapshot

[Figure: a tiled chip where each working core (C) builds a “cache cloud” from nearby cache ways chained in order (1-9); ways near the core give a high hit rate and fast access, distant ways a low hit rate and slow access]

Each cloud is an exclusive private cache for a working core

Clouds are built with nearby cache ways organized in a chain

Clouds are built in a two-step process: Dynamic global partitioning and cache cloud formation

University of Pittsburgh

Step 1: Dynamic global partitioning

Global capacity allocator (GCA) runs a partitioning algorithm periodically using “hit count” information
• Each working core sends the GCA hit counts from its “allocated capacity” and from additional “monitoring capacity” (of 32 ways)

Our partitioning considers both utility [Qureshi & Patt, MICRO ‘06] and QoS

This step computes how much capacity to give to each core

[Figure: hit counts from the allocated and monitoring capacity travel over the network to the GCA, whose counter buffer and allocation engine produce cache allocation info sent back to the cores]
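The utility side of this step can be pictured as a greedy marginal-gain allocation over per-core hit curves (in the spirit of Qureshi & Patt's utility-based partitioning); the curves, the capacity unit, and the greedy loop below are assumptions for illustration, not the GCA's exact algorithm, and the QoS constraint from a later slide is left out.

```python
# Sketch: utility-driven capacity partitioning. Each core reports a curve
# hits[k] = hits it would see with k capacity units (reconstructed from its
# allocated + monitoring hit counters). Units are granted one at a time to
# the core with the largest marginal gain. Example curves are made up.

def partition(hit_curves, total_units):
    alloc = {core: 0 for core in hit_curves}
    def gain(core):
        k, curve = alloc[core], hit_curves[core]
        return curve[k + 1] - curve[k] if k + 1 < len(curve) else 0
    for _ in range(total_units):
        winner = max(alloc, key=gain)
        if gain(winner) <= 0:
            break                      # no one benefits from more capacity
        alloc[winner] += 1
    return alloc

curves = {
    "core0": [0, 50, 90, 120, 140, 150, 155, 157, 158],   # cache-hungry
    "core1": [0, 10, 18, 24, 28, 30, 31, 31, 31],         # small footprint
}
print(partition(curves, total_units=8))   # most units go to core0
```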

Page 19

University of Pittsburgh

Step 2: Cache cloud formation

1. Local L2 cache first
2. Threads w/ larger capacity demand first
3. Closer L2 banks first
4. Allocate capacity as much as possible

Allocation is done only for sound L2 banks

[Figure: worked example with capacity goals of 2.75 and 1.25 banks; each core allocates its goal bank by bank, starting locally and taking closer banks first, with the remaining “capacity to allocate” shrinking each step (2.75 → 1.75 → 0.75 → 0 and 1.25 → 0.25 → 0); repeat!]
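The four rules can be turned into a short allocation loop; the mesh geometry, the one-bank capacity unit, and the tie-breaking below are illustrative assumptions rather than the exact CloudCache procedure.

```python
# Sketch of cache cloud formation: serve threads with the larger capacity
# goal first; each thread takes capacity from its local bank, then from
# sound banks in increasing hop distance, until its goal is met.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def form_clouds(goals, sound_banks, bank_capacity=1.0):
    """goals: {core tile: capacity goal in banks}; returns per-core chains."""
    remaining = {b: bank_capacity for b in sound_banks}
    clouds = {}
    for core, goal in sorted(goals.items(), key=lambda kv: -kv[1]):
        chain, need = [], goal
        for bank in sorted(sound_banks, key=lambda b: manhattan(core, b)):
            if need <= 0:
                break
            take = min(need, remaining[bank])
            if take > 0:
                chain.append((bank, round(take, 3)))
                remaining[bank] -= take
                need -= take
        clouds[core] = chain
    return clouds

# Same goals as the slide's example: 2.75 banks and 1.25 banks.
print(form_clouds({(0, 0): 2.75, (1, 1): 1.25},
                  sound_banks=[(0, 0), (0, 1), (1, 0), (1, 1)]))
```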

University of Pittsburgh

Cache cloud chaining example

[Figure: cache cloud chaining example on a tile group numbered 1-8; working core 4's cloud chains banks in the order 4, 1, 5, 7, 3, 6, 8, 2 (MRU to LRU) with token counts 8, 6, 6, 4, 4, 3, 2, 2 per bank, for a cloud capacity of 35; banks are ordered by 0-hop, 1-hop, then 2-hop distance, and a second cloud for working core 2 is also shown]

Page 20

University of Pittsburgh

Architectural support

[Figure: each tile has an L2 cache, router, directory, L1, and processor, augmented with a cloud table, monitor tags, and per-way hit counters; a global capacity allocator performs capacity allocation and cache cloud chaining; each cloud table entry holds (Home Core ID, Token #, Next Core ID)]

Cloud table resembles StimulusCache’s NECP

Additional hardware is per-way hit counters and monitor tags

We do not need core ID tracking information per block
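A minimal model of a cloud table entry and the chain it implies; the field names mirror the figure (Home Core ID, Token #, Next Core ID), and the example token counts are taken from the chaining example two slides earlier, truncated to three links.

```python
# Each tile's cloud table entry points to the next tile in the owner's
# chain, so a working core's cloud is a linked list of (bank, tokens).

cloud_table = {
    # tile: {home core ID, tokens (ways) granted there, next tile in chain}
    4: {"home": 4, "tokens": 8, "next": 1},
    1: {"home": 4, "tokens": 6, "next": 5},
    5: {"home": 4, "tokens": 6, "next": None},
}

def cloud_capacity(start_tile):
    tile, total = start_tile, 0
    while tile is not None:
        entry = cloud_table[tile]
        total += entry["tokens"]
        tile = entry["next"]
    return total

print(cloud_capacity(4))   # 8 + 6 + 6 = 20 ways in this truncated chain
```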

University of Pittsburgh

Capacity allocation example

[Figure: seven working cores; the global capacity allocator assigns them 10.5, 8, 12.625, 0.75, 9.625, 14.5, and 8 banks of capacity (64 banks in total)]

Page 21

University of Pittsburgh

Capacity allocation example

Repartition every ‘T’ cycles

[Figure: the allocation changes from 10.5, 8, 12.625, 0.75, 9.625, 14.5, 8 banks to 13.125, 8.5, 7.875, 7, 11, 11.5, 6 banks]

University of Pittsburgh

Capacity allocation example

[Figure: the new allocation of 13.125, 8.5, 7.875, 7, 11, 11.5, and 6 banks takes effect]

Page 22

University of Pittsburgh

CloudCache is fast?

Remote L2 access has 3-way communication

[Figure: a remote L2 access involves (1) directory lookup, (2) request forwarding, and (3) data forwarding]

Distance-aware cache cloud formation tackles only (3)!

University of Pittsburgh

Limited target broadcast (LTB)

Make common case “super fast” and rare case “not so fast”

Private data: limited target broadcast (no wait for directory lookup)
Shared data: directory-based coherence

Private data ≫ shared data (private accesses are the common case)

Page 23

University of Pittsburgh

LTB example

[Figure: timeline comparison; without broadcasting, the access walks through a directory request, directory response, and data transfer, while with broadcasting the broadcast is issued immediately and a local L2 hit or broadcast hit returns the data sooner]

University of Pittsburgh

When is LTB used?

Shared data: access ordering is managed by the directory
• LTBP is NOT used

Private data: access ordering is not needed
• Fast access with broadcast first, then notify the directory (see the sketch below)

Race condition?
• When an access request for private data arrives from a non-owner core before the directory is updated
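The private/shared split above amounts to a simple dispatch on the miss path, sketched below; the helper callbacks are placeholders, not a real coherence implementation, and the race handling that LTBP adds is omitted here.

```python
# Sketch of the access-path split: private data is broadcast to its likely
# target banks right away (the directory is told afterwards), while shared
# data keeps the usual directory-ordered path.

def l2_miss(block, is_private, likely_targets,
            send_broadcast, send_dir_request, notify_dir):
    if is_private(block):
        send_broadcast(block, likely_targets)  # common case: no directory wait
        notify_dir(block)                      # directory catches up afterwards
    else:
        send_dir_request(block)                # shared data: directory ordering

l2_miss(
    "blk_0xB0",
    is_private=lambda b: True,
    likely_targets=[1, 5, 7],                  # e.g., banks in this core's cloud
    send_broadcast=lambda b, t: print("broadcast", b, "to", t),
    send_dir_request=lambda b: print("directory request for", b),
    notify_dir=lambda b: print("notify directory about", b),
)
```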

Page 24

University of Pittsburgh

LTB protocol (LTBP)

[Figure: LTBP state diagram at the directory with states R (Ready), W (Wait), and B (Busy), and broadcast lock (BL) / broadcast unlock (BU) messages]

Directory side (input → action):
1. Request from other cores → broadcast lock request
2. Ack from the owner → process the non-owner request
3. Nack from the owner
4. Request from the owner

(R: Ready, W: Wait, B: Busy; BL: broadcast lock, BU: broadcast unlock)

L2 cache side (input → action):
1. Broadcast lock request → Ack
2. Invalidation

University of Pittsburgh

Quality of Service (QoS) support

QoS: maximum performance degradation allowed; sampling factor: 32 in this work; private cache capacity: 8 (ways) in this work; currently allocated capacity

Base cycle: cycles (time) with the private cache (estimated)

Estimated cycle(j): cycles with cache capacity ‘j’

Pick  min j  s.t.  EstimatedCycle(j) ≤ (1 + QoS) × BaseCycle
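Applying the constraint is a one-line search over the estimated-cycle curve; the cycle numbers below are invented inputs, since the slide does not spell out the full estimation formula.

```python
# Sketch: smallest capacity j (in ways) whose estimated execution time stays
# within the QoS bound relative to the private-cache baseline:
#     min j  such that  EstimatedCycle(j) <= (1 + QoS) * BaseCycle

def min_capacity(estimated_cycles, base_cycle, qos=0.05):
    """estimated_cycles[j] = estimated cycles with capacity j."""
    bound = (1 + qos) * base_cycle
    for j, cycles in enumerate(estimated_cycles):
        if cycles <= bound:
            return j
    return len(estimated_cycles) - 1   # even full capacity misses the bound

est = [300e6, 220e6, 160e6, 120e6, 108e6, 104e6, 102e6, 101e6, 100e6]
print(min_capacity(est, base_cycle=100e6, qos=0.05))   # -> 5
```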

Page 25

University of Pittsburgh

DSR vs. ECC vs. CloudCache

Dynamic Spill Receive (DSR) [Qureshi, HPCA ’09]

• Node is either spiller or receiver depending on memory usage

• Spiller nodes randomly “spill” evicted data to a receiver node

Elastic Cooperative Caching [Herrero et al., ISCA ‘10]

• Each node has private area and shared area

• Nodes with high memory demand can spill evicted data to shared area in a randomly selected node

University of Pittsburgh

Experimental setup

TPTS [Lee et al., SPE ‘10, Cho et al., ICPP ‘08]

• 64-core CMP with 8×8 2D mesh, 4-cycles/hop

• Core: Intel’s ATOM-like two-issue in-order pipeline

• Directory-based MESI protocol

• Four independent DRAM controllers, four ports/controller

• DRAM with Samsung DDR3-1600 timing

Workloads
• SPEC2006 (10B cycles): high/medium/low classes based on MPKI for varying cache capacity
• PARSEC (simlarge input set), 16 threads/application

Page 26

University of Pittsburgh

Impact of global partitioning

[Figure, repeated across several build slides: results for workload mixes of High-, Medium-, and Low-MPKI applications]

Page 27


Page 28


University of Pittsburgh

L2 cache access latency

[Figures: distribution of L2 cache accesses vs. access latency (0-300 cycles) for 401.bzip2 under Private, DSR, Shared, ECC, CloudCache w/o LTB, and CloudCache w/ LTB]

Shared: widely spread latency
Private: fast local access + off-chip access
DSR & ECC: fast local access + widely spread latency
CloudCache w/o LTB: fast local access + narrowly spread latency
CloudCache w/ LTB: fast local access + fast remote access

Page 29

University of Pittsburgh

16 threads


University of Pittsburgh

32 and 64 threads

Page 30

University of Pittsburgh

PARSEC

[Figure: speedup over private caching for PARSEC workload combinations Comb1-Comb5, comparing Shared, DSR, ECC, and CLOUD; y-axis from 0 to 1.6]

University of Pittsburgh

PARSEC

[Figure: per-application speedup over private caching within Comb2 (swaption, blacksch., canneal, bodytrack) and Comb5 (swaption, facesim, canneal, ferret), comparing Shared, DSR, ECC, and CLOUD]

Page 31

University of Pittsburgh

Effect of QoS enforcement

[Figures: speedup over private caching for individual programs under No QoS, 98% QoS, and 95% QoS; a zoomed view covers the 0.93-1.0 range across the different programs]

University of Pittsburgh

CloudCache summary

Future processors will carry many more cores and cache resources and future workloads will be heterogeneous

Low-level caches must become more scalable and flexible

We proposed CloudCache w/ three techniques:
• Global “private” capacity allocation to eliminate interference and to minimize on-chip misses
• Distance-aware data placement to overcome NUCA latency
• Limited target broadcast to overcome directory lookup latency

Proposed techniques synergistically improve performance

Still, we find global capacity allocation to be the most effective

QoS is naturally supported in CloudCache

Page 32

University of Pittsburgh

Our multicore cache publications

• Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. Best paper nominee.
• Cho, Jin and Lee, “Achieving Predictable Performance with On-Chip Shared L2 Caches for Manycore-Based Real-Time Systems,” RTCSA 2007. Invited paper.
• Hammoud, Cho and Melhem, “ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors,” HiPEAC 2008.
• Hammoud, Cho and Melhem, “Dynamic Cache Clustering for Chip Multiprocessors,” ICS 2008.
• Jin and Cho, “Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches,” ICPP 2008.
• Oh, Lee, Lee, and Cho, “An Analytical Model to Study Optimal Area Breakdown between Cores and Caches in a Chip Multiprocessor,” ISVLSI 2009.
• Jin and Cho, “SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors,” PACT 2009.
• Lee, Cho and Childers, “StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache,” HPCA 2010.
• Hammoud, Cho and Melhem, “Cache Equalizer: A Placement Mechanism for Chip Multiprocessor Distributed Shared Caches,” HiPEAC 2011.
• Lee, Cho and Childers, “CloudCache: Expanding and Shrinking Private Caches,” HPCA 2011.

