ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Lecture 6 Fair Caching Mechanisms for CMP
Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04]
[Figure sequence: two processor cores, each with a private L1 cache, sharing one L2 cache. Thread t1 arrives first and fills the shared L2; when thread t2 starts, t1's working set has crowded out the space t2 needs.]
t2's throughput is significantly reduced due to unfair cache sharing.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Shared L2 Cache Space Contention
[Figure: bar charts of gzip's normalized cache misses per instruction (scale 0 to 10) and gzip's normalized IPC (scale 0 to 1.2) for gzip(alone), gzip+applu, gzip+apsi, gzip+art, and gzip+swim.]
Impact of Unfair Cache Sharing
• Uniprocessor scheduling: priorities are enforced through time slices.
• 2-core CMP scheduling: gzip will get more time slices than others if gzip is set to run at higher priority, yet it could still run slower than the others due to unfair cache sharing (priority inversion).
• It could further slow down the other processes (starvation).
• Thus the overall throughput is reduced; the desired behavior is uniform slowdown instead.
[Figure: time-slice diagrams for threads t1-t4 under uniprocessor scheduling and under 2-core CMP scheduling on processors P1 and P2.]
Stack Distance Profiling Algorithm [Qureshi+, MICRO-39]
A hit counter (CTR) is kept for each recency position from MRU (CTR Pos 0) to LRU (CTR Pos 3); a hit at recency position i increments CTR Pos i.
Example HIT counter values: CTR Pos 0 = 30, CTR Pos 1 = 20, CTR Pos 2 = 15, CTR Pos 3 = 10; Misses = 25.
Stack Distance Profiling
• A counter for each cache way; C>A is the counter for misses.
• Shows the reuse frequency for each way (recency position) in a cache.
• Can be used to predict the misses for any associativity smaller than A.
– Misses for a 2-way cache for gzip = C>A + Σ Ci for i = 3 to 8 (in an 8-way cache).
• art does not need all the space owing to its likely poor temporal locality.
• If the space given to art is halved and handed to gzip, what happens?
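A minimal sketch of this prediction, assuming an 8-way cache; the counter values and function name are illustrative, not from the paper:

```python
# Sketch: predict misses for a smaller associativity from stack-distance
# hit counters C1..C8 (index 0 = MRU position) plus the miss counter C>A.
def predicted_misses(hit_counters, miss_counter, target_ways):
    # Hits at recency positions deeper than target_ways would have missed.
    return miss_counter + sum(hit_counters[target_ways:])

C = [30, 20, 15, 10, 8, 6, 4, 2]  # hypothetical per-position hit counts
print(predicted_misses(C, 25, 2))  # 25 + (15+10+8+6+4+2) = 70
```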
Fairness Metrics [Kim et al. PACT'04]
• Uniform slowdown is the fairness goal: each thread should be slowed by the same factor when sharing the cache,

T_i^shared / T_i^alone = T_j^shared / T_j^alone

where T_i^alone is the execution time of thread t_i when it runs alone and T_i^shared is its execution time when it shares the cache with others.
• We want to minimize M0_ij = |X_i − X_j| with X_i = T_i^shared / T_i^alone; ideally M0_ij = 0.
• Execution-time ratios are hard to measure online, so miss-based proxies try to equalize the ratio of miss increase of each thread:

M1_ij = |X_i − X_j|, where X_i = Miss_i^shared / Miss_i^alone
M3_ij = |X_i − X_j|, where X_i = MissRate_i^shared / MissRate_i^alone
M5_ij = |X_i − X_j|, where X_i = MissRate_i^shared − MissRate_i^alone
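These metrics share one pairwise form and differ only in how X is defined; a minimal sketch with hypothetical per-thread values:

```python
# Sketch of the pairwise fairness metric M_ij = sum |X_i - X_j|; the
# definition of X selects which metric (M0, M1, M3, M5) is computed.
def fairness_metric(xs):
    return sum(abs(xs[i] - xs[j])
               for i in range(len(xs))
               for j in range(i + 1, len(xs)))

m0 = fairness_metric([1.5, 1.1])  # X_i = T_shared / T_alone
m1 = fairness_metric([2.0, 1.2])  # X_i = Miss_shared / Miss_alone
print(round(m0, 3), round(m1, 3))  # 0.4 0.8
```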
Partitionable Cache Hardware
• Modified LRU cache replacement policy [G. E. Suh et al., HPCA 2002].
• A per-thread counter tracks the space each process currently occupies (Current Partition) and compares it against a Target Partition.
• Example: the Current Partition is P1: 448B, P2: 576B while the Target Partition is P1: 384B, P2: 640B. On a P2 miss, the policy evicts P1's LRU line (LRU*) rather than P2's, moving the Current Partition to P1: 384B, P2: 640B, which matches the target.
• Partition granularity could be as coarse as one entire cache way.
Dynamic Fair Caching Algorithm
Example: optimizing the M3 metric. Three sets of counters are maintained per repartitioning interval:
• MissRate alone (P1, P2): miss rates of each process running alone (from stack distance profiling).
• MissRate shared (P1, P2): dynamic miss rates measured while running with the shared cache.
• Target Partition (P1, P2): the target partition sizes.
A repartitioning interval of 10K cache accesses was found to be the best.
Dynamic Fair Caching Algorithm: Walkthrough
• 1st interval: MissRate alone is P1: 20%, P2: 5%. With the initial Target Partition of P1: 256KB, P2: 256KB, the measured MissRate shared is P1: 20%, P2: 15%.
• Repartition! Evaluate M3: P1: 20%/20% = 1.0 vs. P2: 15%/5% = 3.0. P2 suffers the larger relative miss increase, so capacity shifts toward it: the Target Partition becomes P1: 192KB, P2: 320KB (partition granularity: 64KB).
• 2nd interval: with the new partition, MissRate shared becomes P1: 20%, P2: 10%.
• Repartition! Evaluate M3: P1: 20%/20% vs. P2: 10%/5%. P2 is still slowed more, so the Target Partition becomes P1: 128KB, P2: 384KB.
• 3rd interval: MissRate shared becomes P1: 25%, P2: 9%.
• Repartition! Roll back if the benefit is too small: with Δ = MRold − MRnew, do a rollback for P2 if Δ < Trollback. Here P2 improved only from 10% to 9% while P1 got worse, so the repartitioning is rolled back to P1: 192KB, P2: 320KB. The best Trollback threshold was found to be 20%.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Generic Repartitioning Algorithm
• Pick the largest and smallest as a pair for repartitioning.
• Repeat for all candidate processes.
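One repartitioning step of the dynamic fair caching scheme can be sketched as follows; the function names, and the interpretation of Δ as an absolute miss-rate difference, are my assumptions:

```python
GRANULARITY_KB = 64    # partition granularity from the slides
T_ROLLBACK = 0.20      # best rollback threshold reported

def repartition(target, miss_alone, miss_shared):
    """Shift one granule from the least-slowed to the most-slowed thread (M3)."""
    x = {p: miss_shared[p] / miss_alone[p] for p in target}
    loser = max(x, key=x.get)    # largest relative miss increase
    winner = min(x, key=x.get)   # smallest relative miss increase
    new = dict(target)
    new[winner] -= GRANULARITY_KB
    new[loser] += GRANULARITY_KB
    return new

def should_rollback(mr_old, mr_new):
    # Roll back if the beneficiary's miss-rate gain is below the threshold.
    return (mr_old - mr_new) < T_ROLLBACK

target = {"P1": 256, "P2": 256}
target = repartition(target, {"P1": 0.20, "P2": 0.05},
                     {"P1": 0.20, "P2": 0.15})
print(target)                       # {'P1': 192, 'P2': 320}
print(should_rollback(0.10, 0.09))  # True: gain too small, undo the move
```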
Utility-Based Cache Partitioning (UCP)

Running Processes on a Dual-Core [Qureshi & Patt, MICRO-39]
• LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr.
• UTIL: how much you use (in a set) is how much you will get; ideally, 3 ways to equake and 13 to vpr.
[Figure: miss curves for equake and vpr as a function of the number of ways given (1 to 16).]
Defining Utility
Utility U(a,b) = Misses with a ways − Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications.]
Slide courtesy: Moin Qureshi, MICRO-39
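The utility definition maps directly to code; the miss curve below is hypothetical:

```python
# Sketch: utility from a miss curve, where misses[w] = misses with w ways.
def utility(misses, a, b):
    """U(a,b) = misses with a ways - misses with b ways (a < b)."""
    return misses[a] - misses[b]

misses = [None, 50, 30, 25, 24, 24, 24, 24, 24]  # index 0 unused
print(utility(misses, 1, 2))  # 20 -> high utility from the 2nd way
print(utility(misses, 4, 8))  # 0  -> saturating utility beyond 4 ways
```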
Framework for UCP
Three components:
• Utility Monitors (UMON), one per core
• Partitioning Algorithm (PA)
• Replacement support to enforce partitions
[Figure: two cores, each with private I$ and D$ and its own UMON, share an L2 cache backed by main memory; the PA reads UMON1 and UMON2.]
Utility Monitors (UMON)
• For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD).
• UMON-global keeps one set of way-counters for all sets: hit counters in the ATD count hits per recency position.
• Because LRU is a stack algorithm, the hit counts give utility directly, e.g., hits(2 ways) = H0 + H1.
[Figure: ATD over sets A through H with hit counters H0 (MRU) through H15 (LRU).]
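Because LRU is a stack algorithm, the hits for any allocation follow directly from the counters; a minimal sketch with hypothetical counter values:

```python
# hits(n ways) = H0 + H1 + ... + H(n-1), read off the ATD hit counters.
def hits(counters, ways):
    return sum(counters[:ways])

H = [40, 25, 10, 5]   # hypothetical H0..H3 for a 4-way example
print(hits(H, 2))     # H0 + H1 = 65
```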
Utility Monitors (UMON): Overhead Reduction
• The extra ATD tags incur hardware and power overhead.
• Dynamic Set Sampling (DSS) reduces this overhead [Qureshi et al. ISCA'06]: 32 sets are sufficient based on Chebyshev's inequality.
• Sampling every 32nd set (simple static sampling) is used in the paper.
• Storage < 2KB per UMON (or 0.17% of the L2).
Partitioning Algorithm (PA)
• Evaluate all possible partitions and select the best. With a ways to core1 and (16 − a) ways to core2:
Hits_core1 = H0 + H1 + … + H(a−1)  (from UMON1)
Hits_core2 = H0 + H1 + … + H(16−a−1)  (from UMON2)
Select the a that maximizes (Hits_core1 + Hits_core2).
• Partitioning is done once every 5 million cycles.
• After each partitioning interval, the hit counters in all UMONs are halved to retain some past information.
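The exhaustive dual-core search can be sketched as follows; the counter values are chosen so the equake/vpr example's ideal 3/13 split emerges, and all names are mine:

```python
# Sketch of UCP's partitioning algorithm for two cores: try every way
# split and keep the one maximizing total hits predicted by the UMONs.
def best_partition(h1, h2, total_ways=16):
    best_a, best_hits = None, -1
    for a in range(1, total_ways):        # each core gets >= 1 way
        hits = sum(h1[:a]) + sum(h2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, best_hits

h_equake = [30] + [2] * 15      # saturates after the first way
h_vpr = [20] * 13 + [0, 0, 0]   # keeps benefiting up to 13 ways
print(best_partition(h_equake, h_vpr))  # (3, 294): 3 ways to equake
```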
Replacement Policy to Reach the Desired Partition
• Use way partitioning [Suh+ HPCA'02, Iyer ICS'04].
• Each line contains core-id bits. On a miss, count ways_occupied in the set by the miss-causing app.
• Binary decision for dual-core (in this paper):
if (ways_occupied < ways_given): the victim is the LRU line from the other app
else: the victim is the LRU line from the miss-causing app
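The binary victim decision can be sketched directly; the set representation is my own:

```python
# A set is recency-ordered (index 0 = MRU); entries are (core_id, tag).
def choose_victim(cache_set, miss_core, ways_given):
    ways_occupied = sum(1 for core, _ in cache_set if core == miss_core)
    # Under quota: evict from the other app. At/over quota: evict our own.
    victim_core = (1 - miss_core) if ways_occupied < ways_given else miss_core
    for idx in range(len(cache_set) - 1, -1, -1):  # scan from LRU end
        if cache_set[idx][0] == victim_core:
            return idx
    return len(cache_set) - 1  # fallback: plain LRU

s = [(0, 'A'), (1, 'B'), (0, 'C'), (1, 'D')]
print(choose_victim(s, 0, ways_given=3))  # 3: core0 under quota, so core1's LRU line goes
```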
UCP Performance
• Weighted speedup: UCP improves the average weighted speedup by 11% (dual-core).
• Throughput: UCP improves the average throughput by 17%.
Dynamic Insertion Policy

Conventional LRU
[Figure: a recency stack from MRU to LRU. A new block is inserted at MRU, so a block that is never reused occupies one cache block for a long time with no benefit while it drifts down to LRU.]
Slide Source: Yuejian Xie
LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
[Figure: incoming blocks are inserted at the LRU position instead of MRU. A useless block is evicted at the next eviction; a useful block is moved to the MRU position on a hit.]
LIP is not entirely new; Intel tried this in 1998 when designing "Timna" (which integrated the CPU and a graphics accelerator sharing the L2).
BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07]
• LIP may not age older lines, so BIP infrequently inserts lines in the MRU position.
• Let ε be the bimodal throttle parameter:
if ( rand() < ε )
    Insert at MRU position;  // as the LRU replacement policy would
else
    Insert at LRU position;
• A line is promoted to MRU if reused.
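A minimal sketch of BIP on a recency-ordered list (index 0 = MRU); ε = 1/32 is illustrative of a small throttle, not a value taken from the slide:

```python
import random

EPSILON = 1 / 32  # illustrative bimodal throttle

def bip_insert(stack, block, rng=random.random):
    if rng() < EPSILON:
        stack.insert(0, block)   # rare: insert at MRU (LRU-policy behavior)
    else:
        stack.append(block)      # common: insert at LRU, as in LIP

def on_hit(stack, block):
    stack.remove(block)
    stack.insert(0, block)       # promote to MRU on reuse

stack = ['A', 'B', 'C']
bip_insert(stack, 'X', rng=lambda: 0.9)  # forced LRU-side insertion
print(stack)  # ['A', 'B', 'C', 'X']
on_hit(stack, 'X')
print(stack)  # ['X', 'A', 'B', 'C']
```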
DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07]
• Two types of workloads: LRU-friendly or BIP-friendly.
• DIP can be implemented by:
1. Monitoring both policies (LRU and BIP)
2. Choosing the best-performing policy
3. Applying the best policy to the cache
• This needs a cost-effective implementation: "Set Dueling".
[Figure: DIP selects between LRU and BIP; BIP itself inserts like LRU with probability ε and like LIP with probability 1 − ε.]
Set Dueling for DIP [Qureshi et al. ISCA'07]
Divide the cache into three groups of sets:
• Dedicated LRU sets
• Dedicated BIP sets
• Follower sets (which follow the winner of LRU vs. BIP)
A single n-bit saturating counter monitors both policies: misses to the LRU sets increment it, misses to the BIP sets decrement it. The counter's MSB decides the policy for the follower sets: MSB = 0 → use LRU; MSB = 1 → use BIP. Monitoring, choosing, and applying are all done with a single counter.
Slide Source: Moin Qureshi
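Set dueling's single-counter mechanism can be sketched as follows; the constituency mapping (every 32nd set) and the counter width are my assumptions:

```python
N_BITS = 10
PSEL_MAX = (1 << N_BITS) - 1
psel = PSEL_MAX // 2             # start mid-range

def on_miss(set_index):
    """Only the dedicated sets update the policy-selection counter."""
    global psel
    if set_index % 32 == 0:      # dedicated LRU set: LRU just missed
        psel = min(psel + 1, PSEL_MAX)
    elif set_index % 32 == 1:    # dedicated BIP set: BIP just missed
        psel = max(psel - 1, 0)

def follower_policy():
    # MSB = 0 -> LRU, MSB = 1 -> BIP
    return "BIP" if (psel >> (N_BITS - 1)) & 1 else "LRU"

print(follower_policy())  # LRU (counter starts just below the MSB boundary)
on_miss(0)                # one more miss in a dedicated LRU set
print(follower_policy())  # BIP
```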
Promotion/Insertion Pseudo Partitioning

PIPP [Xie & Loh ISCA'09]
• What's PIPP? Promotion/Insertion Pseudo Partitioning, achieving both capacity management (as in UCP) and dead-time management (as in DIP).
• Eviction: the LRU block is the victim.
• Insertion: a new block is inserted its core's quota worth of positions away from LRU (insert position = target allocation, e.g., 3).
• Promotion: on a hit, a block moves toward MRU by only one position.
[Figure: recency stack from MRU to LRU showing the eviction point at LRU, an insertion at position 3 from the LRU end, and a single-step promotion on a hit.]
Slide Source: Yuejian Xie
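PIPP's insertion and single-step promotion can be sketched on a recency-ordered list (index 0 = MRU); the 8-way set and block names mirror the slide's example, but the exact off-by-one convention for the insert position is my assumption:

```python
WAYS = 8

def pipp_insert(stack, block, quota):
    # Insert the new block its quota's worth of positions from the LRU end.
    stack.insert(max(len(stack) - quota, 0), block)
    if len(stack) > WAYS:
        stack.pop()              # evict the LRU block

def pipp_promote(stack, block):
    i = stack.index(block)
    if i > 0:                    # on a hit, move toward MRU by one only
        stack[i], stack[i - 1] = stack[i - 1], stack[i]

s = ['1', 'A', '2', '3', '4', '5', 'B', 'C']   # MRU ... LRU
pipp_insert(s, 'D', quota=3)                   # Core1's quota = 3
print(s)  # ['1', 'A', '2', '3', '4', 'D', '5', 'B'] ('C' evicted)
pipp_promote(s, 'D')                           # hit on D: up by one
print(s)  # ['1', 'A', '2', '3', 'D', '4', '5', 'B']
```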
PIPP Example (Core0 quota: 5 blocks, Core1 quota: 3 blocks)
[Figure sequence over an 8-way set: a request from Core1 for block D is inserted 3 positions from the LRU end (Core1's quota = 3); requests from Core0 for blocks 6 and 7 are inserted 5 positions from the LRU end (Core0's quota = 5); a subsequent hit on block D promotes it by exactly one position.]
How PIPP Does Both Managements
Example quotas: Core0: 6, Core1: 4, Core2: 4, Core3: 2.
[Figure: cores with smaller quotas insert closer to the LRU position, so their blocks are evicted sooner unless reused (dead-time management), while cores with larger quotas keep blocks deeper in the stack (capacity management).]
Pseudo Partitioning Benefits
[Figure: under a strict partition, each core's blocks are confined to its own region (MRU0–LRU0 and MRU1–LRU1) and a new block can displace only its own core's blocks. Under PIPP's pseudo partition the boundary is soft: in the example, Core1 "stole" a line from Core0.]
Single Reuse Block
[Figure: recency stacks contrasting how a newly inserted block that is reused only once behaves; with PIPP's single-step promotion it stays near the LRU end and is evicted soon after its single reuse.]
Algorithm Comparison

Algorithm    | Capacity Management | Dead-time Management | Note
LRU          | no                  | no                   | Baseline, no explicit management
UCP          | yes                 | no                   | Strict partitioning
DIP / TADIP  | no                  | yes                  | Insert at LRU and promote to MRU on hit
PIPP         | yes                 | yes                  | Pseudo-partitioning and incremental promotion