ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Lecture 6 Fair Caching Mechanisms for CMP
Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04]
[Figure sequence: two processor cores, each with a private L1 cache, sharing one L2 cache. Thread t1 arrives first and fills the shared L2; when thread t2 starts, t1's working set has crowded out the space t2 needs.]
t2's throughput is significantly reduced due to unfair cache sharing.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Shared L2 Cache Space Contention
[Figure: bar charts of gzip's normalized cache misses per instruction (scale 0 to 10) and gzip's normalized IPC (scale 0 to 1.2) for gzip(alone), gzip+applu, gzip+apsi, gzip+art, and gzip+swim.]
Impact of Unfair Cache Sharing
• Uniprocessor scheduling: priorities are enforced through time slices.
• 2-core CMP scheduling: gzip will get more time slices than others if gzip is set to run at higher priority, yet it could still run slower than the others due to unfair cache sharing (priority inversion).
• It could further slow down the other processes (starvation).
• Thus the overall throughput is reduced; the desired behavior is uniform slowdown instead.
[Figure: time-slice diagrams for threads t1-t4 under uniprocessor scheduling and under 2-core CMP scheduling on processors P1 and P2.]
Stack Distance Profiling Algorithm [Qureshi+, MICRO-39]
A hit counter (CTR) is kept for each recency position from MRU (CTR Pos 0) to LRU (CTR Pos 3); a hit at recency position i increments CTR Pos i.
Example HIT counter values: CTR Pos 0 = 30, CTR Pos 1 = 20, CTR Pos 2 = 15, CTR Pos 3 = 10; Misses = 25.
Stack Distance Profiling
• A counter for each cache way; C>A is the counter for misses.
• Shows the reuse frequency for each way (recency position) in a cache.
• Can be used to predict the misses for any associativity smaller than A.
– Misses for a 2-way cache for gzip = C>A + Σ Ci for i = 3 to 8 (in an 8-way cache).
• art does not need all the space owing to its likely poor temporal locality.
• If the space given to art is halved and handed to gzip, what happens?
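A minimal sketch of this prediction, assuming an 8-way cache; the counter values and function name are illustrative, not from the paper:

```python
# Sketch: predict misses for a smaller associativity from stack-distance
# hit counters C1..C8 (index 0 = MRU position) plus the miss counter C>A.
def predicted_misses(hit_counters, miss_counter, target_ways):
    # Hits at recency positions deeper than target_ways would have missed.
    return miss_counter + sum(hit_counters[target_ways:])

C = [30, 20, 15, 10, 8, 6, 4, 2]  # hypothetical per-position hit counts
print(predicted_misses(C, 25, 2))  # 25 + (15+10+8+6+4+2) = 70
```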
Fairness Metrics [Kim et al. PACT'04]
• Uniform slowdown is the fairness goal: each thread should be slowed by the same factor when sharing the cache,

T_i^shared / T_i^alone = T_j^shared / T_j^alone

where T_i^alone is the execution time of thread t_i when it runs alone and T_i^shared is its execution time when it shares the cache with others.
• We want to minimize M0_ij = |X_i − X_j| with X_i = T_i^shared / T_i^alone; ideally M0_ij = 0.
• Execution-time ratios are hard to measure online, so miss-based proxies try to equalize the ratio of miss increase of each thread:

M1_ij = |X_i − X_j|, where X_i = Miss_i^shared / Miss_i^alone
M3_ij = |X_i − X_j|, where X_i = MissRate_i^shared / MissRate_i^alone
M5_ij = |X_i − X_j|, where X_i = MissRate_i^shared − MissRate_i^alone
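These metrics share one pairwise form and differ only in how X is defined; a minimal sketch with hypothetical per-thread values:

```python
# Sketch of the pairwise fairness metric M_ij = sum |X_i - X_j|; the
# definition of X selects which metric (M0, M1, M3, M5) is computed.
def fairness_metric(xs):
    return sum(abs(xs[i] - xs[j])
               for i in range(len(xs))
               for j in range(i + 1, len(xs)))

m0 = fairness_metric([1.5, 1.1])  # X_i = T_shared / T_alone
m1 = fairness_metric([2.0, 1.2])  # X_i = Miss_shared / Miss_alone
print(round(m0, 3), round(m1, 3))  # 0.4 0.8
```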
Partitionable Cache Hardware
• Modified LRU cache replacement policy [G. E. Suh et al., HPCA 2002].
• A per-thread counter tracks the space each process currently occupies (Current Partition) and compares it against a Target Partition.
• Example: the Current Partition is P1: 448B, P2: 576B while the Target Partition is P1: 384B, P2: 640B. On a P2 miss, the policy evicts P1's LRU line (LRU*) rather than P2's, moving the Current Partition to P1: 384B, P2: 640B, which matches the target.
• Partition granularity could be as coarse as one entire cache way.
Dynamic Fair Caching Algorithm
Example: optimizing the M3 metric. Three sets of counters are maintained per repartitioning interval:
• MissRate alone (P1, P2): miss rates of each process running alone (from stack distance profiling).
• MissRate shared (P1, P2): dynamic miss rates measured while running with the shared cache.
• Target Partition (P1, P2): the target partition sizes.
A repartitioning interval of 10K cache accesses was found to be the best.
Dynamic Fair Caching Algorithm: Walkthrough
• 1st interval: MissRate alone is P1: 20%, P2: 5%. With the initial Target Partition of P1: 256KB, P2: 256KB, the measured MissRate shared is P1: 20%, P2: 15%.
• Repartition! Evaluate M3: P1: 20%/20% = 1.0 vs. P2: 15%/5% = 3.0. P2 suffers the larger relative miss increase, so capacity shifts toward it: the Target Partition becomes P1: 192KB, P2: 320KB (partition granularity: 64KB).
• 2nd interval: with the new partition, MissRate shared becomes P1: 20%, P2: 10%.
• Repartition! Evaluate M3: P1: 20%/20% vs. P2: 10%/5%. P2 is still slowed more, so the Target Partition becomes P1: 128KB, P2: 384KB.
• 3rd interval: MissRate shared becomes P1: 25%, P2: 9%.
• Repartition! Roll back if the benefit is too small: with Δ = MRold − MRnew, do a rollback for P2 if Δ < Trollback. Here P2 improved only from 10% to 9% while P1 got worse, so the repartitioning is rolled back to P1: 192KB, P2: 320KB. The best Trollback threshold was found to be 20%.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Generic Repartitioning Algorithm
• Pick the largest and smallest as a pair for repartitioning.
• Repeat for all candidate processes.
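One repartitioning step of the dynamic fair caching scheme can be sketched as follows; the function names, and the interpretation of Δ as an absolute miss-rate difference, are my assumptions:

```python
GRANULARITY_KB = 64    # partition granularity from the slides
T_ROLLBACK = 0.20      # best rollback threshold reported

def repartition(target, miss_alone, miss_shared):
    """Shift one granule from the least-slowed to the most-slowed thread (M3)."""
    x = {p: miss_shared[p] / miss_alone[p] for p in target}
    loser = max(x, key=x.get)    # largest relative miss increase
    winner = min(x, key=x.get)   # smallest relative miss increase
    new = dict(target)
    new[winner] -= GRANULARITY_KB
    new[loser] += GRANULARITY_KB
    return new

def should_rollback(mr_old, mr_new):
    # Roll back if the beneficiary's miss-rate gain is below the threshold.
    return (mr_old - mr_new) < T_ROLLBACK

target = {"P1": 256, "P2": 256}
target = repartition(target, {"P1": 0.20, "P2": 0.05},
                     {"P1": 0.20, "P2": 0.15})
print(target)                       # {'P1': 192, 'P2': 320}
print(should_rollback(0.10, 0.09))  # True: gain too small, undo the move
```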
Utility-Based Cache Partitioning (UCP)

Running Processes on a Dual-Core [Qureshi & Patt, MICRO-39]
• LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr.
• UTIL: how much you use (in a set) is how much you will get; ideally, 3 ways to equake and 13 to vpr.
[Figure: miss curves for equake and vpr as a function of the number of ways given (1 to 16).]
Defining Utility
Utility U(a,b) = Misses with a ways − Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications.]
Slide courtesy: Moin Qureshi, MICRO-39
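The utility definition maps directly to code; the miss curve below is hypothetical:

```python
# Sketch: utility from a miss curve, where misses[w] = misses with w ways.
def utility(misses, a, b):
    """U(a,b) = misses with a ways - misses with b ways (a < b)."""
    return misses[a] - misses[b]

misses = [None, 50, 30, 25, 24, 24, 24, 24, 24]  # index 0 unused
print(utility(misses, 1, 2))  # 20 -> high utility from the 2nd way
print(utility(misses, 4, 8))  # 0  -> saturating utility beyond 4 ways
```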
Framework for UCP
Three components:
• Utility Monitors (UMON), one per core
• Partitioning Algorithm (PA)
• Replacement support to enforce partitions
[Figure: two cores, each with private I$ and D$ and its own UMON, share an L2 cache backed by main memory; the PA reads UMON1 and UMON2.]
Utility Monitors (UMON)
• For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD).
• UMON-global keeps one set of way-counters for all sets: hit counters in the ATD count hits per recency position.
• Because LRU is a stack algorithm, the hit counts give utility directly, e.g., hits(2 ways) = H0 + H1.
[Figure: ATD over sets A through H with hit counters H0 (MRU) through H15 (LRU).]
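Because LRU is a stack algorithm, the hits for any allocation follow directly from the counters; a minimal sketch with hypothetical counter values:

```python
# hits(n ways) = H0 + H1 + ... + H(n-1), read off the ATD hit counters.
def hits(counters, ways):
    return sum(counters[:ways])

H = [40, 25, 10, 5]   # hypothetical H0..H3 for a 4-way example
print(hits(H, 2))     # H0 + H1 = 65
```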
Utility Monitors (UMON): Overhead Reduction
• The extra ATD tags incur hardware and power overhead.
• Dynamic Set Sampling (DSS) reduces this overhead [Qureshi et al. ISCA'06]: 32 sets are sufficient based on Chebyshev's inequality.
• Sampling every 32nd set (simple static sampling) is used in the paper.
• Storage < 2KB per UMON (or 0.17% of the L2).
Partitioning Algorithm (PA)
• Evaluate all possible partitions and select the best. With a ways to core1 and (16 − a) ways to core2:
Hits_core1 = H0 + H1 + … + H(a−1)  (from UMON1)
Hits_core2 = H0 + H1 + … + H(16−a−1)  (from UMON2)
Select the a that maximizes (Hits_core1 + Hits_core2).
• Partitioning is done once every 5 million cycles.
• After each partitioning interval, the hit counters in all UMONs are halved to retain some past information.
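The exhaustive dual-core search can be sketched as follows; the counter values are chosen so the equake/vpr example's ideal 3/13 split emerges, and all names are mine:

```python
# Sketch of UCP's partitioning algorithm for two cores: try every way
# split and keep the one maximizing total hits predicted by the UMONs.
def best_partition(h1, h2, total_ways=16):
    best_a, best_hits = None, -1
    for a in range(1, total_ways):        # each core gets >= 1 way
        hits = sum(h1[:a]) + sum(h2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, best_hits

h_equake = [30] + [2] * 15      # saturates after the first way
h_vpr = [20] * 13 + [0, 0, 0]   # keeps benefiting up to 13 ways
print(best_partition(h_equake, h_vpr))  # (3, 294): 3 ways to equake
```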
Replacement Policy to Reach the Desired Partition
• Use way partitioning [Suh+ HPCA'02, Iyer ICS'04].
• Each line contains core-id bits. On a miss, count ways_occupied in the set by the miss-causing app.
• Binary decision for dual-core (in this paper):
if (ways_occupied < ways_given): the victim is the LRU line from the other app
else: the victim is the LRU line from the miss-causing app
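The binary victim decision can be sketched directly; the set representation is my own:

```python
# A set is recency-ordered (index 0 = MRU); entries are (core_id, tag).
def choose_victim(cache_set, miss_core, ways_given):
    ways_occupied = sum(1 for core, _ in cache_set if core == miss_core)
    # Under quota: evict from the other app. At/over quota: evict our own.
    victim_core = (1 - miss_core) if ways_occupied < ways_given else miss_core
    for idx in range(len(cache_set) - 1, -1, -1):  # scan from LRU end
        if cache_set[idx][0] == victim_core:
            return idx
    return len(cache_set) - 1  # fallback: plain LRU

s = [(0, 'A'), (1, 'B'), (0, 'C'), (1, 'D')]
print(choose_victim(s, 0, ways_given=3))  # 3: core0 under quota, so core1's LRU line goes
```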
UCP Performance
• Weighted speedup: UCP improves the average weighted speedup by 11% (dual-core).
• Throughput: UCP improves the average throughput by 17%.
Dynamic Insertion Policy

Conventional LRU
[Figure: a recency stack from MRU to LRU. A new block is inserted at MRU, so a block that is never reused occupies one cache block for a long time with no benefit while it drifts down to LRU.]
Slide Source: Yuejian Xie
LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
[Figure: incoming blocks are inserted at the LRU position instead of MRU. A useless block is evicted at the next eviction; a useful block is moved to the MRU position on a hit.]
LIP is not entirely new; Intel tried this in 1998 when designing "Timna" (which integrated the CPU and a graphics accelerator sharing the L2).
BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07]
• LIP may not age older lines, so BIP infrequently inserts lines in the MRU position.
• Let ε be the bimodal throttle parameter:
if ( rand() < ε )
    Insert at MRU position;  // as the LRU replacement policy would
else
    Insert at LRU position;
• A line is promoted to MRU if reused.
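A minimal sketch of BIP on a recency-ordered list (index 0 = MRU); ε = 1/32 is illustrative of a small throttle, not a value taken from the slide:

```python
import random

EPSILON = 1 / 32  # illustrative bimodal throttle

def bip_insert(stack, block, rng=random.random):
    if rng() < EPSILON:
        stack.insert(0, block)   # rare: insert at MRU (LRU-policy behavior)
    else:
        stack.append(block)      # common: insert at LRU, as in LIP

def on_hit(stack, block):
    stack.remove(block)
    stack.insert(0, block)       # promote to MRU on reuse

stack = ['A', 'B', 'C']
bip_insert(stack, 'X', rng=lambda: 0.9)  # forced LRU-side insertion
print(stack)  # ['A', 'B', 'C', 'X']
on_hit(stack, 'X')
print(stack)  # ['X', 'A', 'B', 'C']
```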
DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07]
• Two types of workloads: LRU-friendly or BIP-friendly.
• DIP can be implemented by:
1. Monitoring both policies (LRU and BIP)
2. Choosing the best-performing policy
3. Applying the best policy to the cache
• This needs a cost-effective implementation: "Set Dueling".
[Figure: DIP selects between LRU and BIP; BIP itself inserts like LRU with probability ε and like LIP with probability 1 − ε.]
Set Dueling for DIP [Qureshi et al. ISCA'07]
Divide the cache into three groups of sets:
• Dedicated LRU sets
• Dedicated BIP sets
• Follower sets (which follow the winner of LRU vs. BIP)
A single n-bit saturating counter monitors both policies: misses to the LRU sets increment it, misses to the BIP sets decrement it. The counter's MSB decides the policy for the follower sets: MSB = 0 → use LRU; MSB = 1 → use BIP. Monitoring, choosing, and applying are all done with a single counter.
Slide Source: Moin Qureshi
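Set dueling's single-counter mechanism can be sketched as follows; the constituency mapping (every 32nd set) and the counter width are my assumptions:

```python
N_BITS = 10
PSEL_MAX = (1 << N_BITS) - 1
psel = PSEL_MAX // 2             # start mid-range

def on_miss(set_index):
    """Only the dedicated sets update the policy-selection counter."""
    global psel
    if set_index % 32 == 0:      # dedicated LRU set: LRU just missed
        psel = min(psel + 1, PSEL_MAX)
    elif set_index % 32 == 1:    # dedicated BIP set: BIP just missed
        psel = max(psel - 1, 0)

def follower_policy():
    # MSB = 0 -> LRU, MSB = 1 -> BIP
    return "BIP" if (psel >> (N_BITS - 1)) & 1 else "LRU"

print(follower_policy())  # LRU (counter starts just below the MSB boundary)
on_miss(0)                # one more miss in a dedicated LRU set
print(follower_policy())  # BIP
```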
Promotion/Insertion Pseudo Partitioning

PIPP [Xie & Loh ISCA'09]
• What's PIPP? Promotion/Insertion Pseudo Partitioning, achieving both capacity management (as in UCP) and dead-time management (as in DIP).
• Eviction: the LRU block is the victim.
• Insertion: a new block is inserted its core's quota worth of positions away from LRU (insert position = target allocation, e.g., 3).
• Promotion: on a hit, a block moves toward MRU by only one position.
[Figure: recency stack from MRU to LRU showing the eviction point at LRU, an insertion at position 3 from the LRU end, and a single-step promotion on a hit.]
Slide Source: Yuejian Xie
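PIPP's insertion and single-step promotion can be sketched on a recency-ordered list (index 0 = MRU); the 8-way set and block names mirror the slide's example, but the exact off-by-one convention for the insert position is my assumption:

```python
WAYS = 8

def pipp_insert(stack, block, quota):
    # Insert the new block its quota's worth of positions from the LRU end.
    stack.insert(max(len(stack) - quota, 0), block)
    if len(stack) > WAYS:
        stack.pop()              # evict the LRU block

def pipp_promote(stack, block):
    i = stack.index(block)
    if i > 0:                    # on a hit, move toward MRU by one only
        stack[i], stack[i - 1] = stack[i - 1], stack[i]

s = ['1', 'A', '2', '3', '4', '5', 'B', 'C']   # MRU ... LRU
pipp_insert(s, 'D', quota=3)                   # Core1's quota = 3
print(s)  # ['1', 'A', '2', '3', '4', 'D', '5', 'B'] ('C' evicted)
pipp_promote(s, 'D')                           # hit on D: up by one
print(s)  # ['1', 'A', '2', '3', 'D', '4', '5', 'B']
```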
PIPP Example (Core0 quota: 5 blocks, Core1 quota: 3 blocks)
[Figure sequence over an 8-way set: a request from Core1 for block D is inserted 3 positions from the LRU end (Core1's quota = 3); requests from Core0 for blocks 6 and 7 are inserted 5 positions from the LRU end (Core0's quota = 5); a subsequent hit on block D promotes it by exactly one position.]
How PIPP Does Both Managements
Example quotas: Core0: 6, Core1: 4, Core2: 4, Core3: 2.
[Figure: cores with smaller quotas insert closer to the LRU position, so their blocks are evicted sooner unless reused (dead-time management), while cores with larger quotas keep blocks deeper in the stack (capacity management).]
Pseudo Partitioning Benefits
[Figure: under a strict partition, each core's blocks are confined to its own region (MRU0–LRU0 and MRU1–LRU1) and a new block can displace only its own core's blocks. Under PIPP's pseudo partition the boundary is soft: in the example, Core1 "stole" a line from Core0.]
Single Reuse Block
[Figure: recency stacks contrasting how a newly inserted block that is reused only once behaves; with PIPP's single-step promotion it stays near the LRU end and is evicted soon after its single reuse.]
Algorithm Comparison

Algorithm    | Capacity Management | Dead-time Management | Note
LRU          | no                  | no                   | Baseline, no explicit management
UCP          | yes                 | no                   | Strict partitioning
DIP / TADIP  | no                  | yes                  | Insert at LRU and promote to MRU on hit
PIPP         | yes                 | yes                  | Pseudo-partitioning and incremental promotion