Imbalanced Cache Partitioning for Balanced Data-Parallel Programs
Abhisek Pan & Vijay S. Pai Electrical and Computer Engineering Purdue University MICRO-46, 2013
Motivation
• Last-level cache partitioning has been heavily studied for multiprogramming workloads
• Multithreading ≠ multiprogramming
  ▫ All threads have to progress equally
  ▫ Pure throughput maximization is not enough
• Data-parallel threads are similar to each other in their data access patterns
• However, equal allocation => suboptimal cache utilization
Balanced threads need highly imbalanced partitions
Contributions
• Shared LLC partitioning for balanced data-parallel applications
• Increasing allocation for one thread at a time improves utilization
• Prioritizing each thread in turn ensures balanced progress
• 17% drop in miss rate, 8% drop in execution time on average for 4-core 8MB cache
• Negligible overheads
Outline
• Motivation
• Contributions
• Background
• Memory Reuse Behavior of Threads
• Proposed Scheme
• Evaluation
• Overheads & Limitations
• Conclusion
Way-partitioning
• N-way set-associative cache => each set has N ways or blocks
• Unpartitioned cache
  ▫ Least recently used entry among all ways replaced on a miss
  ▫ Thread-agnostic LRU
• Way-partitioning
  ▫ Each way is owned by one core at a time
  ▫ On a miss, a core replaces the LRU entry among the ways owned by it
  ▫ No restriction on access, only on replacement
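The replacement rule above can be illustrated with a minimal sketch of a single cache set. The class name `WayPartitionedSet`, its fields, and the timestamp-based LRU bookkeeping are illustrative assumptions, not the hardware mechanism from the talk; the sketch also assumes every core owns at least one way.

```python
class WayPartitionedSet:
    """One cache set with way-partitioned LRU replacement (illustrative only)."""

    def __init__(self, owners):
        # owners[w] is the core that currently owns way w.
        self.owners = list(owners)
        self.tags = [None] * len(self.owners)    # block tag held in each way
        self.last_use = [-1] * len(self.owners)  # time of last access per way
        self.clock = 0

    def access(self, core, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        self.clock += 1
        # Lookup is unrestricted: a hit may occur in a way owned by any core.
        for way, held in enumerate(self.tags):
            if held == tag:
                self.last_use[way] = self.clock
                return True
        # Replacement is restricted: evict the LRU block among the ways
        # owned by the requesting core (assumed to own at least one way).
        own_ways = [w for w, owner in enumerate(self.owners) if owner == core]
        victim = min(own_ways, key=lambda w: self.last_use[w])
        self.tags[victim] = tag
        self.last_use[victim] = self.clock
        return False


if __name__ == "__main__":
    s = WayPartitionedSet(owners=[0, 0, 0, 1])
    print(s.access(1, "x"))  # miss: filled into core 1's only way
    print(s.access(0, "x"))  # hit: core 0 may read a block in core 1's way
```

Because only replacement is restricted, a block brought in by one core stays visible to every other core until its owner evicts it.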
Per-thread Miss Rate Curves
• Miss rate vs. ways in a single set
• Each thread considered in isolation
[Figure: miss rate (0.0 to 1.0) vs. ways / reuse distance (0 to 18) for Threads 0 to 3, with working set 1 and working set 2 marked]
Inefficient Allocation!
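One common way to obtain a miss rate curve like the one on this slide is from LRU stack (reuse) distances: with w ways, an access hits only if fewer than w distinct blocks were touched since the previous access to the same block. The sketch below assumes a single set considered in isolation; the function names and the toy trace are hypothetical, not data from the talk.

```python
def stack_distances(trace):
    """LRU stack distance of each access (None for a first-time access)."""
    stack = []  # most recently used block is at the end
    dists = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Number of distinct blocks touched since the last use of `block`.
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            dists.append(None)
        stack.append(block)
    return dists


def miss_rate_curve(trace, max_ways):
    """Miss rate of an LRU set with w ways, for w = 0..max_ways."""
    dists = stack_distances(trace)
    curve = []
    for ways in range(max_ways + 1):
        # An access hits with `ways` ways only if its stack distance < ways.
        misses = sum(1 for d in dists if d is None or d >= ways)
        curve.append(misses / len(dists))
    return curve


if __name__ == "__main__":
    # Toy trace cycling over three blocks: the curve stays flat until the
    # allocation reaches the working set of three ways, then drops.
    print(miss_rate_curve(["a", "b", "c", "a", "b", "c"], max_ways=4))
```

The flat region followed by a drop is the step that marks a working set: extra ways below that point buy nothing.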
Symmetric Memory Access
• Miss-curves symmetric across threads
• Seen for all benchmarks & cache sizes
[Figure: three panels of per-thread miss rate (0.0 to 1.0) vs. ways / reuse distance for Threads 0 to 3: Art, 2^12 sets; Blackscholes, 2^9 sets; Fluidanimate, 2^14 sets]
Utilization through Imbalance
• 32-way cache, 4 threads: default allocation = 8 ways / thread
• Prioritize one thread at a time
• Vary preferred thread identity
[Figure: three miss rate vs. ways sketches (0 to 32 ways, default allocation of 8 ways marked), labeled "WS too small", "Improvement opportunity", and "WS too large", with "-" and "+" markers]
Imbalance in partitions benefits the preferred thread
High Imbalance & Unpreferred threads
• Each thread switches between preferred and un-preferred
• Unpreferred thread data remains in preferred partition
• Continues to benefit un-preferred thread even as its partition shrinks
• Imbalance magnifies benefits by reducing pressure on preferred partition
Large preferred partition benefits unpreferred threads too
Proposed Strategy
• Default allocation is inefficient
• Allocate extra ways to a single thread by equally penalizing all other threads
• Select the preferred thread in round-robin manner
  ▫ Ensure balanced progress
• Allocation changes at pre-set execution intervals
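The strategy above can be sketched as follows. The function names, the `penalty` parameter (ways taken from each unpreferred thread), and the use of an interval counter to rotate the preferred thread are assumptions made for illustration; the sketch also assumes the ways divide evenly among threads.

```python
def imbalanced_allocation(total_ways, num_threads, preferred, penalty):
    """Ways per thread when `preferred` gains `penalty` ways from every other thread."""
    base = total_ways // num_threads  # default equal share
    if not 0 <= penalty < base:
        raise ValueError("each unpreferred thread must keep at least one way")
    alloc = [base - penalty] * num_threads
    alloc[preferred] = base + penalty * (num_threads - 1)
    return alloc


def round_robin_schedule(total_ways, num_threads, penalty, num_intervals):
    """Yield one allocation per execution interval, rotating the preferred thread."""
    for interval in range(num_intervals):
        preferred = interval % num_threads
        yield imbalanced_allocation(total_ways, num_threads, preferred, penalty)


if __name__ == "__main__":
    for alloc in round_robin_schedule(32, 4, penalty=7, num_intervals=4):
        print(alloc)
```

Over any window of `num_threads` intervals every thread is preferred exactly once, which is what keeps progress balanced.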
Two-Stage Partitioning
• Evaluation Stage
  ▫ Triggers at the start of a new program phase
  ▫ Divide the cache sets into equal-sized segments
  ▫ Each segment is partitioned into a different level of imbalance
  ▫ 32-way cache shared among 4 cores: configurations from 8-8-8-8 to 29-1-1-1
  ▫ Each core is prioritized in turn
  ▫ Configuration with least number of misses chosen
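As a sketch of the evaluation stage, the snippet below enumerates the imbalance levels between 8-8-8-8 and 29-1-1-1 for a 32-way cache shared by 4 cores and picks the level whose segment saw the fewest misses. The function names and the miss counts in the usage example are hypothetical; real counts would come from per-segment hardware counters.

```python
def imbalance_levels(total_ways=32, num_threads=4):
    """(preferred_ways, unpreferred_ways) pairs, from equal to maximally imbalanced."""
    base = total_ways // num_threads
    levels = []
    for penalty in range(base):  # ways taken from each unpreferred thread
        preferred = base + penalty * (num_threads - 1)
        levels.append((preferred, base - penalty))
    return levels


def pick_configuration(misses_per_segment):
    """Return the configuration whose (equal-sized) segment had the fewest misses."""
    return min(misses_per_segment, key=misses_per_segment.get)


if __name__ == "__main__":
    levels = imbalance_levels()
    print(levels)  # (8, 8), (11, 7), ..., (29, 1)
    # Hypothetical miss counts, one per segment, for illustration only.
    observed = dict(zip(levels, [900, 870, 860, 820, 700, 650, 400, 520]))
    print(pick_configuration(observed))
```

Since the segments are equal in size, raw miss counts can be compared directly without normalization.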
Evaluation Stage Cache
[Figure: cache sets grouped into segments 1 to 8 (vertical axis) against ways (horizontal axis), showing the ways assigned to Threads 1 to 4 in each segment]
• Each segment has multiple sets
• Each thread becomes the preferred thread in turn
Capture effects of imbalance on preferred and unpreferred threads
Considering Unpartitioned Cache
• An unpartitioned (thread-agnostic LRU) segment is included in the evaluation
• It replaces a low-imbalance configuration
• Benefits of partitioning are obtained through high levels of imbalance
Unpartitioned Segment
[Figure: segments 1 to 8 against ways for Threads 1 to 4, with one segment marked "No partitions"]
Stable Stage
• Maintain the chosen configuration till the next program phase change
• Choose preferred thread in round-robin manner
• Basic-block vector tracking used to identify changes in program phase (based on previous work)
Evaluation Framework
• Simulator: Simics-GEMS
• Target: 4-core CMP with 32-way shared L2 cache and 2-way private L1 caches
  ▫ 1 thread per core, 64-byte line size, LRU replacement
• Workload:
  ▫ 9 data-parallel workloads
  ▫ Mix of PARSEC (pthread build) and SPEC OMP suites
  ▫ PARSEC: Blackscholes, Canneal, Fluidanimate, Streamcluster, Swaptions
  ▫ SPEC OMP: Art, Equake, Swim, Wupwise
Baselines
• Unpartitioned cache (thread-agnostic LRU)
• Statically equi-partitioned cache
• A CPI-based adaptive partitioning scheme (Muralidhara et al., IPDPS 2010)
  ▫ Starts with equal partition
  ▫ Proportional partitioning (ways proportional to CPI)
  ▫ Store <ways, CPI> to build a runtime model to predict CPI variations with change in allocation
  ▫ Accelerate critical thread
Misses vs size
[Figure: misses (log scale, 1.E+03 to 1.E+09) vs. set bits (6 to 18), equivalently cache size from 128K to 512M bytes, for Art, Blackscholes, Canneal, Equake, Fluidanimate, Streamcluster, Swaptions, Swim, and Wupwise]
4-core 32-way cache with equal partitions
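The plot's two horizontal-axis labelings, set bits and cache size in bytes, describe the same sweep. A short sketch of the arithmetic, using the 32 ways and 64-byte lines from the evaluation framework (the function name `cache_bytes` is just illustrative):

```python
WAYS = 32        # associativity of the shared cache
LINE_BYTES = 64  # cache line size


def cache_bytes(set_bits, ways=WAYS, line_bytes=LINE_BYTES):
    """Capacity in bytes of a cache with 2**set_bits sets."""
    return (1 << set_bits) * ways * line_bytes


if __name__ == "__main__":
    for bits in (6, 12, 18):
        print(bits, cache_bytes(bits) // 1024, "KiB")
```

With these parameters 6 set bits give 128 KB, 12 set bits give 8 MB, and 18 set bits give 512 MB, matching the two ends of the sweep.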
Results
• Benefits of partitioning strongly tied to cache size
• Partitioning beneficial only when per-thread working set is between the default allocation and the cache capacity
• Proposed method outperforms the baselines where there is potential for benefit
Comparison with Unpartitioned: 8 MB cache, 4 cores, 32 ways
[Figure: two bar charts (Misses, Execution Time; vertical axis 0 to 1.2) comparing STATIC-EQ, CPI, and IMB-RR across Swaptions, Blackscholes, Art, Streamcluster, Canneal, Fluidanimate, Equake, Swim, and Wupwise]
Comparison with Unpartitioned: 32 MB cache, 4 cores, 32 ways
[Figure: two bar charts (Misses, Execution Time; vertical axis 0 to 1.2) comparing STATIC-EQ, CPI, and IMB-RR across Swaptions, Blackscholes, Art, Streamcluster, Canneal, Fluidanimate, Equake, Swim, and Wupwise]
[Chart annotations: 30, 15, 1.6]
Comparison with Unpartitioned: 128 MB cache, 4 cores, 32 ways
[Figure: two bar charts (Misses, Execution Time; vertical axis 0 to 1.2) comparing STATIC-EQ, CPI, and IMB-RR across Swaptions, Blackscholes, Art, Streamcluster, Canneal, Fluidanimate, Equake, Swim, and Wupwise]
Across the Board
• Outperforms the CPI-based scheme in most cases where there is potential for benefit
  ▫ Proportional partitioning generates data points near the default allocation
  ▫ From these starting points the search fails to find the high-utility (high-imbalance) configurations
• No partitioning is best in some cases (Equake)
  ▫ Constructive interference
  ▫ Proposed scheme chooses global LRU appropriately
  ▫ Worst-case 5% increase in time due to evaluation
Overheads
• Space overhead negligible
  ▫ Way partitioning for each segment
• Program phase detection overhead
  ▫ Basic block vector tracking
• For small cache sizes, evaluation stage can increase execution time
  ▫ <1% on average, 5% maximum
Limitations
• Scalability
  ▫ Fine-grained barriers would mean smaller intervals
• Limited exploration of solution space
  ▫ One preferred thread at a time
  ▫ The benefits of high imbalance make the scheme practical
Conclusion
• Simple runtime partitioning for balanced data-parallel programs
• Effective cache utilization and balanced progress achieved through
  A. High imbalance in partitions, and
  B. Prioritizing each thread in turn
• High imbalance allows un-preferred threads to benefit from the large preferred partition
Thank You!
Questions…
Injecting Extra Imbalance
• Over-allocation in preferred thread protects long distance accesses of unpreferred thread
[Figure: four reuse-distance histograms (frequency vs. ways / SSRD, 0 to 16) marking hits, misses, and extra ways for the cases T1 (+) / T2 (-) and T1 (++) / T2 (--). Annotations: "Long RD accesses not served by shrinking partition" and "Long RD accesses are protected in preferred partition"]
Effect of Over-allocation
• Benefits to preferred thread saturate at 14 ways
• Benefits to un-preferred thread increase as allocation falls
• Hits for un-preferred thread are in preferred thread partition
[Figure: two stacked charts of references (×10^7) vs. allocation. Thread in preferred state: allocation from 8 to 29 ways. Thread in un-preferred state: allocation from 8 down to 1 way. Categories: Miss, Private Local (Self) Hit, Private Foreign Hit, Shared Local (Self) Hit, Shared Foreign Hit]
Adapting to Phase Changes
• Changes in program phase need to be identified to trigger evaluation
• Per-thread binary basic block vectors (BBVs) are used to identify the basic blocks touched in each interval
• The Hamming distance between the BBVs of the current and previous intervals is used to identify phase changes
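A minimal sketch of this check for one thread, with each binary BBV stored as a Python integer bitset. The vector width `num_bits`, the modulo mapping of basic block IDs to bit positions, and the `threshold` value are assumptions for illustration; the slides do not give these details or say how per-thread decisions are combined.

```python
def make_bbv(touched_blocks, num_bits=64):
    """Binary basic block vector: bit i is set if a block mapping to i was touched."""
    bbv = 0
    for block_id in touched_blocks:
        bbv |= 1 << (block_id % num_bits)  # assumed mapping of blocks to bits
    return bbv


def hamming_distance(bbv_a, bbv_b):
    """Number of bit positions in which the two vectors differ."""
    return bin(bbv_a ^ bbv_b).count("1")


def phase_changed(prev_bbv, curr_bbv, threshold):
    """Flag a phase change when the BBVs differ in more than `threshold` bits."""
    return hamming_distance(prev_bbv, curr_bbv) > threshold


if __name__ == "__main__":
    last = make_bbv([1, 2, 3])
    same_phase = make_bbv([1, 2, 3, 4])
    new_phase = make_bbv([10, 11, 12, 13])
    print(phase_changed(last, same_phase, threshold=2))  # small difference
    print(phase_changed(last, new_phase, threshold=2))   # disjoint block sets
```

A flagged change would trigger a new evaluation stage; otherwise the stable stage keeps the chosen configuration.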
Considering Unpartitioned Cache
• Time spent in various imbalance configurations for runs showing benefits of partitioning
[Figure: bar chart of time in percent (axis 0 to 60) per allocation: 8-8-8-8, 11-7-7-7, 14-6-6-6, 17-5-5-5, 20-4-4-4, 23-3-3-3, 26-2-2-2, 29-1-1-1. Bar labels as extracted: 1.1, 3, 0.8, 7.6, 3.8, 6.5, 53.9, 23.41]
Round-robin vs. Critical-thread
• Prioritize the critical thread instead of using round robin
• No significant difference: accelerating the critical thread has the same effect as giving each thread a fair share
[Figure: bar chart of normalized execution time and normalized misses (vertical axis 0 to 1.2) for Art, Blackscholes, Canneal, Equake, Fluidanimate, Streamcluster, Swaptions, Swim, and Wupwise]
Performance of critical-thread normalized to round-robin