
Imbalanced Cache Partitioning for Balanced Data-Parallel Programs

Abhisek Pan & Vijay S. Pai Electrical and Computer Engineering Purdue University MICRO-46, 2013

Motivation
• Last-level cache partitioning has been heavily studied for multiprogramming workloads
• Multithreading ≠ multiprogramming
  ▫ All threads have to progress equally
  ▫ Pure throughput maximization is not enough
• Data-parallel threads are similar to each other in their data access patterns
• However, equal allocation => suboptimal cache utilization


Balanced threads need highly imbalanced partitions

Contributions
• Shared LLC partitioning for balanced data-parallel applications
• Increasing allocation for one thread at a time improves utilization
• Prioritizing each thread in turn ensures balanced progress
• 17% drop in miss rate and 8% drop in execution time on average for a 4-core, 8 MB cache
• Negligible overheads


Outline
• Motivation
• Contributions
• Background
• Memory Reuse Behavior of Threads
• Proposed Scheme
• Evaluation
• Overheads & Limitations
• Conclusion


Way-partitioning
• N-way set-associative cache => each set has N ways or blocks
• Unpartitioned cache
  ▫ Least recently used entry among all ways replaced on a miss
  ▫ Thread-agnostic LRU
• Way-partitioning
  ▫ Each way is owned by one core at a time
  ▫ On a miss, a core replaces the LRU entry among the ways owned by it
  ▫ No restriction on access, only on replacement
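The way-partitioning rules above can be illustrated with a small model of a single cache set. This is a minimal sketch, not the authors' simulator code: the class name `WayPartitionedSet`, the timestamp-based LRU bookkeeping, and the assumption that every core owns at least one way are all illustrative choices.

```python
class WayPartitionedSet:
    """One cache set whose ways are each owned by exactly one core."""

    def __init__(self, owners):
        # owners[w] is the core allowed to replace the block in way w.
        # Assumption: every core that accesses the set owns at least one way.
        self.owners = list(owners)
        self.tags = [None] * len(self.owners)
        self.last_used = [-1] * len(self.owners)
        self.clock = 0

    def access(self, core, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        self.clock += 1
        # Lookup is unrestricted: a hit may land in a way owned by any core.
        for way, stored in enumerate(self.tags):
            if stored == tag:
                self.last_used[way] = self.clock
                return True
        # Replacement is restricted: evict the LRU block among the ways
        # owned by the requesting core.
        owned = [w for w, owner in enumerate(self.owners) if owner == core]
        victim = min(owned, key=lambda w: self.last_used[w])
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return False


# Core 0 owns three ways, core 1 owns one.
s = WayPartitionedSet([0, 0, 0, 1])
for t in "abc":
    s.access(0, t)           # three misses fill core 0's ways
print(s.access(1, "a"))      # True: core 1 hits on data in core 0's way
print(s.access(1, "y"))      # False: core 1 may only replace its own way
```

The second access shows the property the later slides rely on: a core can keep hitting on blocks that sit in another core's partition.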


Per-thread Miss Rate Curves
• Miss rate vs. ways in a single set
• Each thread considered in isolation
[Figure: miss rate (0.0 to 1.0) vs. ways / reuse distance (0 to 18) for Threads 0 to 3, with "working set 1" and "working set 2" marked]

Inefficient Allocation!

Symmetric Memory Access


• Miss-curves symmetric across threads

•  Seen for all benchmarks & cache sizes

[Figure: per-thread miss rate (0.0 to 1.0) vs. ways / reuse distance in three panels: Art, 2^12 sets; Blackscholes, 2^9 sets; Fluidanimate, 2^14 sets]

Utilization through Imbalance
• 32-way cache, 4 threads: default allocation = 8 ways / thread
• Prioritize one thread at a time
• Vary preferred thread identity
[Figure: miss rate vs. ways (0, 8, 32) for three cases, "WS too small", "Improvement opportunity", and "WS too large", with "-" and "+" markers around the default allocation of 8 ways]

Imbalance in partitions benefits the preferred thread

High Imbalance & Unpreferred threads

• Each thread switches between preferred and un-preferred

• Unpreferred thread data remains in preferred partition

• Continues to benefit un-preferred thread even as its partition shrinks

• Imbalance magnifies benefits by reducing pressure on preferred partition


Large preferred partition benefits unpreferred threads too

Proposed Strategy
• Default allocation is inefficient
• Allocate extra ways to a single thread by equally penalizing all other threads
• Select the preferred thread in round-robin manner
  ▫ Ensure balanced progress
• Allocation changes at pre-set execution intervals


Two-Stage Partitioning
• Evaluation Stage
  ▫ Triggers at the start of a new program phase
  ▫ Divide the cache sets into equal-sized segments
  ▫ Each segment is partitioned into a different level of imbalance
  ▫ 32-way cache shared among 4 cores: configurations from 8-8-8-8 -> 29-1-1-1
  ▫ Each core is prioritized in turn
  ▫ Configuration with least number of misses chosen
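The evaluation stage can be sketched in three small steps: enumerate the imbalance levels, map each cache set to a segment, and pick the level whose segment saw the fewest misses. This is a rough sketch, not the authors' implementation. The helper names are hypothetical, and the contiguous set-to-segment mapping is an assumption, since the slides only say the segments are equal-sized.

```python
def imbalance_configs(n_threads=4, total_ways=32):
    """(preferred_ways, other_ways) pairs from balanced to most imbalanced."""
    base = total_ways // n_threads
    configs = []
    for others in range(base, 0, -1):
        preferred = total_ways - others * (n_threads - 1)
        configs.append((preferred, others))
    return configs


def segment_of(set_index, n_sets, n_segments):
    """Map a set to one of n_segments equal-sized contiguous segments.

    Contiguous mapping is an assumption made for this sketch.
    """
    return set_index * n_segments // n_sets


def choose_config(misses_per_segment, configs):
    """Pick the configuration whose segment recorded the fewest misses."""
    best = min(range(len(configs)), key=lambda i: misses_per_segment[i])
    return configs[best]


configs = imbalance_configs()
print(configs)  # starts at (8, 8) and ends at (29, 1)
# Hypothetical per-segment miss counts, one per configuration.
print(choose_config([90, 80, 70, 60, 50, 40, 30, 35], configs))
```

With 4 threads and 32 ways, lowering the un-preferred allocation one way at a time yields exactly eight levels, matching the eight segments shown on the evaluation-stage slides.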


Evaluation Stage
[Figure: cache laid out as segments 1 to 8 against ways, with the ways of each segment divided among Threads 1 to 4]
• Each segment has multiple sets
• Each thread becomes the preferred thread in turn

Capture effects of imbalance on preferred and unpreferred threads

Considering Unpartitioned Cache
• An unpartitioned (thread-agnostic LRU) segment is included in the evaluation
• It replaces a low-imbalance configuration
• Benefits of partitioning are obtained through high levels of imbalance


Unpartitioned Segment
[Figure: the same layout of segments 1 to 8 against ways, with one segment marked "No partitions"]

Stable Stage
• Maintain the chosen configuration till the next program phase change
• Choose preferred thread in round-robin manner
• Basic-block vector tracking used to identify changes in program phase (based on previous work)


Evaluation Framework
• Simulator: Simics-GEMS
• Target: 4-core CMP with 32-way shared L2 cache and 2-way private L1 caches
  ▫ 1 thread per core, 64-byte line size, LRU replacement
• Workload:
  ▫ 9 data-parallel workloads
  ▫ Mix of PARSEC (pthread build) and SPEC OMP suites
  ▫ PARSEC: Blackscholes, Canneal, Fluidanimate, Streamcluster, Swaptions
  ▫ SPEC OMP: Art, Equake, Swim, Wupwise


Baselines
• Unpartitioned cache (thread-agnostic LRU)
• Statically equi-partitioned cache
• A CPI-based adaptive partitioning scheme (Muralidhara et al., IPDPS 2010)
  ▫ Starts with equal partition
  ▫ Proportional partitioning (ways proportional to CPI)
  ▫ Store <ways, CPI> to build a runtime model to predict CPI variations with change in allocation
  ▫ Accelerate critical thread
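The proportional step of the CPI-based baseline can be illustrated as follows. This is only a sketch of "ways proportional to CPI": the function name `proportional_ways` and the largest-remainder rounding are assumptions made here, not details taken from Muralidhara et al., and the runtime CPI model is not modeled at all.

```python
def proportional_ways(cpis, total_ways=32):
    """Split total_ways among threads in proportion to their CPI.

    Rounding uses the largest-remainder method, which is an assumption
    of this sketch rather than a detail of the cited scheme.
    """
    total_cpi = sum(cpis)
    shares = [total_ways * c / total_cpi for c in cpis]
    ways = [int(s) for s in shares]  # floor, since shares are positive
    leftover = total_ways - sum(ways)
    # Hand the leftover ways to the largest fractional remainders.
    order = sorted(range(len(cpis)), key=lambda i: shares[i] - ways[i],
                   reverse=True)
    for i in order[:leftover]:
        ways[i] += 1
    return ways


# Balanced threads have similar CPIs, so the result stays at the default.
print(proportional_ways([2.0, 2.0, 2.0, 2.2]))
```

For balanced data-parallel threads the CPIs are close to each other, so a proportional split lands on or near 8 ways per thread.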

Misses vs. Size
[Figure: misses (log scale, 1.E+03 to 1.E+09) vs. cache size (128K to 512M bytes, 6 to 18 set bits) for Art, Blackscholes, Canneal, Equake, Fluidanimate, Streamcluster, Swaptions, Swim, and Wupwise]

4-core 32-way cache with equal partitions
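The two horizontal axes of this plot, set bits (6 to 18) and cache size (128K to 512M bytes), are related through the 32-way associativity and the 64-byte line size of the evaluated cache. The short sketch below checks that arithmetic; the function name `cache_size_bytes` is illustrative.

```python
def cache_size_bytes(set_bits, ways=32, line_bytes=64):
    """Capacity of a set-associative cache with 2**set_bits sets."""
    return (1 << set_bits) * ways * line_bytes


KIB, MIB = 1024, 1024 ** 2
print(cache_size_bytes(6) // KIB, "KiB")    # smallest size on the axis
print(cache_size_bytes(12) // MIB, "MiB")   # the 8 MB configuration
print(cache_size_bytes(18) // MIB, "MiB")   # largest size on the axis
```

Each additional set bit doubles the capacity, so the 13 set-bit values map one-to-one onto the 13 cache sizes on the axis.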

Results
• Benefits of partitioning strongly tied to cache size
• Partitioning beneficial only when per-thread working set is between the default allocation and the cache capacity
• Proposed method outperforms the baselines where there is potential for benefit
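The condition in the second bullet can be written as a simple three-way classification. This is a sketch under assumptions: the working set is expressed in ways per set, the boundary cases (exactly the default allocation, exactly the full capacity) are resolved arbitrarily, and the function name is hypothetical.

```python
def classify_working_set(working_set_ways, n_threads=4, total_ways=32):
    """Classify a per-thread working set against the cache geometry.

    working_set_ways is the per-thread working set measured in ways per
    set, which is an assumption of this sketch.
    """
    default_ways = total_ways // n_threads
    if working_set_ways <= default_ways:
        return "WS too small"            # already fits in the default share
    if working_set_ways > total_ways:
        return "WS too large"            # does not fit even in the whole cache
    return "improvement opportunity"     # fits only with extra ways


for ws in (6, 20, 40):
    print(ws, classify_working_set(ws))
```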

[Figure: misses and execution time (0 to 1.2) for STATIC-EQ, CPI, and IMB-RR across Swaptions, Blackscholes, Art, Streamcluster, Canneal, Fluidanimate, Equake, Swim, and Wupwise]

Comparison with Unpartitioned: 8 MB cache, 4 cores, 32 ways

[Figure: misses and execution time charts (0 to 1.2) repeated on two further slides for the same nine benchmarks; one slide carries the value labels 30, 15, and 1.6]

Across the Board
• Outperforms the CPI-based scheme in most cases where there is potential for benefit
  ▫ Proportional partitioning generates data points near the default allocation
  ▫ From these starting points the search fails to find the high-utility (high-imbalance) configurations
• No partitioning is best in some cases (Equake)
  ▫ Constructive interference
  ▫ Proposed scheme chooses global LRU appropriately
  ▫ Worst-case 5% increase in time due to evaluation


Overheads
• Space overhead negligible
  ▫ Way partitioning for each segment
• Program phase detection overhead
  ▫ Basic block vector tracking
• For small cache sizes, evaluation stage can increase execution time
  ▫ <1% on average, 5% maximum


Limitations
• Scalability
  ▫ Fine-grained barriers would mean smaller intervals
• Limited exploration of solution space
  ▫ One preferred thread at a time
  ▫ The benefits of high imbalance make the scheme practical


Conclusion
• Simple runtime partitioning for balanced data-parallel programs
• Effective cache utilization and balanced progress achieved through
  A. High imbalance in partitions
  B. Prioritizing each thread in turn
• High imbalance allows un-preferred threads to benefit from the large preferred partition


Thank You!


Questions…

Injecting Extra Imbalance

•  Over-allocation in preferred thread protects long distance accesses of unpreferred thread

[Figure: access frequency vs. ways / SSRD (0, 8, 16) in four panels, T1 (+), T2 (-), T1 (++), and T2 (--), with hits, misses, and extra ways marked]
Long RD accesses not served by shrinking partition
Long RD accesses are protected in preferred partition

Effect of Over-allocation

• Benefits to preferred thread saturate at 14 ways
• Benefits to un-preferred thread increase as allocation falls
• Hits for un-preferred thread are in preferred thread partition

[Figure: references (×10^7) split into miss, private local (self) hit, private foreign hit, shared local (self) hit, and shared foreign hit, plotted against allocation for a thread in the preferred state (8 to 29 ways) and in the un-preferred state (8 down to 1 ways)]

Adapting to Phase Changes
• Changes in program phase need to be identified to trigger evaluation
• Per-thread binary basic block vectors (BBVs) are used to identify the basic blocks touched in each interval
• The Hamming distance between the BBVs of the current and last intervals is used to identify phase changes
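The phase check described above can be sketched with binary BBVs held as lists of 0/1 flags, one per basic block. This is a minimal sketch: the function names and the `threshold` parameter are assumptions, since the slides do not state what distance counts as a phase change.

```python
def hamming_distance(bbv_a, bbv_b):
    """Number of basic blocks touched in exactly one of the two intervals."""
    if len(bbv_a) != len(bbv_b):
        raise ValueError("BBVs must cover the same set of basic blocks")
    return sum(a != b for a, b in zip(bbv_a, bbv_b))


def phase_changed(prev_bbv, curr_bbv, threshold):
    """Flag a phase change when the BBVs differ in more than `threshold` bits.

    The threshold is a hypothetical tuning parameter for this sketch.
    """
    return hamming_distance(prev_bbv, curr_bbv) > threshold


last_interval = [1, 0, 1, 1, 0, 0]
this_interval = [1, 1, 0, 1, 0, 1]
print(hamming_distance(last_interval, this_interval))
print(phase_changed(last_interval, this_interval, threshold=2))
```

A detected change would trigger a new evaluation stage; otherwise the current configuration is kept.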

Considering Unpartitioned Cache
• Time spent in various imbalance configurations for runs showing benefits of partitioning
[Figure: time in percent (0 to 60) per allocation, for 8-8-8-8, 11-7-7-7, 14-6-6-6, 17-5-5-5, 20-4-4-4, 23-3-3-3, 26-2-2-2, and 29-1-1-1; bar labels as listed: 1.1, 3, 0.8, 7.6, 3.8, 6.5, 53.9, 23.41]


Round-robin vs. Critical-thread
• Prioritize the critical thread instead of using round robin
• No significant difference: accelerating the critical thread has the same effect as giving each thread a fair share

[Figure: normalized execution time and normalized misses (0 to 1.2) for Art, Blackscholes, Canneal, Equake, Fluidanimate, Streamcluster, Swaptions, Swim, and Wupwise]

Performance of critical-thread normalized to round-robin
