Imbalanced Cache Partitioning for Balanced Data-Parallel Programs
Abhisek Pan & Vijay S. Pai, Electrical and Computer Engineering, Purdue University
MICRO-46, 2013

Transcript
Page 1: Imbalanced Cache Partitioning for Balanced Data-Parallel Programs

Imbalanced Cache Partitioning for Balanced Data-Parallel Programs

Abhisek Pan & Vijay S. Pai
Electrical and Computer Engineering, Purdue University
MICRO-46, 2013

Page 2: Motivation

Motivation
• Last-level cache partitioning has been heavily studied for multiprogramming workloads
• Multithreading ≠ multiprogramming
▫ All threads have to progress equally
▫ Pure throughput maximization is not enough
• Data-parallel threads are similar to each other in their data access patterns
• However, equal allocation leads to suboptimal cache utilization

Page 3: Motivation (takeaway)

Balanced threads need highly imbalanced partitions

Page 4: Contributions

Contributions
• Shared LLC partitioning for balanced data-parallel applications
• Increasing the allocation for one thread at a time improves utilization
• Prioritizing each thread in turn ensures balanced progress
• 17% drop in miss rate and 8% drop in execution time on average for a 4-core, 8 MB cache
• Negligible overheads

Page 5: Outline

Outline
• Motivation
• Contributions
• Background
• Memory Reuse Behavior of Threads
• Proposed Scheme
• Evaluation
• Overheads & Limitations
• Conclusion

Page 6: Way-partitioning

Way-partitioning
• An N-way set-associative cache has N ways (blocks) per set
• Unpartitioned cache
▫ On a miss, the least recently used entry among all ways in the set is replaced
▫ Thread-agnostic LRU
• Way-partitioning (sketch below)
▫ Each way is owned by one core at a time
▫ On a miss, a core replaces the LRU entry among the ways it owns
▫ No restriction on access, only on replacement
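The following is a minimal sketch, not taken from the paper, of how one set of a way-partitioned cache could behave under these rules; the class and variable names are illustrative assumptions.

class WayPartitionedSet:
    """Toy model of one set of a way-partitioned cache (illustrative only)."""
    def __init__(self, ways, way_owner):
        self.tags = [None] * ways         # tag stored in each way
        self.last_used = [0] * ways       # timestamp of the last access to each way
        self.way_owner = way_owner        # way index -> owning core id
        self.clock = 0

    def access(self, core, tag):
        self.clock += 1
        # Hits are unrestricted: any core may hit on any way.
        for w, t in enumerate(self.tags):
            if t == tag:
                self.last_used[w] = self.clock
                return True               # hit
        # On a miss, the core may only evict the LRU entry among the ways it owns.
        owned = [w for w in range(len(self.tags)) if self.way_owner[w] == core]
        victim = min(owned, key=lambda w: self.last_used[w])
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return False                      # miss

# Example: a 32-way set split 8-8-8-8 among 4 cores.
owner = [c for c in range(4) for _ in range(8)]
s = WayPartitionedSet(32, owner)
s.access(0, 0x1A)    # miss; fills one of core 0's ways
s.access(1, 0x1A)    # hit, even though the line lives in core 0's partition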

Page 7: Per-thread Miss Rate Curves

Per-thread Miss Rate Curves
• Miss rate vs. ways in a single set (see the sketch after this slide)
• Each thread considered in isolation

[Figure: per-thread miss rate vs. reuse distance (ways) for Threads 0-3; two knees in the curves are labeled "working set 1" and "working set 2".]
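Curves like these can be derived from per-thread reuse (LRU stack) distances. Below is a minimal sketch under the usual assumption that an access hits in a w-way LRU set exactly when its stack distance is below w; the trace and function names are illustrative, not the paper's tooling.

from collections import OrderedDict

def stack_distances(addresses):
    """LRU stack distance of each access (inf for cold misses)."""
    stack = OrderedDict()                 # least recent first, most recent last
    dists = []
    for a in addresses:
        if a in stack:
            dist = len(stack) - list(stack).index(a) - 1
            del stack[a]
        else:
            dist = float("inf")
        stack[a] = True                   # (re)insert as most recently used
        dists.append(dist)
    return dists

def miss_rate_curve(addresses, max_ways):
    """Miss rate as a function of the number of ways given to this thread."""
    dists = stack_distances(addresses)
    return [sum(d >= w for d in dists) / len(dists) for w in range(1, max_ways + 1)]

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]    # one thread's line addresses in a set
print(miss_rate_curve(trace, 4))          # miss rate for 1, 2, 3, 4 ways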

Page 8: Per-thread Miss Rate Curves (same figure, annotated)

Inefficient Allocation!

Page 9: Symmetric Memory Access

Symmetric Memory Access
• Miss curves are symmetric across threads
• Seen for all benchmarks and cache sizes

[Figure: per-thread miss rate vs. reuse distance (ways) for Threads 0-3, in three panels: Art (2^12 sets), Blackscholes (2^9 sets), and Fluidanimate (2^14 sets).]

Page 10: Utilization through Imbalance

Utilization through Imbalance
• 32-way cache, 4 threads; default allocation = 8 ways per thread
• Prioritize one thread at a time
• Vary which thread is preferred

[Figure: miss-rate curve with allocations of 0, 8, and 32 ways marked, annotated "WS too small", "Improvement opportunity", and "WS too large".]

Page 11: Utilization through Imbalance (takeaway)

Imbalance in partitions benefits the preferred thread

Page 12: High Imbalance and Unpreferred Threads

High Imbalance and Unpreferred Threads
• Each thread switches between the preferred and unpreferred roles
• An unpreferred thread's data remains in the preferred partition
• That data continues to benefit the unpreferred thread even as its own partition shrinks
• Imbalance magnifies the benefit by reducing pressure on the preferred partition

Page 13: High Imbalance and Unpreferred Threads (takeaway)

Large preferred partition benefits unpreferred threads too

Page 14: Proposed Strategy

Proposed Strategy
• The default equal allocation is inefficient
• Allocate extra ways to a single thread by penalizing all other threads equally
• Select the preferred thread in round-robin manner
▫ Ensures balanced progress
• The allocation changes at preset execution intervals

Page 15: Two-Stage Partitioning

Two-Stage Partitioning
• Evaluation Stage
▫ Triggered at the start of a new program phase
▫ Divide the cache sets into equal-sized segments
▫ Each segment is partitioned with a different level of imbalance
▫ For a 32-way cache shared among 4 cores, the configurations range from 8-8-8-8 to 29-1-1-1
▫ Each core is prioritized in turn
▫ The configuration with the fewest misses is chosen (sketch below)
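A minimal sketch of this selection step, assuming the configuration ladder named on the slide (8-8-8-8 up to 29-1-1-1, with the preferred core gaining three ways per step); how per-segment miss counts are collected is simulator-specific, and the helper names are illustrative.

def imbalance_configs(ways=32, cores=4):
    """Yield (preferred_ways, other_ways): 8-8-8-8, 11-7-7-7, ..., 29-1-1-1."""
    default = ways // cores                        # 8 ways per core by default
    for step in range(default):                    # 0 .. 7
        yield default + step * (cores - 1), default - step

def choose_configuration(segment_misses):
    """Pick the imbalance level whose evaluation segment saw the fewest misses."""
    return min(segment_misses, key=segment_misses.get)

# Example with made-up per-segment miss counts, one per evaluated configuration.
misses = dict(zip(imbalance_configs(), [900, 820, 700, 610, 560, 530, 515, 520]))
best = choose_configuration(misses)
print("chosen allocation:", best)                  # (26, 2), i.e. 26-2-2-2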

Page 16: Evaluation Stage

Evaluation Stage
• Each segment has multiple sets
• Each thread becomes the preferred thread in turn

[Figure: the cache drawn as ways (horizontal) by segments 1-8 (vertical), with a different partition of the ways among Threads 1-4 in each segment.]

Page 17: Evaluation Stage (takeaway)

Capture effects of imbalance on preferred and unpreferred threads

Page 18: Considering Unpartitioned Cache

Considering Unpartitioned Cache
• An unpartitioned (thread-agnostic LRU) segment is included in the evaluation
• It replaces a low-imbalance configuration
• The benefits of partitioning are obtained through high levels of imbalance

Page 19: Unpartitioned Segment

Unpartitioned Segment

[Figure: the same ways-by-segments layout, with one segment left unpartitioned ("No partitions") across Threads 1-4.]

Page 20: Stable Stage

Stable Stage
• Maintain the chosen configuration until the next program-phase change
• Choose the preferred thread in round-robin manner (sketch below)
• Basic-block-vector tracking is used to identify changes in program phase (based on previous work)
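A minimal sketch of the stable stage's rotation, assuming the chosen imbalance level is simply re-applied each interval with the preferred core advanced round-robin; the function name and interval count are illustrative.

def stable_stage_partitions(chosen, cores=4, intervals=8):
    """chosen = (preferred_ways, other_ways) from the evaluation stage.
    Yields one per-core way allocation per interval, rotating the preferred core."""
    pref, rest = chosen
    for i in range(intervals):
        preferred_core = i % cores                 # round-robin choice
        yield [pref if c == preferred_core else rest for c in range(cores)]

for alloc in stable_stage_partitions((26, 2), intervals=4):
    print(alloc)    # [26, 2, 2, 2], [2, 26, 2, 2], [2, 2, 26, 2], [2, 2, 2, 26]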

Page 21: Evaluation Framework

Evaluation Framework
• Simulator: Simics-GEMS
• Target: 4-core CMP with a 32-way shared L2 cache and 2-way private L1 caches
▫ 1 thread per core, 64-byte line size, LRU replacement
• Workload:
▫ 9 data-parallel workloads
▫ Mix of the PARSEC (pthreads build) and SPEC OMP suites
▫ PARSEC: Blackscholes, Canneal, Fluidanimate, Streamcluster, Swaptions
▫ SPEC OMP: Art, Equake, Swim, Wupwise

Page 22: Baselines

Baselines
• Unpartitioned cache (thread-agnostic LRU)
• Statically equi-partitioned cache
• A CPI-based adaptive partitioning scheme (Muralidhara et al., IPDPS 2010)
▫ Starts with an equal partition
▫ Proportional partitioning (ways proportional to CPI; sketch below)
▫ Stores <ways, CPI> pairs to build a runtime model that predicts CPI variation with changes in allocation
▫ Accelerates the critical thread
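A minimal sketch of the proportional step of this baseline as summarized above (ways proportional to per-thread CPI); the rounding policy and the minimum allocation are assumptions on my part, and the runtime CPI model is not shown.

def proportional_partition(cpis, total_ways=32, min_ways=1):
    """Give each core a number of ways roughly proportional to its CPI."""
    shares = [total_ways * c / sum(cpis) for c in cpis]
    ways = [max(min_ways, int(s)) for s in shares]          # floor, with a minimum
    # Hand any leftover ways to the cores with the largest fractional remainder.
    by_remainder = sorted(range(len(cpis)), key=lambda i: ways[i] - shares[i])
    for k in range(total_ways - sum(ways)):
        ways[by_remainder[k % len(ways)]] += 1
    return ways

print(proportional_partition([1.2, 1.0, 0.9, 0.9]))         # [10, 8, 7, 7]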

Page 23: Imbalanced Cache Partitioning for Balanced Data-Parallel ... · Symmetric Memory Access 9 MICRO-46, 2013 • Miss-curves symmetric across threads • Seen for all benchmarks & cache

1.E+03'

1.E+04'

1.E+05'

1.E+06'

1.E+07'

1.E+08'

1.E+09'

6' 7' 8' 9' 10' 11' 12' 13' 14' 15' 16' 17' 18'

Misses%

Set%Bits%

Art' Blackscholes' Canneal' Equake' Fluidanimate'

Streamcluster' SwapFons' Swim' Wupwise'

Cache&Size&(Bytes)&

Misses&

128K& 256K& 512K& &1M& &2M& &4M& &8M& 16M& 32M& 64M& 128M& 256M& 512M&

Misses vs size

23

MICRO-46, 2013

4-core 32-way cache with equal partitions

Page 24: Results

Results
• The benefits of partitioning are strongly tied to cache size
• Partitioning is beneficial only when the per-thread working set lies between the default allocation and the cache capacity
• The proposed method outperforms the baselines where there is potential for benefit

Page 25: Comparison with Unpartitioned: 8 MB cache, 4 cores, 32 ways

[Figure: misses and execution time normalized to the unpartitioned cache for Swaptions, Blackscholes, Art, Streamcluster, Canneal, Fluidanimate, Equake, Swim, and Wupwise, comparing the statically equi-partitioned baseline, the CPI-based baseline, and the proposed scheme.]

Page 26: Comparison with Unpartitioned: 32 MB cache, 4 cores, 32 ways

[Figure: misses and execution time normalized to the unpartitioned cache for the same nine benchmarks and the same three schemes; the values 30, 15, and 1.6 are annotated for bars that exceed the plotted range.]

Page 27: Comparison with Unpartitioned: 128 MB cache, 4 cores, 32 ways

[Figure: misses and execution time normalized to the unpartitioned cache for the same nine benchmarks and the same three schemes.]

Page 28: Across the Board

Across the Board
• Outperforms the CPI-based scheme in most cases where there is potential for benefit
▫ Proportional partitioning generates data points near the default allocation
▫ From these starting points the search fails to find the high-utility (high-imbalance) configurations
• No partitioning is best in some cases (e.g., Equake)
▫ Constructive interference
▫ The proposed scheme chooses global LRU appropriately
▫ Worst case: a 5% increase in execution time due to evaluation

Page 29: Overheads

Overheads
• Space overhead is negligible
▫ Way-partitioning state for each segment
• Program-phase detection overhead
▫ Basic-block-vector tracking
• For small cache sizes, the evaluation stage can increase execution time
▫ <1% on average, 5% maximum

Page 30: Limitations

Limitations
• Scalability
▫ Fine-grained barriers would mean smaller intervals
• Limited exploration of the solution space
▫ Only one preferred thread at a time
▫ The benefits of high imbalance make the scheme practical

Page 31: Conclusion

Conclusion
• A simple runtime partitioning scheme for balanced data-parallel programs
• Effective cache utilization and balanced progress are achieved through
A. high imbalance in partitions, and
B. prioritizing each thread in turn
• High imbalance allows unpreferred threads to benefit from the large preferred partition

Page 32: Thank You

Thank You!

Questions…

Page 33: Injecting Extra Imbalance

Injecting Extra Imbalance
• Over-allocation to the preferred thread protects the long-reuse-distance accesses of the unpreferred thread

[Figure: frequency histograms of hits and misses over ways/SSRD for a preferred thread T1 (gaining extra ways) and an unpreferred thread T2 (losing ways), annotated "Long RD accesses not served by shrinking partition" and "Long RD accesses are protected in preferred partition".]

Page 34: Effect of Over-allocation

Effect of Over-allocation
• Benefits to the preferred thread saturate at 14 ways
• Benefits to the unpreferred thread increase as its allocation falls
• Hits for the unpreferred thread land in the preferred thread's partition

[Figure: references (x 10^7) broken down into misses, private self (local) hits, private foreign hits, shared self (local) hits, and shared foreign hits. One panel shows a thread in the unpreferred state as its allocation shrinks from 8 to 1 ways; the other shows a thread in the preferred state as its allocation grows from 8 to 29 ways.]

Page 35: Adapting to Phase Changes

Adapting to Phase Changes
• Changes in program phase must be identified to trigger the evaluation stage
• Per-thread binary basic block vectors (BBVs) record the basic blocks touched in each interval
• The Hamming distance between the BBVs of the current and previous intervals identifies phase changes (sketch below)
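A minimal sketch of this detection step; the vector width and the change threshold are illustrative assumptions, not values from the paper.

def bbv(touched_blocks, num_blocks=1024):
    """Binary basic-block vector: bit i is set if basic block i ran this interval."""
    v = [0] * num_blocks
    for b in touched_blocks:
        v[b % num_blocks] = 1
    return v

def hamming(a, b):
    """Number of bit positions in which the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def phase_changed(prev_bbv, curr_bbv, threshold=64):
    """Declare a phase change if the BBVs differ in more than `threshold` bits."""
    return hamming(prev_bbv, curr_bbv) > threshold

prev = bbv(range(0, 200))          # blocks touched in the previous interval
curr = bbv(range(150, 400))        # blocks touched in the current interval
print(phase_changed(prev, curr))   # True: the program has moved to new code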

Page 36: Imbalanced Cache Partitioning for Balanced Data-Parallel ... · Symmetric Memory Access 9 MICRO-46, 2013 • Miss-curves symmetric across threads • Seen for all benchmarks & cache

1.1# 3# 0.8#7.6# 3.8# 6.5#

53.9#

23.41#

0#10#20#30#40#50#60#

8-8-8-8#

11-7-7-7#

14-6-6-6#

17-5-5-5#

20-4-4-4#

23-3-3-3#

26-2-2-2#

29-1-1-1#

Time#in#Percent#

Alloca;ons#

Considering Unpartitioned Cache • Time spent in various imbalance

configurations for runs showing benefits of partitioning

36

MICRO-46, 2013

Page 37: Round-robin vs. Critical-thread

Round-robin vs. Critical-thread
• Prioritize the critical thread instead of using round-robin
• No significant difference: accelerating the critical thread has the same effect as giving each thread a fair share

[Figure: execution time and misses of the critical-thread variant normalized to round-robin for Art, Blackscholes, Canneal, Equake, Fluidanimate, Streamcluster, Swaptions, Swim, and Wupwise.]

