Imbalanced Cache Partitioning for Balanced Data-Parallel Programs
Abhisek Pan & Vijay S. Pai, Electrical and Computer Engineering, Purdue University
MICRO-46, 2013
Transcript
  • Imbalanced Cache Partitioning for Balanced Data-Parallel Programs

    Abhisek Pan & Vijay S. Pai
    Electrical and Computer Engineering, Purdue University
    MICRO-46, 2013


  • Motivation

    • Last-level cache partitioning has been heavily studied for multiprogramming workloads
    • Multithreading ≠ multiprogramming
      ▫  All threads have to progress equally
      ▫  Pure throughput maximization is not enough
    • Data-parallel threads are similar to each other in their data access patterns
    • However, equal allocation => suboptimal cache utilization

    Balanced threads need highly imbalanced partitions

  • Contributions

    • Shared LLC partitioning for balanced data-parallel applications
    • Increasing the allocation for one thread at a time improves utilization
    • Prioritizing each thread in turn ensures balanced progress
    • 17% drop in miss rate and 8% drop in execution time on average for a 4-core, 8MB cache
    • Negligible overheads

  • Outline

    • Motivation
    • Contributions
    • Background
    • Memory Reuse Behavior of Threads
    • Proposed Scheme
    • Evaluation
    • Overheads & Limitations
    • Conclusion

  • Way-partitioning

    • N-way set-associative cache => each set has N ways or blocks
    • Unpartitioned cache
      ▫  Least recently used entry among all ways replaced on a miss
      ▫  Thread-agnostic LRU
    • Way-partitioning
      ▫  Each way is owned by one core at a time
      ▫  On a miss, a core replaces the LRU entry among the ways it owns (see the sketch below)
      ▫  No restriction on access, only on replacement
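To make the replacement rule concrete, here is a minimal Python sketch (illustrative code, not the authors' simulator): one way-partitioned set in which any core may hit in any way, but on a miss a core evicts the LRU line only among the ways it owns. The `owner` map and the timestamp bookkeeping are assumptions of the sketch.

```python
# Minimal sketch of way-partitioned LRU for a single cache set.
# Assumptions: owner[w] maps each way to the core that owns it, and
# last_used[w] is a logical timestamp updated on every access.

class WayPartitionedSet:
    def __init__(self, num_ways, owner):
        self.tags = [None] * num_ways       # tag stored in each way
        self.last_used = [0] * num_ways     # LRU timestamps
        self.owner = owner                  # way -> owning core id
        self.clock = 0

    def access(self, core, tag):
        self.clock += 1
        # Lookup: a core may hit in any way, regardless of ownership.
        for w, t in enumerate(self.tags):
            if t == tag:
                self.last_used[w] = self.clock
                return "hit"
        # Miss: replace the LRU entry among the ways owned by this core.
        owned = [w for w in range(len(self.tags)) if self.owner[w] == core]
        victim = min(owned, key=lambda w: self.last_used[w])
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return "miss"

# Example: 8-way set, 2 ways per core for 4 cores (an equal-partition split).
s = WayPartitionedSet(8, owner=[0, 0, 1, 1, 2, 2, 3, 3])
print(s.access(core=0, tag=0x1A))   # miss, fills one of core 0's ways
print(s.access(core=0, tag=0x1A))   # hit
```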


  • Per-thread Miss Rate Curves

    • Miss rate vs. ways in a single set
    • Each thread considered in isolation

    [Figure: miss rate vs. allocated ways (reuse distance) for Threads 0-3; the
    per-thread curves track each other closely, with sharp drops at working set 1
    and working set 2]

    Inefficient Allocation!
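The curves above are standard "miss rate vs. allocated ways" curves; one common way to build them is from an LRU stack-distance profile, as in this sketch (the `miss_rate_curve` helper and the toy trace are hypothetical, not the paper's profiling infrastructure):

```python
# Sketch: deriving a miss-rate-vs-ways curve from an LRU stack-distance
# histogram. An access with stack distance d misses whenever it is given
# fewer than d+1 ways; None marks a cold miss.

def miss_rate_curve(stack_distances, max_ways):
    """stack_distances: per-access LRU stack distances within a set.
    Returns the miss rate for allocations of 1..max_ways ways."""
    total = len(stack_distances)
    curve = []
    for ways in range(1, max_ways + 1):
        misses = sum(1 for d in stack_distances if d is None or d >= ways)
        curve.append(misses / total)
    return curve

# Example: a synthetic per-access stack-distance trace (None = cold miss);
# most reuses fit in 2 ways, a second working set needs about 8.
trace = [0, 1, 0, 1, 5, 6, 7, 0, 1, None]
print(miss_rate_curve(trace, max_ways=8))
```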

  • Symmetric Memory Access

    • Miss curves are symmetric across threads
    • Seen for all benchmarks & cache sizes

    [Figure: per-thread miss rate vs. ways (reuse distance) for Threads 0-3 in
    Art (2^12 sets), Blackscholes (2^9 sets), and Fluidanimate (2^14 sets)]


  • Utilization through Imbalance

    • 32-way cache, 4 threads – default allocation = 8 ways / thread
    • Prioritize one thread at a time
    • Vary preferred thread identity

    [Figure: miss rate vs. ways (0 to 32, default 8) for three cases: working set
    too small, improvement opportunity, and working set too large]

    Imbalance in partitions benefits the preferred thread


  • High Imbalance & Unpreferred Threads

    • Each thread switches between preferred and unpreferred
    • Unpreferred thread data remains in the preferred partition
    • This continues to benefit the unpreferred thread even as its own partition shrinks
    • Imbalance magnifies the benefit by reducing pressure on the preferred partition

    Large preferred partition benefits unpreferred threads too

  • Proposed Strategy

    • Default allocation is inefficient
    • Allocate extra ways to a single thread by equally penalizing all other threads (sketched below)
    • Select the preferred thread in round-robin manner
      ▫  Ensures balanced progress
    • Allocation changes at pre-set execution intervals
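As a sketch of this allocation rule (illustrative parameters, not the paper's exact hardware mechanism), the fragment below takes an equal number of ways from every unpreferred thread, gives them to the preferred thread, and rotates the preferred thread round-robin across intervals:

```python
# Sketch: build an imbalanced way allocation by taking `penalty` ways from each
# unpreferred thread and giving them all to the preferred thread, then rotate
# the preferred thread round-robin across execution intervals.

def imbalanced_allocation(total_ways, num_threads, preferred, penalty):
    base = total_ways // num_threads                 # e.g. 32 / 4 = 8
    alloc = [base - penalty] * num_threads           # unpreferred shrink equally
    alloc[preferred] = base + penalty * (num_threads - 1)
    return alloc

for interval in range(8):
    preferred = interval % 4                         # round-robin preferred thread
    print(interval, imbalanced_allocation(32, 4, preferred, penalty=7))
```

With `penalty=7` this produces the 29-1-1-1 style configuration mentioned on the next slide; `penalty=0` is the default 8-8-8-8 split.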

  • Two-Stage Partitioning

    • Evaluation Stage
      ▫  Triggered at the start of a new program phase
      ▫  Divide the cache sets into equal-sized segments
      ▫  Each segment is partitioned with a different level of imbalance
      ▫  32-way cache shared among 4 cores – configurations from 8-8-8-8 to 29-1-1-1
      ▫  Each core is prioritized in turn
      ▫  The configuration with the least number of misses is chosen


  • Evaluation Stage

    [Figure: the cache sets divided into 8 segments; within each segment the 32
    ways are partitioned among Threads 1-4 with a different level of imbalance]

    • Each segment has multiple sets
    • Each thread becomes the preferred thread in turn (a code sketch follows below)

    Capture effects of imbalance on preferred and unpreferred threads
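A minimal sketch of how the evaluation stage could pick a configuration (hypothetical code: the per-segment miss counters are assumed, and rotating the preferred core is omitted for brevity): each of the eight segments runs one imbalance level, and the level whose segment suffers the fewest misses wins.

```python
# Sketch of the evaluation stage: each equal-sized segment of sets runs one
# candidate way-partitioning configuration during the evaluation interval; the
# configuration whose segment saw the fewest misses is then adopted.

def candidate_configs(total_ways=32, num_threads=4):
    base = total_ways // num_threads
    configs = []
    for penalty in range(base):                      # 8-8-8-8 ... 29-1-1-1
        alloc = [base - penalty] * num_threads
        alloc[0] = base + penalty * (num_threads - 1)
        configs.append(alloc)
    return configs

def choose_config(segment_misses, configs):
    """segment_misses[i] = misses observed in segment i (assumed counter)."""
    best = min(range(len(configs)), key=lambda i: segment_misses[i])
    return configs[best]

configs = candidate_configs()                        # 8 configs for 8 segments
print(choose_config([900, 850, 700, 640, 610, 600, 620, 650], configs))
```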

  • Considering Unpartitioned Cache

    • An unpartitioned (thread-agnostic LRU) segment is included in the evaluation
    • It replaces a low-imbalance configuration
    • The benefits of partitioning are obtained through high levels of imbalance

  • Unpartitioned Segment

    [Figure: the segmented cache from the evaluation stage, with one segment left
    unpartitioned (no per-thread way ownership)]

  • Stable Stage

    • Maintain the chosen configuration until the next program phase change (see the controller sketch below)
    • Choose the preferred thread in round-robin manner
    • Basic-block vector tracking is used to identify changes in program phase (based on previous work)
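Putting the two stages together, a high-level controller might look like the following sketch (the helpers are stand-ins for simulator or hardware hooks; the interval count, phase-change stub, and evaluation result are made up for illustration):

```python
# High-level sketch of the two-stage controller: re-evaluate on a program phase
# change, otherwise stay in the stable stage, keeping the chosen imbalance
# level and rotating the preferred thread round-robin every interval.

def evaluation_stage():
    return [26, 2, 2, 2]                 # stub: shape picked by segment dueling

def detect_phase_change(interval):
    return interval % 10 == 0            # stub: pretend a new phase every 10 intervals

def apply_partition(alloc):
    print("partition:", alloc)           # stub: would program the way-partition registers

def rotate(shape, preferred):
    """Give the largest share in `shape` to the preferred thread, the smallest to the rest."""
    big, small = max(shape), min(shape)
    return [big if t == preferred else small for t in range(len(shape))]

shape, preferred = [8, 8, 8, 8], 0
for interval in range(12):
    if detect_phase_change(interval):
        shape = evaluation_stage()       # evaluation stage picks the imbalance level
    apply_partition(rotate(shape, preferred))
    preferred = (preferred + 1) % 4      # stable stage: round-robin preferred thread
```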

  • Evaluation Framework

    • Simulator: Simics-GEMS
    • Target: 4-core CMP with 32-way shared L2 cache and 2-way private L1 caches
      ▫  1 thread per core, 64-byte line size, LRU replacement
    • Workload:
      ▫  9 data-parallel workloads
      ▫  Mix of PARSEC (pthread build) and SPEC OMP suites
      ▫  PARSEC – Blackscholes, Canneal, Fluidanimate, Streamcluster, Swaptions
      ▫  SPEC OMP – Art, Equake, Swim, Wupwise

  • Baselines

    • Unpartitioned cache (thread-agnostic LRU)
    • Statically equi-partitioned cache
    • A CPI-based adaptive partitioning scheme (Muralidhara et al., IPDPS 2010)
      ▫  Starts with equal partitions
      ▫  Proportional partitioning (ways proportional to CPI; sketched below)
      ▫  Stores the observed data points to build a runtime model that predicts CPI variation with changes in allocation
      ▫  Accelerates the critical thread
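To make the CPI-based baseline concrete, here is a rough sketch of proportional partitioning, i.e. ways assigned roughly in proportion to each thread's measured CPI (illustrative only; Muralidhara et al.'s scheme additionally builds a runtime CPI model and accelerates the critical thread):

```python
# Sketch: proportional partitioning, giving each thread a share of the ways
# proportional to its measured CPI (slower threads get more cache).

def proportional_ways(cpis, total_ways=32):
    shares = [c / sum(cpis) * total_ways for c in cpis]
    alloc = [max(1, int(s)) for s in shares]                    # at least 1 way each
    alloc[alloc.index(max(alloc))] += total_ways - sum(alloc)   # absorb rounding slack
    return alloc

print(proportional_ways([1.2, 1.1, 1.3, 1.2]))   # near-equal CPIs -> near 8-8-8-8
```

Because balanced data-parallel threads have nearly equal CPIs, this lands close to the default 8-8-8-8 allocation, which is why (as noted on the "Across the Board" slide) the CPI-based search rarely reaches the high-imbalance configurations.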

  • Misses vs. Cache Size

    4-core 32-way cache with equal partitions

    [Figure: total misses (log scale, 1e3 to 1e9) vs. cache size (128KB to 512MB,
    i.e. 6 to 18 set-index bits) for Art, Blackscholes, Canneal, Equake,
    Fluidanimate, Streamcluster, Swaptions, Swim, and Wupwise]

  • Results

    • Benefits of partitioning are strongly tied to cache size
    • Partitioning is beneficial only when the per-thread working set is between the default allocation and the cache capacity
    • The proposed method outperforms the baselines where there is potential for benefit

  • 0"

    0.2"

    0.4"

    0.6"

    0.8"

    1"

    1.2"

    Swap-ons"

    Blackscholes"

    Art"

    Streamcluster"

    Canneal"

    Fluidanim

    ate"

    Equake"

    Swim"

    Wupwise"

    STATICEEQ" CPI" IMBERR"

    0"

    0.2"

    0.4"

    0.6"

    0.8"

    1"

    1.2"

    Swap-ons"

    Blackscholes"

    Art"

    Streamcluster"

    Canneal"

    Fluidanim

    ate"

    Equake"

    Swim"

    Wupwise"

    STATICEEQ" CPI" IMBERR"

    Comparison with Unpartitioned: 8 MB cache, 4 cores, 32 ways

    25

    MICRO-46, 2013

    Mis

    ses

    Exec

    utio

    n Ti

    me

  • 0"

    0.2"

    0.4"

    0.6"

    0.8"

    1"

    1.2"

    Swap-ons"

    Blackscholes"

    Art"

    Streamcluster"

    Canneal"

    Fluidanim

    ate"

    Equake"

    Swim"

    Wupwise"

    STATICEEQ" CPI" IMBERR"

    0"

    0.2"

    0.4"

    0.6"

    0.8"

    1"

    1.2"

    Swap-ons"

    Blackscholes"

    Art"

    Streamcluster"

    Canneal"

    Fluidanim

    ate"

    Equake"

    Swim"

    Wupwise"

    STATICEEQ" CPI" IMBERR"

    Comparison with Unpartitioned: 32 MB cache, 4 cores, 32 ways

    26

    MICRO-46, 2013

    Mis

    ses

    Exec

    utio

    n Ti

    me

    30 15 1.6

  • 0"

    0.2"

    0.4"

    0.6"

    0.8"

    1"

    1.2"

    Swap-ons"

    Blackscholes"

    Art"

    Streamcluster"

    Canneal"

    Fluidanim

    ate"

    Equake"

    Swim"

    Wupwise"

    STATICEEQ" CPI" IMBERR"

    0"

    0.2"

    0.4"

    0.6"

    0.8"

    1"

    1.2"

    Swap-ons"

    Blackscholes"

    Art"

    Streamcluster"

    Canneal"

    Fluidanim

    ate"

    Equake"

    Swim"

    Wupwise"

    STATICEEQ" CPI" IMBERR"

    Comparison with Unpartitioned: 128 MB cache, 4 cores, 32 ways

    27

    MICRO-46, 2013

    Mis

    ses

    Exec

    utio

    n Ti

    me

  • Across the Board

    • Outperforms the CPI-based scheme in most cases where there is potential for benefit
      ▫  Proportional partitioning generates data points near the default allocation
      ▫  From these starting points, the search fails to find the high-utility (high-imbalance) configurations
    • No partitioning is best in some cases (e.g., Equake)
      ▫  Constructive interference
      ▫  The proposed scheme chooses global LRU appropriately
      ▫  Worst case: 5% increase in execution time due to evaluation

  • Overheads

    • Space overhead is negligible
      ▫  Way partitioning for each segment
    • Program phase detection overhead
      ▫  Basic block vector tracking
    • For small cache sizes, the evaluation stage can increase execution time

  • Limitations

    • Scalability
      ▫  Fine-grained barriers would mean smaller intervals
    • Limited exploration of the solution space
      ▫  One preferred thread at a time
      ▫  The benefits of high imbalance make the scheme practical

  • Conclusion

    • Simple runtime partitioning for balanced data-parallel programs
    • Effective cache utilization and balanced progress achieved through
      A. High imbalance in partitions, and
      B. Prioritizing each thread in turn
    • High imbalance allows unpreferred threads to benefit from the large preferred partition

  • Thank You!

    Questions…

  • Injecting Extra Imbalance

    • Over-allocation to the preferred thread protects the long-distance accesses of the unpreferred thread

    [Figure: frequency vs. ways / SSRD histograms for a preferred thread (T1) and
    an unpreferred thread (T2) as extra ways shift from T2 to T1 (T1 +, T2 -;
    then T1 ++, T2 --), with hits and misses marked. Long reuse-distance accesses
    are not served by the shrinking unpreferred partition, but are protected in
    the preferred partition.]

  • Effect of Over-allocation

    • Benefits to the preferred thread saturate at 14 ways
    • Benefits to the unpreferred thread increase as its allocation falls
    • Hits for the unpreferred thread are in the preferred thread's partition

    [Figure: references (×10^7) broken down into misses and private/shared,
    self/foreign hits, for a thread in the preferred state (allocation growing
    from 8 to 29 ways) and in the unpreferred state (allocation shrinking from
    8 to 1 way)]

  • Adapting to Phase Changes

    • Changes in program phase need to be identified to trigger evaluation
    • Per-thread binary basic block vectors (BBVs) are used to identify the basic blocks touched in each interval
    • The Hamming distance between the BBVs of the current and previous intervals is used to identify phase changes (see the sketch below)
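A minimal sketch of this detection step (the vector width and threshold are illustrative values, not taken from the paper): each interval sets one bit per basic block executed, and a large Hamming distance from the previous interval's vector is treated as a phase change.

```python
# Sketch: phase-change detection from binary basic-block vectors (BBVs).
# THRESHOLD and the vector width are illustrative values, not from the paper.

THRESHOLD = 64        # assumed: minimum Hamming distance that counts as a phase change

def bbv(blocks_touched, num_blocks=1024):
    v = [0] * num_blocks
    for b in blocks_touched:
        v[b % num_blocks] = 1          # set a bit for every basic block executed
    return v

def phase_changed(prev_bbv, curr_bbv):
    hamming = sum(a != b for a, b in zip(prev_bbv, curr_bbv))
    return hamming > THRESHOLD         # large distance => new program phase

prev = bbv(range(0, 300))
curr = bbv(range(250, 700))
print(phase_changed(prev, curr))       # True: the touched-block sets differ a lot
```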

  • Considering Unpartitioned Cache

    • Time spent in the various imbalance configurations, for runs showing
      benefits of partitioning:

      Allocation:  8-8-8-8  11-7-7-7  14-6-6-6  17-5-5-5  20-4-4-4  23-3-3-3  26-2-2-2  29-1-1-1
      Time (%):    1.1      3.0       0.8       7.6       3.8       6.5       53.9      23.41

  • Round-robin vs. Critical-thread

    • Prioritize the critical thread instead of using round-robin
    • No significant difference – accelerating the critical thread has the same
      effect as giving each thread a fair share

    [Figure: execution time and misses of the critical-thread variant, normalized
    to round-robin, for Art, Blackscholes, Canneal, Equake, Fluidanimate,
    Streamcluster, Swaptions, Swim, and Wupwise]

