
1

A Hybrid Adaptive Feedback Based Prefetcher

Santhosh Verma, David Koppelman and Lu Peng
Louisiana State University

2

Motivation

Can’t always expect high prefetch accuracy & timeliness

Potential can be lost when these are low

Adaptive schemes adjust aggressiveness based on effectiveness

Adaptation and selectiveness are as important as address prediction

3

Our Scheme – Hybrid Adaptive Prefetcher (HAP)

Start with good address prediction – a Stride / Sequential hybrid

Sequential prefetching scheme requires no warmup

Stride prefetcher is more robust

Issue prefetches selectively

Incorporate a published adaptive prefetch method: Feedback Directed Prefetching (Srinath et al., HPCA 2007)

Improve with bandwidth adaptation

4

Related Work – Feedback Directed Prefetching (HPCA 2007)

Prefetcher aggressiveness defined by prefetch distance and degree

Aggressiveness adjusted dynamically based on three feedback metrics:

Percentage of useful prefetches

Percentage of late prefetches

Percentage of prefetches which cause demand misses (cache pollution)

5

Differences between FDP and our scheme

Use both L1 and L2 prefetching; the scheme is modified to support L1/L2

Use a hybrid stride / sequential prefetching scheme

A bandwidth-based feedback metric is proposed

No cache pollution metric

6

Stride/Sequential Prefetching Scheme – Training Stride Prefetcher

Use a PC-indexed stride prediction scheme

Stride Prediction Table Entry

1. Compute the new stride using this field (the previously recorded address) and the current address value

2. Store the computed stride

3. Increment the count for an unchanged stride; reset it otherwise

The entry is trained if the count is above a threshold value (a minimal training sketch follows)
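A minimal sketch of the training step described above, assuming a direct-mapped, PC-indexed table; the field names, table size, and training threshold are illustrative assumptions, not values from the talk:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical PC-indexed stride prediction table entry.
struct StrideEntry {
    uint64_t tag       = 0;   // PC that owns this entry
    uint64_t last_addr = 0;   // previously observed address ("this field" above)
    int64_t  stride    = 0;   // most recently computed stride
    unsigned count     = 0;   // confidence counter
    bool     valid     = false;
};

constexpr std::size_t TABLE_SIZE   = 256;  // illustrative
constexpr unsigned    TRAIN_THRESH = 2;    // illustrative training threshold
StrideEntry stride_table[TABLE_SIZE];

// Train on a demand access: compute the new stride from the stored address,
// store it, and increment the count for an unchanged stride (reset otherwise).
void train(uint64_t pc, uint64_t addr) {
    StrideEntry &e = stride_table[pc % TABLE_SIZE];
    if (!e.valid || e.tag != pc) {                 // allocate on a new PC
        e = StrideEntry{pc, addr, 0, 0, true};
        return;
    }
    int64_t new_stride = int64_t(addr) - int64_t(e.last_addr);
    if (new_stride == e.stride) {
        ++e.count;                                 // unchanged stride: gain confidence
    } else {
        e.stride = new_stride;                     // store the computed stride
        e.count  = 0;                              // reset on a changed stride
    }
    e.last_addr = addr;
}

// The entry is trained once the count reaches the threshold.
bool is_trained(const StrideEntry &e) {
    return e.valid && e.count >= TRAIN_THRESH;
}
```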

7

Stride/Sequential Prefetching Scheme – Issuing Prefetches

Check the stride table on a demand miss / hit to a prefetched line; issue stride prefetches based on degree and distance

Sequential prefetches (sketch below):

If there is no valid / trained stride entry

If the previous line is present in the cache

Issue sequential prefetches based on degree
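A sketch of this issue path. The degree and distance knobs and the cache hooks (in_cache, issue_prefetch) are hypothetical stand-ins, and treating the distance as a stride-multiple offset is an assumption:

```cpp
#include <cstdint>

constexpr uint64_t LINE_SIZE = 64;   // bytes, illustrative

// Hypothetical hooks into the cache model (stubs here), not the authors' interface.
bool in_cache(uint64_t /*addr*/)       { return false; }
void issue_prefetch(uint64_t /*addr*/) { /* enqueue a prefetch request */ }

// Aggressiveness knobs set by the feedback mechanism (see the later slides).
unsigned degree = 2, distance = 16;  // illustrative middle setting

// Called on a demand miss or a hit to a prefetched line.  has_trained_stride
// and stride come from the stride-table lookup sketched on the previous slide.
void issue_prefetches(uint64_t addr, bool has_trained_stride, int64_t stride) {
    if (has_trained_stride && stride != 0) {
        // Stride prefetches: 'degree' requests, starting 'distance' strides ahead.
        for (unsigned i = 0; i < degree; ++i)
            issue_prefetch(addr + (distance + i) * stride);
    } else if (in_cache(addr - LINE_SIZE)) {
        // No valid / trained stride entry but the previous line is in the cache:
        // issue 'degree' sequential (next-line) prefetches.
        for (unsigned i = 1; i <= degree; ++i)
            issue_prefetch(addr + i * LINE_SIZE);
    }
}
```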

8

Adjusting Aggressiveness with Feedback Metrics

Prefetch Accuracy – Percentage of prefetches used by a demand request

Prefetch Lateness – Percentage of accurate prefetches which are late

Bandwidth Contention – Percentage of clock cycles during which cache bandwidth is above a threshold

Evaluate separately for L1 and L2

Evaluate periodically after a fixed number of cycles; adjust aggressiveness if justified (the metrics are written out as ratios below)
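Written out, the three metrics are per-interval ratios; the notation is ours, and the denominators follow the implementation slides below:

```latex
\[
\text{Accuracy} = \frac{\text{prefetched lines used by a demand request}}{\text{total prefetches issued}}
\qquad
\text{Lateness} = \frac{\text{useful prefetches that arrive late}}{\text{useful prefetches}}
\]
\[
\text{Bandwidth contention} = \frac{\text{cycles with outstanding requests above the threshold}}{\text{total cycles in the interval}}
\]
```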

9

Storage-Efficient Miss Status Holding Registers (MSHRs)

Used to track all in-flight / in-queue memory requests at both cache levels

MSHR Entry

1. Entry allocated for each outstanding L1 and / or L2 request; valid bit set

2. Two-bit cache-level field indicates L1 only, L2 only, or combined L1 / L2

3. Two prefetch bits indicate prefetch requests (one per cache level)

4. Concurrent L1 and L2 requests to the same line share the same MSHR entry (see the struct sketch below)
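A sketch of the shared MSHR entry and its allocate-or-merge behaviour; the layout and helper below are assumptions consistent with the four points above, not the authors' code:

```cpp
#include <cstdint>

// One entry tracks an outstanding line at L1, L2, or both; this is the
// storage-efficient alternative to separate per-level MSHRs.  Layout is illustrative.
struct MshrEntry {
    uint64_t line_addr   = 0;
    bool     valid       = false;   // (1) set when an entry is allocated
    bool     at_l1       = false;   // (2) two-bit cache-level field:
    bool     at_l2       = false;   //     L1 only, L2 only, or combined L1 / L2
    bool     l1_prefetch = false;   // (3) two prefetch bits, one per cache level
    bool     l2_prefetch = false;
};

// (4) Concurrent L1 and L2 requests to the same line share one entry: a new
// request either allocates the entry or merges into the existing one.
void record_request(MshrEntry &e, uint64_t line_addr, bool is_l2, bool is_prefetch) {
    if (!e.valid) {
        e = MshrEntry{};
        e.line_addr = line_addr;
        e.valid = true;
    }
    if (is_l2) { e.at_l2 = true; if (is_prefetch) e.l2_prefetch = true; }
    else       { e.at_l1 = true; if (is_prefetch) e.l1_prefetch = true; }
}
```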

10

Implementing Feedback Metrics

Prefetch Accuracy (sketch below)

Prefetch bit set for a prefetched line brought into the cache

Bit set in the MSHR for in-flight / in-queue prefetched lines

Increment the accurate count if a demand request finds a set bit; reset the bit after incrementing

Accuracy is based on the percentage of total prefetches issued
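A sketch of the accuracy bookkeeping, assuming a per-line prefetch bit in the cache and simple interval counters; the names are ours:

```cpp
#include <cstdint>

// Interval counters for the accuracy metric; the names are illustrative.
uint64_t prefetches_issued = 0;   // incremented when a prefetch is issued
uint64_t prefetches_used   = 0;   // the "accurate count" from the slide

// The cache is assumed to keep a per-line prefetch bit; minimal stand-in here.
struct CacheLine { bool prefetch_bit = false; };

// A prefetched line brought into the cache gets its prefetch bit set.
void on_prefetch_fill(CacheLine &line) { line.prefetch_bit = true; }

// A demand request that finds a set bit counts the prefetch as useful,
// then resets the bit so each prefetched line is counted once.
void on_demand_hit(CacheLine &line) {
    if (line.prefetch_bit) {
        ++prefetches_used;
        line.prefetch_bit = false;
    }
}

// Accuracy is the useful fraction of all prefetches issued in the interval.
double accuracy() {
    return prefetches_issued ? double(prefetches_used) / prefetches_issued : 0.0;
}
```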

11

Implementing Feedback Metrics

Prefetch Lateness (sketch below)

Prefetch bit(s) set in the MSHR for a prefetched in-flight / in-queue line

On a demand miss, a late prefetch is detected:

If a valid MSHR entry exists for this miss

If the prefetch bit for the correct cache level is set

Reset the bit after incrementing the late count

Lateness is based on the percentage of useful prefetches
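A sketch of late-prefetch detection on a demand miss, using a minimal stand-in for the MSHR entry above; the counter names are illustrative:

```cpp
#include <cstdint>

// Minimal stand-in for the shared MSHR entry sketched earlier.
struct MshrEntry {
    uint64_t line_addr   = 0;
    bool     valid       = false;
    bool     l1_prefetch = false;
    bool     l2_prefetch = false;
};

uint64_t late_count        = 0;   // interval counter, illustrative name
uint64_t useful_prefetches = 0;   // maintained by the accuracy bookkeeping

// On a demand miss at a given level: if a valid MSHR entry exists for the miss
// and the prefetch bit for that level is set, the prefetch is late.
void on_demand_miss(MshrEntry *entry, bool is_l2) {
    if (entry == nullptr || !entry->valid) return;
    bool &pf_bit = is_l2 ? entry->l2_prefetch : entry->l1_prefetch;
    if (pf_bit) {
        ++late_count;          // demand request arrived before the prefetch finished
        pf_bit = false;        // reset the bit after incrementing the late count
    }
}

// Lateness is the fraction of useful prefetches that arrived late.
double lateness() {
    return useful_prefetches ? double(late_count) / useful_prefetches : 0.0;
}
```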

12

Implementing Feedback Metrics

Bandwidth Contention – 1 (sketch below)

Use the MSHR to monitor the total outstanding L1 and L2 requests in every cycle

Increment a counter for every cycle that the total is above a threshold

The contention rate is based on the percentage of total cycles

Bandwidth Contention – 2

Prefetches are not issued if outstanding requests are above a threshold
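A sketch of both parts of the bandwidth mechanism; the numeric thresholds are placeholders, since the slides do not give the actual values:

```cpp
#include <cstdint>

// Illustrative thresholds; the slides do not give the actual values.
constexpr unsigned CONTENTION_THRESH = 8;   // outstanding requests per cycle
constexpr unsigned ISSUE_GATE_THRESH = 12;  // block new prefetches above this

uint64_t contended_cycles = 0;
uint64_t total_cycles     = 0;

// Part 1: called once per cycle with the total outstanding L1 + L2 requests
// counted from the MSHR.
void per_cycle(unsigned outstanding_requests) {
    ++total_cycles;
    if (outstanding_requests > CONTENTION_THRESH)
        ++contended_cycles;
}

// Contention rate: fraction of cycles in which the total was above threshold.
double contention_rate() {
    return total_cycles ? double(contended_cycles) / total_cycles : 0.0;
}

// Part 2: a new prefetch is simply not issued while the queues are busy.
bool may_issue_prefetch(unsigned outstanding_requests) {
    return outstanding_requests <= ISSUE_GATE_THRESH;
}
```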

13

Adjusting Aggressiveness

Evaluate metrics at fixed intervals

Determine whether each metric is high or low based on a threshold

May adjust aggressiveness based on the following criteria (a sketch of one possible policy follows the policy table)

Aggressiveness Policy
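The policy table itself is not reproduced in this transcript, so the sketch below shows one plausible FDP-style reading of the criteria (raise aggressiveness when prefetches are accurate but late and bandwidth is available, lower it when accuracy is low or contention is high); the specific rules and thresholds are assumptions:

```cpp
// The decision rules below are assumptions (an FDP-style reading of the
// feedback metrics); the talk's actual "Aggressiveness Policy" table is not
// reproduced in this transcript.
enum class Level { VeryConservative, Conservative, Middle, Aggressive, VeryAggressive };

struct Metrics { double accuracy, lateness, contention; };

// Illustrative thresholds for classifying each metric as high or low.
constexpr double ACC_HI = 0.75, LATE_HI = 0.10, BW_HI = 0.50;

Level adjust(Level current, const Metrics &m) {
    int level = static_cast<int>(current);
    if (m.accuracy >= ACC_HI && m.lateness >= LATE_HI && m.contention < BW_HI)
        ++level;    // accurate but late, bandwidth available: be more aggressive
    else if (m.accuracy < ACC_HI || m.contention >= BW_HI)
        --level;    // inaccurate or bandwidth constrained: back off
    // Aggressiveness is adjusted in increments of one (next slide).
    if (level < 0) level = 0;
    if (level > 4) level = 4;
    return static_cast<Level>(level);
}
```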

14

Prefetcher Aggressiveness Levels

Aggressiveness adjusted in increments of one

Prefetcher Aggressiveness Levels (figure): levels range from Very Conservative through Middle Aggressiveness to Very Aggressive (an illustrative degree / distance mapping is sketched below)
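Since aggressiveness is defined by prefetch degree and distance (slide 4), each level maps to a (degree, distance) pair; the values below echo the published FDP table and are only placeholders for whatever this scheme actually uses:

```cpp
// Illustrative mapping from aggressiveness level to (degree, distance).  The
// numbers echo the published FDP table, NOT settings confirmed by this talk.
struct Setting { unsigned degree; unsigned distance; };

constexpr Setting LEVELS[] = {
    {1,  4},   // Very Conservative
    {1,  8},   // Conservative
    {2, 16},   // Middle Aggressiveness
    {4, 32},   // Aggressive
    {4, 64},   // Very Aggressive
};
```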

15

Experimental Evaluation - Setup

Evaluate 15 SPEC CPU 2006 benchmarks using the CMPSim simulator

Evaluate for three competition configurations:

Config 1 – 2048 KB L2 cache, unlimited bandwidth

Config 2 – 2048 KB L2 cache, limited bandwidth

Config 3 – 512 KB L2 cache, limited bandwidth

Limited-bandwidth configs allow one L1 issue per cycle and one L2 issue per 10 cycles

16

Experimental Evaluation - Setup

Compare our scheme, the Hybrid Adaptive Prefetcher (HAP), to four configurations:

No prefetching

Middle-aggressive stride

Very aggressive stride

Modified Feedback Directed Prefetcher – uses both L1 / L2 prefetching; does not use a cache pollution metric

17

Results - Expectations

Very aggressive stride will do better on some benchmarks and worse on others

Adaptive schemes will perform at least as well as non-adaptive

Unlimited bandwidth and large cache configurations benefit aggressive schemes

18

Results – Bandwidth Unlimited, 2 MB L2 Config

•HAP outperforms other prefetchers for all benchmarks except lbm

•Performance benefit compared to mid-aggressive stride is 11% on average and 46% versus no prefetching.

19

Results – Bandwidth Limited, 2 MB L2 Config

•HAP is best on average. Aggressive stride performs best on three benchmarks (mcf, lbm and soplex).

•Performance benefit compared to mid-aggressive stride is 9% on average and 45% versus no prefetching.

20

Results – Bandwidth Limited, 512 KB L2 Config

•Results are similar to Config 2

•Performance benefit compared to mid-aggressive stride is 8% on average and 44% versus no prefetching.

21

Results (All benchmarks) – Bandwidth Limited, 2 MB L2 Config

•Additional benchmarks are mostly unaffected by prefetching

•Performance benefit compared to mid-aggressive stride is 6% on average and 29% versus no prefetching for all benchmarks.

22

Conclusions

A well-designed, adaptive prefetching scheme is very effective

Very aggressive stride works best for some benchmarks

A cache pollution metric may improve results

23

THANK YOU

QUESTIONS?