ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation
Qisi Wang, Hui-Shun Hung, Chien-Fu Chen
Outline
Data Prefetching
Existing Data Prefetchers
• Stride Data Prefetcher
• Offset Prefetcher (Best-Offset Prefetcher)
• Look-Ahead Prefetcher (Signature Path Prefetcher)
Experiment Result
• Tool Background
• Simulation Result
Conclusion
Data Prefetching (Background)
Prefetching: fetching data before it is needed
• Reduces compulsory misses
• Reduces memory access latency if
- Prefetching accuracy is high
- Prefetches are issued early enough
Goal: predict which addresses will be needed in the future
Next-N-Lines Prefetching
• Always prefetch the next N cache lines after a demand access or a demand miss
• Pros
- Easy to implement
- Suitable for sequential access
• Cons
- Wastes bandwidth on unwanted data if the access pattern is irregular
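A minimal sketch of next-N-lines prefetching, for illustration only. The 64-byte line size and the function name `next_n_lines` are assumptions, not taken from the project code.

```python
BLOCK_SIZE = 64  # bytes per cache line (assumed value for illustration)


def next_n_lines(demand_addr, n, block_size=BLOCK_SIZE):
    """Return the byte addresses of the n cache lines that follow demand_addr."""
    # Align the demand address down to the start of its cache line.
    base = (demand_addr // block_size) * block_size
    # The candidates are simply the next n consecutive lines.
    return [base + i * block_size for i in range(1, n + 1)]
```

For example, `next_n_lines(0x1008, 3)` yields the three lines that follow the line containing `0x1008`, whether or not the program will ever touch them. That is the bandwidth cost listed under Cons.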
Data Prefetching (Background)
Offset Prefetching
• Prefetch the address at an offset X from the demanded address
• If X = 1 => Next-Line Prefetching
[Figure: a prefetcher with offset X takes the demanded address A and prefetches address A + X]
Stride Prefetcher
A kind of offset prefetcher with a fixed distance
Two kinds of stride prefetcher
• Program Counter (PC) based
- Record the distance between the memory accesses made by a load instruction
- The next time the same load instruction is fetched, prefetch last address + distance
• Cache block address based
- Prefetch A + X, A + 2X, A + 3X, ...
- The stream buffer is a special case of this type of prefetcher
– Avoids cache pollution
– On a load miss, check the stream buffer and pop the block into the cache
– If the stream buffer also misses, allocate a new stream buffer
Cons
• The distance (stride) is fixed
• Several variable-offset schemes have been proposed
- Best-Offset (BO) Prefetcher
- Signature Path Prefetcher (SPP)
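The PC-based stride scheme can be sketched as below. This is an illustrative sketch, not the project's implementation: the class name `PCStridePrefetcher`, the unbounded dictionary used as the table, and the rule of waiting until the same stride is seen twice before prefetching are all assumptions.

```python
class PCStridePrefetcher:
    """Toy PC-indexed stride prefetcher (hypothetical, unbounded table)."""

    def __init__(self):
        # pc -> (last address seen, last stride observed)
        self.table = {}

    def access(self, pc, addr):
        """Train on one load and return a prefetch address, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: nothing to compare against yet.
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Assumed confirmation rule: prefetch only when the same non-zero
        # stride was observed on two consecutive accesses.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None
```

With a load at one PC touching addresses 100, 164, 228, the third access returns 292 (last address + distance). A load whose distance keeps changing never triggers a prefetch, which is the fixed-stride limitation listed under Cons.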
Best-Offset Prefetcher (Idea)
Variable offset chosen through a learning procedure
• Finds the best offset value for each application
• Several candidate offsets are tested
The RR (recent requests) table records completed prefetch requests
• When a prefetch of line Y completes with current offset O, Y - O is saved into the RR table
Best-Offset Prefetcher (Learning)
In a learning phase, every offset in the list is tested once per round
• Each L2 access tests one offset
• DPC version: 46 offsets; paper version: 52 offsets
If the tested offset hits in the RR table, its score is incremented by 1
• All scores are reset to 0 when a learning phase begins
The phase ends when the maximum number of rounds is reached (e.g. 100 rounds) or some offset reaches SCORE_MAX (DPC version: 31)
The offset with the highest score becomes the best offset
• A new learning phase then starts
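The learning loop can be sketched as follows. This is a simplified sketch under stated assumptions, not the DPC-2 or paper code: the RR table is modeled as an unbounded Python set, a "hit" for access line X and tested offset d is taken to mean that X - d is in the RR table, the phase ends as soon as a score reaches `score_max` rather than at the end of the round, and the names `BestOffsetLearner`, `on_fill` and `on_access` are hypothetical.

```python
class BestOffsetLearner:
    """Toy model of the Best-Offset learning phase (line addresses only)."""

    def __init__(self, offsets, score_max=31, round_max=100):
        self.offsets = list(offsets)
        self.score_max = score_max
        self.round_max = round_max
        self.rr = set()        # simplification: unbounded recent-requests table
        self.best_offset = 1   # assumed starting offset (next-line)
        self._start_phase()

    def _start_phase(self):
        # All scores are reset to 0 when a learning phase begins.
        self.scores = {d: 0 for d in self.offsets}
        self.index = 0
        self.round = 0

    def on_fill(self, line, offset_used):
        """A prefetch of `line` issued with `offset_used` completed: record Y - O."""
        self.rr.add(line - offset_used)

    def on_access(self, line):
        """Test one candidate offset on this L2 access; return the line to prefetch."""
        d = self.offsets[self.index]
        if (line - d) in self.rr:
            self.scores[d] += 1
        # Move to the next candidate; a round ends once every offset was tested.
        self.index += 1
        if self.index == len(self.offsets):
            self.index = 0
            self.round += 1
        # End of phase: adopt the highest-scoring offset and start over.
        if self.scores[d] >= self.score_max or self.round >= self.round_max:
            self.best_offset = max(self.offsets, key=lambda o: self.scores[o])
            self._start_phase()
        return line + self.best_offset
```

Feeding it a stream that advances three lines per access, with candidate offsets 1 to 4 and every prefetch completing immediately, makes offset 3 win each phase. Prefetch timeliness, which the RR table captures in the real design, is not modeled here.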
Best-Offset Prefetcher
Degree-1 prefetcher (prefetches only one address per access)
• Prefetching with two offsets results in many useless prefetches
The prefetcher is turned off if the best score is too low
• BAD_SCORE is the threshold
• The learning procedure keeps running
The MSHR threshold varies with the BO score and the L3 access rate
Signature Path Prefetcher
Path confidence-based prefetcher
History-based lookahead prefetching
SPP tables are trained by L2 accesses
Prefetching depends on
• The signature and pattern in SPP
• The overall probability
Signature Path Prefetcher
Table Updating
• When L2 accesses a page, the corresponding signature table entry is updated
- The offset is updated
- The offset difference (delta) is used to generate the new signature
- The old signature is used to update the pattern table
The same pattern will have the same signature
• Reduces training time and the number of pattern table (PT) entries
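The table update can be sketched as below. This is a hedged sketch, not the project's code: the 12-bit signature, the shift-by-3-and-XOR hash, and the 7-bit sign-magnitude delta encoding follow my reading of the SPP paper and are assumptions here. The class `SPPTables` and its dictionary-based tables are hypothetical simplifications.

```python
SIG_BITS = 12    # assumed signature width
SIG_SHIFT = 3    # assumed shift amount in the signature hash
SIG_MASK = (1 << SIG_BITS) - 1


def encode_delta(delta):
    """Assumed 7-bit sign-magnitude encoding of a block delta."""
    return (abs(delta) & 0x3F) | (0x40 if delta < 0 else 0)


def next_signature(sig, delta):
    """Fold a delta into the signature (assumed shift-and-XOR hash)."""
    return ((sig << SIG_SHIFT) ^ encode_delta(delta)) & SIG_MASK


class SPPTables:
    def __init__(self):
        self.signature_table = {}  # page -> (last offset, signature)
        self.pattern_table = {}    # signature -> {"c_sig": n, "deltas": {delta: n}}

    def update(self, page, offset):
        """Train on one L2 access to block `offset` of `page`; return the new signature."""
        entry = self.signature_table.get(page)
        if entry is None:
            self.signature_table[page] = (offset, 0)
            return 0
        last_offset, old_sig = entry
        delta = offset - last_offset
        if delta == 0:
            return old_sig
        # The old signature indexes the pattern table entry that gets trained.
        pt = self.pattern_table.setdefault(old_sig, {"c_sig": 0, "deltas": {}})
        pt["c_sig"] += 1
        pt["deltas"][delta] = pt["deltas"].get(delta, 0) + 1
        # The delta generates the new signature stored for this page.
        new_sig = next_signature(old_sig, delta)
        self.signature_table[page] = (offset, new_sig)
        return new_sig
```

Two pages accessed with the same delta sequence end up with the same signature and train the same pattern table entries, which is how sharing reduces training time and PT storage.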
Signature Path Prefetcher
Prefetching
• Look up the signature of the currently accessed page
• At the i-th prefetch depth, choose the delta with the highest probability P_i = C_delta / C_sig
• If the product of all P_i so far is larger than the threshold
- Prefetch current address + delta
- Use the delta to update the signature and access the pattern table again
• If the product falls below the threshold, the procedure ends
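The lookahead loop can be sketched as follows. It is an illustrative sketch under assumptions: the pattern table is a plain dictionary of counters, the signature hash (12 bits, shift by 3, XOR with a 7-bit sign-magnitude delta) is assumed from the SPP paper, and the `threshold`, `max_depth` and `blocks_per_page` values and the stop-at-page-boundary rule are hypothetical choices.

```python
def next_signature(sig, delta, bits=12, shift=3):
    """Assumed SPP-style hash: shift the signature and XOR in the encoded delta."""
    encoded = (abs(delta) & 0x3F) | (0x40 if delta < 0 else 0)
    return ((sig << shift) ^ encoded) & ((1 << bits) - 1)


def lookahead(pattern_table, sig, offset, threshold=0.25,
              max_depth=8, blocks_per_page=64):
    """Return the block offsets to prefetch, walking the signature path."""
    prefetches = []
    confidence = 1.0  # running product of the per-depth probabilities P_i
    for _ in range(max_depth):
        entry = pattern_table.get(sig)
        if not entry or entry["c_sig"] == 0:
            break  # no history for this signature
        # Pick the delta with the highest counter, i.e. the highest P_i.
        delta, c_delta = max(entry["deltas"].items(), key=lambda kv: kv[1])
        confidence *= c_delta / entry["c_sig"]
        if confidence < threshold:
            break  # path confidence too low: stop looking ahead
        offset += delta
        if not 0 <= offset < blocks_per_page:
            break  # simplification: do not cross the page boundary
        prefetches.append(offset)
        # Speculatively advance the signature and look up the pattern table again.
        sig = next_signature(sig, delta)
    return prefetches
```

Because the confidence is a product, a long chain of moderately likely deltas is cut off sooner than a chain of near-certain ones, so prefetch depth adapts to how predictable the pattern is.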
System Setting
CPU: TimingSimpleCPU
Parameter       L1 Caches (Data/Instruction)   L2 Cache
Size            16 KB                          128 KB
Associativity   2                              8
Tag Latency     2 cycles                       20 cycles
Data Latency    2 cycles                       20 cycles
MSHR Size       4 entries                      16 entries
Replacement     LRU                            LRU
Gem5 Implementation
[Block diagram. Components: CPU, L1DCache, L1ICache, L2 Cache, WriteQueue, MSHR, Prefetcher, PriorityQueue, MemoryInterface]
L2 Cache-Prefetcher Interface
[Block diagram. Components: L2 Cache, WriteQueue, MSHR, Prefetcher, PriorityQueue, MemoryInterface. Labels: Notify on Access & Fill; insert; hit/miss, PC, address; set, way, is prefetch, evicted address; Compute Prefetch]
Benchmark Setting
Prefetcher Configuration
• Basic PF types: Baseline, Stride (PC & Addr)
• DPC-2 PF types: Best Offset, SPP, AMPM
Benchmark
• SPEC 2006
- 450.soplex
- 454.calculix
- 456.hmmer
- 462.libquantum
- 998.specrand
Conclusion
Contribution
• Open-source GitHub repository @ hfsken/gem5-with-DPC-2-prefetcher
- With a DPC-2 wrapper for adding DPC PFs
- Integrated with the following DPC PFs: Best-Offset, AMPM, Stride, SPP
Summary
• For a short-term running time …
- The Best-Offset prefetcher performs better on benchmarks that have a more regular access pattern and a higher overall miss rate
- The performance gain for random access patterns is negligible
Future Work
• Complete the documentation in the GitHub repo
• Analyze benchmark behavior in detail in the report
References
[1] Pierre Michaud, "Best-Offset Hardware Prefetching", IEEE HPCA, 2016
[2] Pierre Michaud, "A Best-Offset Prefetcher", DPC-2, 2015
[3] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson and Z. Chishti, "Path confidence based lookahead prefetching", IEEE/ACM MICRO, 2016
[4] Jinchun Kim, Paul V. Gratz and A. L. Narasimha Reddy, "Lookahead Prefetching with Signature Path", DPC-2, 2015
[5] Course slides of Prof. Onur Mutlu, CMU
[6] Course slides of Prof. Mikko Lipasti, UW Madison