Date post: | 21-Dec-2015 |
Upload: | jonathan-ward |
ECE/CSC 506 - Yan Solihin 1
An Optimized AMPM-based Prefetcher Coupled with Configurable Cache Line Sizing
Qi Jia, Maulik Bakulbhai Padia, Kashyap Amboju and Huiyang Zhou
Department of Electrical and Computer Engineering
North Carolina State University
Presentation Outline
Access Map Pattern Matching (AMPM) Prefetcher
Problems with AMPM
– Cold zone
– Inaccurate states within zones
Proposed Optimizations
Configurable Block Sizing (CBS)
Two-Level Prefetching
Hardware Overhead
Experimental Results
Conclusion
AMPM
[Figure: AMPM access map over consecutive cache lines (0xAAFF, 0xAB00, ..., 0xAB06, ..., 0xABFF). Each cache line is in one of three states: Init (0), Pre-Fetch (1), or Access (2). The figure marks Access 1, Access 2, Access 3, the current access, and the resulting prefetches.]
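The pattern-matching step in the AMPM figure can be sketched in a few lines. This is a minimal illustration that assumes the stride rule usually described for AMPM: if the blocks at t-k and t-2k were accessed, the block at t+k becomes a prefetch candidate. The function name `ampm_candidates`, the list-based zone layout, and the `degree` limit are illustrative assumptions, not details taken from the slides.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2  # state encoding from the state diagram

def ampm_candidates(zone, t, degree=2):
    """Return up to `degree` block offsets to prefetch after an access at offset t.

    zone -- list of per-block states for one zone (hypothetical layout)
    t    -- offset of the current access inside the zone
    """
    zone[t] = ACCESS
    picks = []
    for k in range(1, len(zone) // 2):
        for stride in (k, -k):          # check forward and backward strides
            back1, back2, target = t - stride, t - 2 * stride, t + stride
            if not all(0 <= i < len(zone) for i in (back1, back2, target)):
                continue
            # Two earlier accesses at this stride predict the next block.
            if zone[back1] == ACCESS and zone[back2] == ACCESS and zone[target] == INIT:
                zone[target] = PREFETCH
                picks.append(target)
                if len(picks) == degree:
                    return picks
    return picks
```

Note that a block already marked Access or Pre-Fetch is never selected, which is exactly why a stale map entry can suppress a useful prefetch.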
Problems with AMPM
Cold Zone
– No pattern is detected before the zone bitmap is evicted from the zone table
Block address:                       …  0x480  0x4c0  0x500  0x540  0x580  …  0x9c0  0xa00  0xa40
Last access before zone eviction:    …    0      2      0      2      0    …    0      2      2
Bitmap after zone eviction:          …    0      0      0      0      0    …    0      0      0
No pattern detected
Problems with AMPM (Cont.)
Inaccurate States in Zone
– The bits in the zone bitmap do not always reflect the actual cache state (e.g., after block evictions)
Block address:    …  0x480  0x4c0  0x500  0x540  0x580  …  0x9c0  0xa00  0xa40
Zone bitmap:      …    2      2      2     “2”     0    …    2      1      1
The bitmap indicates “Access” for the quoted block, but the block was evicted previously.
AMPM cannot prefetch it, since it treats the block as accessed and assumes it remains in the cache.
Prefetch chance lost!
Proposed Optimizations
Common Offset Table (COT)
– Records the most frequently accessed offsets across different pages
– Updated on every demand access
– Prefetches are initiated from the COT only when the COT achieves high accuracy
[Figure: the access maps of page 1 and page 2 feed the Common Offset Table, whose entries hold Pref, Counter, Offset, and LRU fields. Bitmap rows shown: “… 1 2 2 1 0 …”, “… 0 1 2 2 0 …”, and “… 0 1 2 1 0 …”.]
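The three COT rules (count frequent offsets, update on every demand access, prefetch only when accuracy is high) can be illustrated with a toy model. The slides do not say how accuracy is measured or how entries are organized, so everything below is an assumption made for illustration: the class name, a single shared array of saturating counters, the top-4 selection, and the 0.75 accuracy threshold.

```python
class CommonOffsetTable:
    """Toy model: per-offset saturating counters shared by all pages."""

    def __init__(self, blocks_per_page=64, counter_max=63, min_accuracy=0.75):
        self.counts = [0] * blocks_per_page
        self.counter_max = counter_max      # a 6-bit counter saturates at 63
        self.min_accuracy = min_accuracy    # hypothetical threshold
        self.checked = 0                    # demand accesses scored so far
        self.correct = 0                    # accesses that hit a common offset

    def top_offsets(self, n=4):
        """The n most frequently accessed offsets seen so far."""
        ranked = sorted(range(len(self.counts)), key=lambda o: (-self.counts[o], o))
        return [o for o in ranked[:n] if self.counts[o] > 0]

    def on_demand_access(self, offset):
        # Score the table against this access before learning from it.
        self.checked += 1
        if offset in self.top_offsets():
            self.correct += 1
        self.counts[offset] = min(self.counts[offset] + 1, self.counter_max)

    def accuracy(self):
        return self.correct / self.checked if self.checked else 0.0

    def prefetch_offsets(self, page_map):
        """Offsets to prefetch for a cold page, or [] while accuracy is low."""
        if self.accuracy() < self.min_accuracy:
            return []
        return [o for o in self.top_offsets() if page_map[o] == 0]
```

In this sketch a page whose access map is still all zeros can receive prefetches from the shared table, which is the situation the cold zone problem describes.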
Proposed Optimizations (Cont.)
Conflict Table
– Records how inaccurate the current access map information is
– Each entry in the table corresponds to one page
– The entry counter is incremented when an inaccuracy is detected
– The entry counter is reset when the page is evicted
[Figure: a cache miss on a block of the access map page (“… 0 1 2 2 0 …”) updates the Conflict Table. The counters change from 3, 1, 7, …, 4 to 3, 1, 8, …, 4.]
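The conflict table bookkeeping can be sketched as follows. The sketch assumes that an "inaccuracy" means a cache miss on a block that the access map marks as Pre-Fetch or Access, which is what the figure suggests. The class name, the dictionary keyed by page, and the `is_unreliable` threshold are hypothetical; the slides do not state how the counter value is consumed.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2  # access map states

class ConflictTable:
    def __init__(self, counter_max=63):
        self.counters = {}              # page id -> conflict counter
        self.counter_max = counter_max  # 6-bit counter saturates at 63

    def on_cache_miss(self, page, state):
        """Count a miss on a block the access map believes is cached."""
        if state in (PREFETCH, ACCESS):
            current = self.counters.get(page, 0)
            self.counters[page] = min(current + 1, self.counter_max)

    def on_page_evicted(self, page):
        """Reset the counter when the page leaves the access map table."""
        self.counters.pop(page, None)

    def is_unreliable(self, page, threshold=8):
        """Hypothetical policy hook: distrust maps with many conflicts."""
        return self.counters.get(page, 0) >= threshold
```

A miss on a block in the Init state is an ordinary miss and leaves the counter unchanged in this sketch.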
Configurable Cache Line Sizing
A block size monitor selects the best block size for the LLC.
Block size selection algorithm (considers both bandwidth and performance):
• Score = hit – A * (access – hit) * block_size
The selected block size is used to guide LLC prefetching.
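As a worked example of the score formula, the sketch below picks the block size with the highest score. The per-size hit and access counts, the value of A, and the assumption that block_size is expressed in bytes are all made up for illustration; the slides do not give them.

```python
def block_size_score(hits, accesses, block_size, a):
    """Score = hit - A * (access - hit) * block_size."""
    return hits - a * (accesses - hits) * block_size

def select_block_size(stats, a):
    """Pick the best block size.

    stats -- maps block size (assumed bytes) to (hits, accesses) as seen
             by that size's monitor
    a     -- weight of the bandwidth penalty (A in the formula)
    """
    return max(stats, key=lambda size: block_size_score(*stats[size], size, a))

# Hypothetical monitor counts for three candidate sizes.
stats = {64: (700, 1000), 128: (800, 1000), 256: (820, 1000)}
best_low_penalty = select_block_size(stats, a=0.001)
best_high_penalty = select_block_size(stats, a=0.05)
```

With these made-up numbers a small A favors the 128-byte block, because the extra hits outweigh the miss traffic, while a large A falls back to 64 bytes, because every miss at a larger size costs more bandwidth.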
Two-Level Prefetching
Specific to the DPC2 framework.
The “Prefetch” state in the access map is split into “L2 Prefetch” and “LLC Prefetch”.
The main goal is to hide the long main memory latency first, and then to hide the LLC latency.
During prefetch candidate selection, blocks that have not been prefetched are chosen first. If such candidates do not fill up the prefetch degree, blocks in the “LLC Prefetch” state are chosen and transferred into the L2 cache.
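The candidate selection order described above can be sketched as two passes over the offsets proposed by the pattern matcher. The sketch assumes that blocks not yet prefetched are first brought into the LLC, which is an inference from the stated goal of hiding main memory latency first. The state names, the function name, and the (offset, level) return format are illustrative.

```python
INIT, L2_PREFETCH, LLC_PREFETCH, ACCESS = "init", "l2_pf", "llc_pf", "access"

def select_candidates(page_map, pattern_offsets, degree):
    """Pick up to `degree` prefetches as (offset, target_level) pairs."""
    picks = []
    # First pass: blocks that have not been prefetched at all.
    for o in pattern_offsets:
        if len(picks) < degree and page_map[o] == INIT:
            picks.append((o, "LLC"))    # assumed target level
    # Second pass: fill the remaining degree by promoting LLC blocks to L2.
    for o in pattern_offsets:
        if len(picks) < degree and page_map[o] == LLC_PREFETCH:
            picks.append((o, "L2"))
    # Record the new states only after both passes, so a block fetched
    # into the LLC in this call is not promoted in the same call.
    for o, level in picks:
        page_map[o] = LLC_PREFETCH if level == "LLC" else L2_PREFETCH
    return picks
```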
Hardware Overhead
Component                  Storage per entry                                    Entries      Size
Memory Access Map Table    Address tag (64 b), LRU (6 b),                       64 entries   2.047 KB
                           access map (3*64 b)
CBS monitor                ATD                                                  4 ATD        2.872 KB
Common Offset Table        Counter (6 b), LRU status (6 b),                     8 entries    0.45 KB
                           offset map (64*6 b + 64*1 b)
Conflict Table             Counter (6 b)                                        64 entries   0.046 KB
Prefetch Bit               Prefetch (1 b)                                       4096 blks    0.5 KB
Cold Zone MSHR             Tags (64 b), LRU status (5 b)                        32 entries   0.27 KB
Total                                                                                        6.185 KB
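The storage figures can be cross-checked from the per-entry field widths. The short script below assumes 1 KB = 1024 bytes; under that assumption each computed row agrees with the listed size to within 0.001 KB. The CBS monitor row is taken as given (2.872 KB) because the slide does not break it down. The helper name `kb` and the dictionary layout are illustrative.

```python
def kb(bits_per_entry, entries):
    """Storage in KB (assuming 1 KB = 1024 bytes) for a table of entries."""
    return bits_per_entry * entries / 8 / 1024

budget = {
    "Memory Access Map Table": kb(64 + 6 + 3 * 64, 64),
    "Common Offset Table": kb(6 + 6 + 64 * 6 + 64 * 1, 8),
    "Conflict Table": kb(6, 64),
    "Prefetch Bit": kb(1, 4096),
    "Cold Zone MSHR": kb(64 + 5, 32),
}
CBS_MONITOR_KB = 2.872  # taken from the slide, not recomputed
total_kb = sum(budget.values()) + CBS_MONITOR_KB

for name, size in budget.items():
    print(f"{name}: {size:.3f} KB")
print(f"Total: {total_kb:.3f} KB")
```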
Experimental Results
The optimized prefetcher outperforms the baseline without prefetching by 10.8%.
Compared with the original AMPM, it achieves a speedup of 0.76% on average.
Conclusions
We optimize the AMPM prefetcher by introducing two hardware components: the common offset table and the conflict table.
We combine the AMPM prefetcher with configurable block sizing and a two-level prefetching mechanism.
Question