Date post: 21-Dec-2015
ECE/CSC 506 - Yan Solihin 1

An Optimized AMPM-based Prefetcher Coupled with Configurable Cache Line Sizing

Qi Jia, Maulik Bakulbhai Padia, Kashyap Amboju and Huiyang Zhou

Department of Electrical and Computer Engineering

North Carolina State University


Presentation Outline

• Access Map Pattern Matching (AMPM) Prefetcher
• Problems with AMPM
  – Cold zone
  – Inaccurate states within zones
• Proposed Optimizations
• Configurable Block Sizing (CBS)
• Two-Level Prefetching
• Hardware Overhead
• Experimental Results
• Conclusion


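AMPM

[Figure: an access map over consecutive cache lines 0xAAFF, 0xAB00, …, 0xABFF. Accesses 1, 2 and 3 mark lines in the map, and a prefetch is issued relative to the current access. State diagram: each line is in state Init (0), Pre-Fetch (1) or Access (2); the transitions are labeled "Access" (two arrows) and "Prefetch" (one arrow).]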
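The slide shows only the figure, so the matching rule below is an assumption based on how AMPM is commonly described: after a demand access to line t, a line t+k becomes a prefetch candidate when lines t-k and t-2k are both in the Access state (and symmetrically in the backward direction). The function name, the `degree` parameter and the list-based map are hypothetical; this is a minimal sketch, not the authors' implementation.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2  # state encodings as labeled in the figure

def ampm_candidates(access_map, t, degree=2):
    """Mark line t as accessed and return the line indices chosen for prefetch.

    Assumed rule: stride k matches when t-k and t-2k (or t+k and t+2k for the
    backward direction) are both in the ACCESS state and the target is INIT.
    """
    access_map[t] = ACCESS
    n = len(access_map)
    picks = []
    for k in range(1, n // 2):
        for sign in (+1, -1):  # +1: forward stride, -1: backward stride
            target = t + sign * k
            prev1 = t - sign * k
            prev2 = t - 2 * sign * k
            in_range = all(0 <= i < n for i in (target, prev1, prev2))
            if (in_range and len(picks) < degree
                    and access_map[prev1] == ACCESS
                    and access_map[prev2] == ACCESS
                    and access_map[target] == INIT):
                picks.append(target)
    for p in picks:
        access_map[p] = PREFETCH  # remember that the line was prefetched
    return picks
```

With accesses to lines 1, 2 and then 3 of an otherwise empty map, the stride-1 pattern matches and line 4 is selected.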


Problems with AMPM

Cold Zone

– No pattern is detected before the zone bitmap is evicted from the zone table.

[Figure: zone bitmap over blocks 0x480, 0x4c0, 0x500, 0x540, 0x580, …, 0x9c0, 0xa00, 0xa40.
Bitmap: … … 0 2 0 2 0 … … 0 2 2 (annotated "Last access before zone eviction")
Bitmap: … … 0 0 0 0 0 … … 0 0 0 (annotated "No pattern detected")]


Problems with AMPM (Cont.)

Inaccurate States in Zone

– The bits in the zone bitmap may not reflect the actual block states (for example, after block evictions).

[Figure: zone bitmap over blocks 0x480, 0x4c0, 0x500, 0x540, 0x580, …, 0x9c0, 0xa00, 0xa40.
Bitmap: … … 2 2 2 2 0 … … 2 1 1 (one block annotated "Access")
Bitmap: … … 2 2 2 "2" 0 … … 2 1 1]

The bitmap indicates "Access" for a block that was evicted previously. AMPM cannot prefetch this block, because it treats the block as accessed and assumes it remains in the cache. The prefetch opportunity is lost.


Proposed Optimizations

Common Offset Table (COT)

– Record the most frequently accessed offsets across different pages.
– Update on every demand access.
– Initiate prefetches from the COT only when the COT achieves high accuracy.

[Figure: access map page 1, access map page 2, and the Common Offset Table, with fields labeled Pref, Counter, Offset and LRU. Bitmap rows shown: … … 1 2 2 1 0 … …; … … 0 1 2 2 0 … …; … … 0 1 2 1 0 … …]
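The three COT bullets can be sketched as a small model. Everything beyond those bullets is an assumption made for illustration: the single table of per-offset saturating counters, the `min_accuracy` threshold, and the way accuracy is tracked (by recording whether earlier COT predictions were later demanded). The slides do not specify these details.

```python
class CommonOffsetTable:
    """Hypothetical model of a COT: per-offset counters shared across pages."""

    def __init__(self, offsets=64, counter_max=63, min_accuracy=0.5):
        self.counts = [0] * offsets       # how often each offset was demanded
        self.counter_max = counter_max    # 6-bit saturating counter (assumed)
        self.min_accuracy = min_accuracy  # assumed accuracy threshold
        self.predicted = 0                # COT predictions checked so far
        self.useful = 0                   # predictions later hit by a demand

    def on_demand_access(self, offset):
        # Updated on every demand access, regardless of the page.
        if self.counts[offset] < self.counter_max:
            self.counts[offset] += 1

    def record_outcome(self, was_useful):
        # Called once per (issued or shadow) prediction to track accuracy.
        self.predicted += 1
        self.useful += int(was_useful)

    def accuracy(self):
        return self.useful / self.predicted if self.predicted else 0.0

    def candidates(self, top_n=2):
        # Only initiate prefetches when the COT has proven accurate enough.
        if self.accuracy() < self.min_accuracy:
            return []
        ranked = sorted(range(len(self.counts)),
                        key=lambda o: (-self.counts[o], o))
        return [o for o in ranked[:top_n] if self.counts[o] > 0]
```

A newly touched (cold) page could then be prefetched at the offsets returned by `candidates()`, which is one way to address the cold-zone problem before any per-page pattern exists.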


Proposed Optimizations (Cont.)

Conflict Table

– Record how inaccurate the current access-map information is.
– Each table entry corresponds to one page.
– The entry counter is incremented when an inaccuracy is detected.
– The entry counter is reset when the page is evicted.

[Figure: a cache miss on an access map page (… … 0 1 2 2 0 … …) updates the Conflict Table; the counters 3, 1, 7, … …, 4 become 3, 1, 8, … …, 4.]
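A minimal sketch of the counter behavior described above. It assumes that an "inaccuracy" means a cache miss on a block whose access-map state is Access (the case shown on the earlier "Inaccurate States in Zone" slide) and that counters saturate at 6 bits; the class and method names are hypothetical, and the slides do not say how the counter value is consumed.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2  # access-map state encodings

class ConflictTable:
    """Hypothetical model: one saturating counter per tracked page."""

    def __init__(self, counter_max=63):
        self.counters = {}              # page id -> inaccuracy count
        self.counter_max = counter_max  # 6-bit counter (assumed saturating)

    def on_cache_miss(self, page, access_map, offset):
        # Assumed inaccuracy: the map says ACCESS, yet the block missed,
        # so it must have been evicted without the map noticing.
        if access_map[offset] == ACCESS:
            count = self.counters.get(page, 0)
            self.counters[page] = min(count + 1, self.counter_max)
            return True
        return False

    def on_page_evicted(self, page):
        # The counter is reset when the page leaves the access map table.
        self.counters[page] = 0
```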


Configurable Cache Line Sizing

• A block size monitor selects the best block size for the LLC.
• Block size selection algorithm (considers both bandwidth and performance):
  – Score = hit - A * (access - hit) * block_size
• The selected block size is used to guide the LLC prefetch.
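The score formula can be evaluated per candidate block size and the highest score selected. The sketch below assumes the monitor supplies a hit count and an access count for each candidate size; the function names, the example statistics and the values of the weight `a` are hypothetical.

```python
def block_size_score(hits, accesses, block_size, a):
    """Score = hit - A * (access - hit) * block_size, as on the slide."""
    misses = accesses - hits
    return hits - a * misses * block_size

def select_block_size(stats, a):
    """Pick the block size with the highest score.

    stats maps block_size -> (hits, accesses); the numbers used in the
    usage lines below are made up for illustration only.
    """
    return max(stats, key=lambda bs: block_size_score(stats[bs][0],
                                                      stats[bs][1], bs, a))

example_stats = {64: (700, 1000), 128: (800, 1000)}
small_penalty_choice = select_block_size(example_stats, a=0.001)
large_penalty_choice = select_block_size(example_stats, a=0.02)
```

The miss term grows with the block size, so a larger `a` penalizes the bandwidth cost of fetching big blocks: in this example the small weight favors the 128-unit block and the large weight favors the 64-unit block.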


Two-Level Prefetching

• Specific to the DPC2 framework.
• The "Prefetch" state in the access map is split into "L2 Prefetch" and "LLC Prefetch".
• The main goal is to hide the long main memory latency; the secondary goal is to hide the LLC latency.
• During prefetch candidate selection, we first choose blocks that have not been prefetched. If these candidates do not fill the prefetch degree, we choose blocks in the "LLC Prefetch" state and transfer them into the L2 cache.
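The two-step candidate selection can be sketched as follows. The numeric state encodings, the function name and the action labels are hypothetical, and the sketch does not decide which cache level a not-yet-prefetched block is brought into, since the slide does not state it.

```python
INIT, L2_PREFETCH, LLC_PREFETCH, ACCESS = range(4)  # assumed encodings

def select_prefetches(candidates, state, degree):
    """Choose up to `degree` prefetch actions from pattern-matched candidates.

    Step 1: prefer blocks that have not been prefetched at all (hides the
            main memory latency).
    Step 2: if slots remain, promote blocks already in the LLC into the L2
            cache (hides the LLC latency).
    """
    fresh = [b for b in candidates if state[b] == INIT]
    chosen = [(b, "fetch_from_memory") for b in fresh[:degree]]
    remaining = degree - len(chosen)
    if remaining > 0:
        in_llc = [b for b in candidates if state[b] == LLC_PREFETCH]
        chosen += [(b, "promote_to_l2") for b in in_llc[:remaining]]
    return chosen
```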


Hardware Overhead

Component                 Fields                                                          Entries       Storage
Memory Access Map Table   Address tag (64 b), LRU (6 b), access map (3*64 b)              64 entries    2.047 KB
CBS monitor               ATD                                                             4 ATD         2.872 KB
Common Offset Table       Counter (6 b), LRU status (6 b), offset map (64*6 b + 64*1 b)   8 entries     0.45 KB
Conflict Table            Counter (6 b)                                                   64 entries    0.046 KB
Prefetch Bit              Prefetch (1 b)                                                  4096 blocks   0.5 KB
Cold Zone MSHR            Tags (64 b), LRU status (5 b)                                   32 entries    0.27 KB
Total                                                                                                   6.185 KB
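The per-component storage figures follow from bits per entry times the number of entries. The check below assumes 1 KB = 1024 bytes, which reproduces the listed values; the CBS monitor figure is taken from the table as-is because its per-entry layout is not given.

```python
def kib(bits_per_entry, entries):
    """Storage in KB, assuming 8 bits per byte and 1024 bytes per KB."""
    return bits_per_entry * entries / 8 / 1024

components = {
    "Memory Access Map Table": kib(64 + 6 + 3 * 64, 64),     # about 2.047 KB
    "Common Offset Table": kib(6 + 6 + 64 * 6 + 64 * 1, 8),  # about 0.45 KB
    "Conflict Table": kib(6, 64),                            # about 0.047 KB
    "Prefetch Bit": kib(1, 4096),                            # 0.5 KB
    "Cold Zone MSHR": kib(64 + 5, 32),                       # about 0.27 KB
}
CBS_MONITOR_KB = 2.872  # copied from the table, not derived

total_kb = sum(components.values()) + CBS_MONITOR_KB
```

The computed total is within rounding of the 6.185 KB reported in the table.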


Experimental Results

• The optimized prefetcher outperforms the baseline without prefetching by 10.8%.
• Compared with the original AMPM, it achieves an average speedup of 0.76%.


Conclusions

• We optimize the AMPM prefetcher by introducing two hardware components: a common offset table and a conflict table.
• We combine the AMPM prefetcher with configurable block sizing and a two-level prefetching mechanism.


Questions

