Transcript
Page 1: A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

$N_{\text{Off-Chip}}$: # of requests already waiting for the same bank in the off-chip memory

$L_{\text{Off-Chip}}$: typical latency of one off-chip memory request, excluding queuing delays

$E_{\text{Off-Chip}} = N_{\text{Off-Chip}} \times L_{\text{Off-Chip}}$ (total expected queuing delay for off-chip)

$N_{\text{DRAM\_Cache}}$: # of requests already waiting for the same bank in the DRAM cache

$L_{\text{DRAM\_Cache}}$: typical latency of one DRAM cache request, excluding queuing delays

$E_{\text{DRAM\_Cache}} = N_{\text{DRAM\_Cache}} \times L_{\text{DRAM\_Cache}}$ (total expected queuing delay for the DRAM cache)

If $E_{\text{Off-Chip}} < E_{\text{DRAM\_Cache}}$: send the request to off-chip memory
If $E_{\text{Off-Chip}} \ge E_{\text{DRAM\_Cache}}$: send the request to the DRAM$

Self-Balancing Dispatch (SBD)
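As a concrete illustration of the dispatch rule above, here is a minimal C++ sketch, assuming per-bank occupancy counts are available to the dispatcher; the names and latency constants are assumptions for illustration, not values from the paper.

    #include <cstdint>

    // Illustrative service latencies in cycles, excluding queuing delays.
    // These numbers are assumptions, not measurements from the paper.
    constexpr uint32_t kLatencyOffChip   = 120;  // one off-chip DRAM access
    constexpr uint32_t kLatencyDramCache = 60;   // one stacked DRAM$ access

    enum class Destination { OffChip, DramCache };

    // Self-Balancing Dispatch: estimate the total expected queuing delay of
    // each memory source for the target bank (E = N x L) and steer the
    // clean, predicted-hit request to the cheaper source.
    Destination SelfBalancingDispatch(uint32_t nOffChip, uint32_t nDramCache) {
        uint64_t eOffChip   = static_cast<uint64_t>(nOffChip)   * kLatencyOffChip;
        uint64_t eDramCache = static_cast<uint64_t>(nDramCache) * kLatencyDramCache;
        // E_Off-Chip < E_DRAM_Cache: off-chip wins; ties go to the DRAM$.
        return (eOffChip < eDramCache) ? Destination::OffChip
                                       : Destination::DramCache;
    }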

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Motivation:
• The cache line tracking structure (MissMap) used to avoid a DRAM cache access on a miss is expensive
• Applying a conventional cache organization to DRAM caches leaves the aggregate system bandwidth under-utilized
• Dirty data in DRAM caches severely restricts the effectiveness of speculative techniques

The 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)

Jaewoong Sim, Hyesoon Kim, Gabriel H. Loh, Mike O'Connor, Mithuna Thottethodi

[Overview diagram: Memory Request → HMP/SBD; the DiRT's output (Clean or Dirty) is input to the HMP and SBD.]

Our Solution: HMP + SBD + DiRT
• Use a low-cost hit-miss predictor to avoid a DRAM cache access on a miss (HMP)
• Steer hit requests to either the DRAM cache or off-chip memory based on the expected latency of both memory sources (SBD)
• Maintain a mostly-clean DRAM cache via region-based WT/WB to guarantee the cleanliness of a memory request (DiRT)

Hit/Miss phases within a 4KB region

Die-Stacked DRAM Cache

Hit-Miss Predictor (HMP)
• Base Predictor (Default) => Prediction for 4MB regions
• 2nd-Level Table (On Tag Matching) => Prediction for 256KB regions
• 3rd-Level Table (On Tag Matching) => Prediction for 4KB regions
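To make the three-level structure concrete, here is a minimal sketch of a multi-granular hit-miss predictor, assuming 2-bit saturating counters, XOR-folded hashing, small table sizes, and a simplified allocation policy; all of these parameters are assumptions, not the paper's configuration.

    #include <array>
    #include <cstdint>

    // Multi-granular hit-miss predictor sketch. Each level holds saturating
    // counters indexed by a hash of the region base address; the 2nd and 3rd
    // levels are tagged and override coarser levels on a tag match.
    struct HmpEntry { uint16_t tag = 0; uint8_t counter = 0; bool valid = false; };

    class MultiGranularHmp {
        static constexpr int kEntries = 256;
        std::array<uint8_t, kEntries>  base_;  // 4MB regions (22 offset bits)
        std::array<HmpEntry, kEntries> mid_;   // 256KB regions (18 bits), tagged
        std::array<HmpEntry, kEntries> fine_;  // 4KB regions (12 bits), tagged

        static uint32_t Index(uint64_t addr, int regionBits) {
            uint64_t region = addr >> regionBits;
            return static_cast<uint32_t>((region ^ (region >> 8)) % kEntries);
        }
        static uint16_t Tag(uint64_t addr, int regionBits) {
            return static_cast<uint16_t>((addr >> regionBits) & 0xFFFF);
        }

    public:
        MultiGranularHmp() { base_.fill(1); }  // start weakly predicting miss

        // Use the finest level whose tag matches; fall back to the base.
        bool PredictHit(uint64_t addr) const {
            const HmpEntry& f = fine_[Index(addr, 12)];
            if (f.valid && f.tag == Tag(addr, 12)) return f.counter >= 2;
            const HmpEntry& m = mid_[Index(addr, 18)];
            if (m.valid && m.tag == Tag(addr, 18)) return m.counter >= 2;
            return base_[Index(addr, 22)] >= 2;
        }

        // Train every level with the actual outcome (saturating at 0 and 3).
        void Update(uint64_t addr, bool wasHit) {
            auto bump = [wasHit](uint8_t& c) {
                if (wasHit) { if (c < 3) ++c; } else { if (c > 0) --c; }
            };
            bump(base_[Index(addr, 22)]);
            for (std::array<HmpEntry, kEntries>* level : {&mid_, &fine_}) {
                int bits = (level == &mid_) ? 18 : 12;
                HmpEntry& e = (*level)[Index(addr, bits)];
                if (!e.valid || e.tag != Tag(addr, bits))
                    e = {Tag(addr, bits), static_cast<uint8_t>(wasHit ? 2 : 1), true};
                else
                    bump(e.counter);
            }
        }
    };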

Dirty Region Tracker (DiRT)
• Region-Based WT/WB policy

Self-Balancing Dispatch (SBD)
• Steer hit requests to a DRAM cache or off-chip memory based on the expected latency of both memory sources

Putting It All Together

• HMP, SBD, and the DiRT can be accessed in parallel

• Based on the outcomes of the mechanisms, memory requests are sent to either DRAM$ or off-chip DRAM
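A sequential sketch of the combined decision, reusing Destination and SelfBalancingDispatch from the SBD sketch above; in hardware the three lookups proceed in parallel.

    // Combined decision, written sequentially for clarity. predictedHit
    // comes from the HMP, isDirtyRegion from the DiRT; the per-bank
    // occupancy counts feed the SBD comparison.
    Destination Dispatch(bool predictedHit, bool isDirtyRegion,
                         uint32_t nOffChip, uint32_t nDramCache) {
        if (!predictedHit)
            return Destination::OffChip;    // predicted miss: skip the DRAM$
        if (isDirtyRegion)
            return Destination::DramCache;  // dirty: DRAM$ may hold the only copy
        return SelfBalancingDispatch(nOffChip, nDramCache);  // clean hit: balance
    }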


[Figure: #Lines installed in the cache for a page vs. #Accesses to the page, alternating Miss Phase / Hit Phase regions; the installed-line count increases on misses and stays flat on hits.]

95%+ prediction accuracy at a less-than-1KB cost

[Flowchart: Mechanism. START → HMP: Predicted Hit? NO → DRAM queue (off-chip). YES → DiRT: Dirty Request? YES → DRAM$ queue. NO → SBD: E(DRAM$) < E(DRAM)? YES → DRAM$ queue; NO → DRAM queue.]

[Figure: Speedup over no DRAM cache (y-axis 0.8–1.6) for WL-1 through WL-10 and AVG., comparing MM, HMP, HMP + DiRT, and HMP + DiRT + SBD.]

[Diagram: Algorithm: Self-Balancing Dispatch. Another hit request can be steered to the request buffer of either the stacked DRAM$ or the off-chip DRAM.]

[Figure: 0%–100% across WL-1 through WL-10; series: DiRT, CLEAN.]

[Diagram: per-4KB-page (or segment) write counts, e.g. 2 0 0 9 0 8 0 1; write-intensive pages are handled Write-Back while the rest are Write-Through, so their data is guaranteed Clean!]

• DiRT = Counting Bloom Filters (CBFs) + Dirty List
• CBFs: Track the number of writes to different pages
• Dirty List: Record the most write-intensive pages

[Diagram: a memory request's TAG probes the Counting Bloom Filters; pages that cross the write threshold are installed in the Dirty List, a tagged set-associative structure (sets A and B) with NRU replacement.]

Write-back policy is applied to the pages in the Dirty List
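A minimal sketch of that flow, assuming two hash functions, 8-bit counters, and a write threshold of 16; a std::unordered_set stands in for the tagged NRU sets, and all sizes are assumptions for illustration.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <unordered_set>

    // DiRT sketch: counting Bloom filters (CBFs) tally writes per 4KB page;
    // a page whose estimated count crosses the threshold is installed in the
    // Dirty List and handled write-back; everything else stays write-through
    // (and is therefore guaranteed clean).
    class DirtyRegionTracker {
        static constexpr int     kFilterSize = 4096;
        static constexpr uint8_t kThreshold  = 16;
        std::array<uint8_t, kFilterSize> cbf0_{}, cbf1_{};
        std::unordered_set<uint64_t> dirtyList_;  // stand-in for NRU sets A/B

        static uint64_t Page(uint64_t addr) { return addr >> 12; }  // 4KB page
        static uint32_t H0(uint64_t p) {
            return static_cast<uint32_t>(p % kFilterSize);
        }
        static uint32_t H1(uint64_t p) {
            return static_cast<uint32_t>((p * 0x9E3779B97F4A7C15ULL >> 32) % kFilterSize);
        }

    public:
        // On each write, bump both counters; the page's estimated write count
        // is the minimum over its counters (the standard CBF query).
        void ObserveWrite(uint64_t addr) {
            uint64_t p = Page(addr);
            uint8_t& c0 = cbf0_[H0(p)];
            uint8_t& c1 = cbf1_[H1(p)];
            if (c0 < 255) ++c0;
            if (c1 < 255) ++c1;
            if (std::min(c0, c1) >= kThreshold)
                dirtyList_.insert(p);  // write-intensive: page becomes write-back
        }

        // A request is guaranteed clean iff its page is not in the Dirty List.
        bool IsDirtyRegion(uint64_t addr) const {
            return dirtyList_.count(Page(addr)) != 0;
        }
    };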

[Figure: Percentage of writebacks to DRAM (0%–100%) for WL-1 through WL-10 under DiRT, WB, and WT.]

[Diagram: DRAM bank with row decoder and sense amplifier; a 2KB row holds 32 blocks of 64B lines, organized as 3 tag blocks + 29 data blocks.]

MissMap: 4MB for a 1GB DRAM$! Where to architect this?
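As a rough sanity check on that cost (the entry layout below is an assumption for illustration): each MissMap entry covers one 4KB segment of 64 cache lines, so it needs a 64-bit presence vector plus a tag of roughly 36 bits:

    $64 \text{ (presence vector)} + {\sim}36 \text{ (tag)} \approx 100 \text{ bits} \approx 12.5\,\text{B per entry}$
    $4\,\text{MB} \div 12.5\,\text{B} \approx 335\text{K entries}; \quad 335\text{K} \times 4\,\text{KB} \approx 1.3\,\text{GB tracked}$

which is in the right ballpark for the "4MB per 1GB DRAM$" figure above.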

Should we send this to the DRAM cache?

[Figure: For WL-1 through WL-10, breakdown of requests (0%–100%) into PH (To DRAM$), PH (To DRAM), and Predicted Miss.]

[Image: processor die with die-stacked DRAM built in the same tech/logic (DRAM stack). Credit: IBM]


Hundreds of MBs of On-Chip Stacked DRAM!!

• Two main usages:
  • Use it as main memory
  • Use it as a large cache (DRAM cache)

This work is about the DRAM cache!

HMPMG: Multi-Granular Hit-Miss Predictor

DRAM Cache Organization - Loh and Hill [MICRO’11]

• Break up memory space into coarse-grained regions (e.g. 4KB)

• Index into the HMPregion with a hash of the region’s base address
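As a toy illustration of that indexing step (the XOR-fold hash and caller-supplied table size are assumptions, mirroring the helper in the HMP sketch earlier):

    #include <cstdint>

    // Form the 4KB region's base address and hash it into a table index.
    uint32_t HmpRegionIndex(uint64_t addr, uint32_t tableSize) {
        uint64_t region = (addr & ~0xFFFULL) >> 12;  // 4KB region number
        return static_cast<uint32_t>((region ^ (region >> 16)) % tableSize);
    }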

HMPregion: Region-Based Hit-Miss Predictor

MissMap: provides the cache line existence info!
