N_Off-Chip: number of requests already waiting for the same bank in the off-chip memory
L_Off-Chip: typical latency of one off-chip memory request, excluding queuing delays
E_Off-Chip: N_Off-Chip × L_Off-Chip (total expected queuing delay for off-chip)
N_DRAM_Cache: number of requests already waiting for the same bank in the DRAM cache
L_DRAM_Cache: typical latency of one DRAM cache request, excluding queuing delays
E_DRAM_Cache: N_DRAM_Cache × L_DRAM_Cache (total expected queuing delay for DRAM cache)
E_Off-Chip < E_DRAM_Cache: send request to off-chip
E_Off-Chip ≥ E_DRAM_Cache: send request to DRAM$
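The dispatch rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the latency values in the usage lines are hypothetical, and only the comparison itself comes from the definitions above.

```python
def expected_delay(num_waiting, typical_latency):
    """E = N x L: expected queuing delay at one memory source."""
    return num_waiting * typical_latency


def self_balancing_dispatch(n_off_chip, l_off_chip, n_dram_cache, l_dram_cache):
    """Pick the memory source with the lower expected queuing delay.

    Ties go to the DRAM cache, matching the ">=" case in the rule above.
    """
    e_off_chip = expected_delay(n_off_chip, l_off_chip)
    e_dram_cache = expected_delay(n_dram_cache, l_dram_cache)
    return "off-chip" if e_off_chip < e_dram_cache else "DRAM$"


# Hypothetical latencies (arbitrary units), chosen only for illustration.
lightly_loaded_off_chip = self_balancing_dispatch(1, 50, 4, 20)   # 50 vs 80
heavily_loaded_off_chip = self_balancing_dispatch(3, 50, 2, 20)   # 150 vs 40
```

The rule only applies to requests that may safely be served by either source, that is, predicted hits to clean data.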
Self-Balancing Dispatch (SBD)
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
Motivation:
• The cache line tracking structure (MissMap) used to avoid a DRAM cache access on a miss is expensive
• Applying a conventional cache organization to DRAM caches leaves the aggregate system bandwidth under-utilized
• Dirty data in DRAM caches severely restricts the effectiveness of speculative techniques
The 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)
Jaewoong Sim, Hyesoon Kim, Gabriel H. Loh, Mike O'Connor
[Figure: a memory request is checked by DiRT, HMP, and SBD; the DiRT outcome (clean or dirty) is an input to HMP and SBD]
Our Solution: HMP + SBD + DiRT
• Use a low-cost hit-miss predictor to avoid a DRAM cache access on a miss (HMP)
• Steer hit requests to either the DRAM cache or off-chip memory based on the expected latency of both memory sources (SBD)
• Maintain a mostly-clean DRAM cache via a region-based WT/WB policy to guarantee the cleanliness of a memory request (DiRT)
Hit/Miss phases within a 4KB region
Die-Stacked DRAM Cache Hit-Miss Predictor (HMP)
• Base Predictor (Default) => Prediction for 4MB regions
• 2nd-Level Table (On Tag Matching) => Prediction for 256KB regions
• 3rd-Level Table (On Tag Matching) => Prediction for 4KB regions
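One way to read the three levels above is as a lookup that prefers the finest-grained table whose tag matches and falls back to the base predictor otherwise. The sketch below assumes 2-bit saturating counters, unbounded tagged tables (Python dicts), and allocation of a finer-grained entry on a misprediction. None of these details are stated on the poster; all names are hypothetical.

```python
class MultiGranularHMP:
    """Sketch of a multi-granular hit-miss predictor; sizes and policies are assumptions."""

    # log2 of the region size per level: 4MB base, 256KB 2nd level, 4KB 3rd level.
    REGION_SHIFT = (22, 18, 12)

    def __init__(self, base_entries=1024):
        # Base table: untagged 2-bit counters (0-1 predict miss, 2-3 predict hit).
        self.base = [0] * base_entries
        # 2nd- and 3rd-level tables: region number (the "tag") -> 2-bit counter.
        # Real tables would be small and set-associative; dicts keep the sketch short.
        self.tagged = [{}, {}]

    def _lookup(self, addr):
        """Return (level, table, key) for the finest level whose tag matches."""
        for level in (2, 1):
            key = addr >> self.REGION_SHIFT[level]
            if key in self.tagged[level - 1]:
                return level, self.tagged[level - 1], key
        return 0, self.base, (addr >> self.REGION_SHIFT[0]) % len(self.base)

    def predict(self, addr):
        """True means 'predicted hit in the DRAM cache'."""
        _, table, key = self._lookup(addr)
        return table[key] >= 2

    def update(self, addr, was_hit):
        """Train the providing counter; on a misprediction, allocate one level finer."""
        level, table, key = self._lookup(addr)
        mispredicted = (table[key] >= 2) != was_hit
        table[key] = min(3, table[key] + 1) if was_hit else max(0, table[key] - 1)
        if mispredicted and level < 2:
            finer_key = addr >> self.REGION_SHIFT[level + 1]
            # Start the new entry weakly biased toward the observed outcome.
            self.tagged[level][finer_key] = 2 if was_hit else 1


hmp = MultiGranularHMP()
hmp.update(0x12345000, was_hit=True)   # base predictor was wrong: allocate a 256KB entry
```

After the update, the 256KB region containing 0x12345000 is predicted as a hit, while other addresses in the same 4MB region still use the base counter.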
Dirty Region Tracker (DiRT)
• Region-Based WT/WB policy
Self-Balancing Dispatch (SBD)
• Steer hit requests to the DRAM cache or off-chip memory based on the expected latency of both memory sources
Putting It All Together
• HMP, SBD, and the DiRT can be accessed in parallel
• Based on the outcomes of the mechanisms, memory requests are sent to either DRAM$ or off-chip DRAM
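A minimal sketch of how the three outcomes could be combined into one routing decision follows. The ordering of the checks is an assumption inferred from the descriptions of each mechanism: a possibly dirty request must go to the DRAM cache because the off-chip copy may be stale, a predicted miss goes off-chip, and only clean predicted hits are balanced by expected delay. The function name and arguments are hypothetical.

```python
def route_request(is_dirty, predicted_hit, e_dram_cache, e_off_chip):
    """Combine DiRT, HMP, and SBD outcomes into a destination queue.

    is_dirty:      DiRT outcome (True if the request may touch dirty data)
    predicted_hit: HMP outcome
    e_dram_cache,
    e_off_chip:    SBD expected queuing delays for the two memory sources
    """
    if is_dirty:
        # Assumption: only the DRAM cache is guaranteed to hold the latest data.
        return "DRAM$"
    if not predicted_hit:
        # Predicted miss: skip the DRAM cache lookup entirely.
        return "off-chip DRAM"
    # Clean, predicted hit: either source can serve it, so pick the less loaded one.
    return "off-chip DRAM" if e_off_chip < e_dram_cache else "DRAM$"
```

Because the three inputs are independent of each other, they can be computed in parallel and combined afterwards, which is consistent with the first bullet above.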
Mithuna Thottethodi
[Figure: number of lines installed in the cache for a page (y-axis) versus number of accesses to the page (x-axis), alternating between miss phases and hit phases; the curve is increasing on misses and flat on hits]
95%+ prediction accuracy at less than 1KB cost
[Figure: decision flowchart over the three mechanisms (DiRT, HMP, SBD). From START, the checks "Dirty Request?", "Predicted Hit?", and "E(DRAM$) < E(DRAM)" route each request to the DRAM$ Queue or the DRAM Queue]
[Figure: speedup over no DRAM cache for WL-1 through WL-10 and AVG.; configurations: MM, HMP, HMP + DiRT, HMP + DiRT + SBD]
[Figure: another hit request and the request buffers (Req. Buffer) of the stacked DRAM$ and the off-chip DRAM]
Algorithm: Self-Balancing Dispatch
[Figure: per-workload breakdown (0% to 100%) for WL-1 through WL-10; legend: DiRT, CLEAN]
[Figure: 4KB pages (or segments) labeled 2 0 0 9 0 8 0 1; annotations: Write-Back, Write-Through, Clean!]
• DiRT = Counting Bloom Filters (CBFs) + Dirty List
• CBFs: Track the number of writes to different pages
• Dirty List: Record the most write-intensive pages
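[Figure: DiRT structure. A memory request is checked against the Counting Bloom Filters and the Dirty List (sets of TAG entries with NRU bits, SET A and SET B); "Install in Dirty List" links the two]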
Write-back policy is applied to the pages in the Dirty List
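The interplay between the CBFs and the Dirty List can be sketched as follows. This is an illustration under stated assumptions, not the authors' design: the filter size, number of hash functions, promotion threshold, and Dirty List capacity are hypothetical; counters are reset on promotion; and oldest-first eviction stands in for the NRU replacement named on the poster.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4KB pages


class DiRT:
    """Sketch of a Dirty Region Tracker: counting Bloom filters plus a Dirty List."""

    def __init__(self, cbf_size=1024, num_hashes=3, threshold=16, dirty_list_size=4):
        self.cbfs = [[0] * cbf_size for _ in range(num_hashes)]
        self.threshold = threshold
        self.dirty_list = OrderedDict()   # pages currently operated in write-back mode
        self.dirty_list_size = dirty_list_size

    def _slots(self, page):
        # One hypothetical hash per filter; tuples of ints hash deterministically.
        return [hash((i, page)) % len(cbf) for i, cbf in enumerate(self.cbfs)]

    def is_write_back(self, addr):
        """Pages in the Dirty List use write-back; all others use write-through."""
        return (addr >> PAGE_SHIFT) in self.dirty_list

    def record_write(self, addr):
        """Count a write; return a page evicted from the Dirty List, if any.

        An evicted page must have its dirty lines written back before it
        reverts to write-through (not modeled here).
        """
        page = addr >> PAGE_SHIFT
        if page in self.dirty_list:
            return None
        slots = self._slots(page)
        for cbf, slot in zip(self.cbfs, slots):
            cbf[slot] += 1
        if min(cbf[slot] for cbf, slot in zip(self.cbfs, slots)) < self.threshold:
            return None                    # not write-intensive yet: stay write-through
        for cbf, slot in zip(self.cbfs, slots):
            cbf[slot] = 0                  # assumption: reset counters on promotion
        self.dirty_list[page] = True       # install in Dirty List
        if len(self.dirty_list) > self.dirty_list_size:
            victim, _ = self.dirty_list.popitem(last=False)
            return victim
        return None
```

Any request to a page outside the Dirty List is guaranteed clean, which is what lets such requests be sent to either memory source.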
[Figure: percentage of writebacks to DRAM for WL-1 through WL-10; configurations: DiRT, WB, WT]
[Figure: DRAM bank with row decoder and sense amplifier; each DRAM row (2KB, 32 blocks for 64B lines) holds 3 tag blocks and 29 data blocks]
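The row layout numbers can be checked with a few lines of arithmetic. The constant names are hypothetical; the values are the ones given for the DRAM row.

```python
ROW_BYTES = 2 * 1024   # 2KB DRAM row
LINE_BYTES = 64        # 64B cache line
TAG_BLOCKS = 3         # blocks per row reserved for tags

blocks_per_row = ROW_BYTES // LINE_BYTES        # 32 blocks per row
data_blocks = blocks_per_row - TAG_BLOCKS       # 29 data blocks per row
data_fraction = data_blocks / blocks_per_row    # share of each row that holds data
```

So roughly 90.6% of each row stores cache lines, and the remainder stores their tags.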
MissMap
4MB for 1GB DRAM$! Where to architect this?
Should we send this to the DRAM cache?
[Figure: per-workload breakdown (0% to 100%) for WL-1 through WL-10; categories: PH (To DRAM$), PH (To DRAM), Predicted Miss]
[Figure: die-stacking illustration; labels: Same Tech/Logic (DRAM Stack), Processor Die. Credit: IBM]
Hundreds of MBs On-Chip Stacked DRAM!!
• Two main usages
  • Use it as main memory
  • Use it as a large cache (DRAM cache)
This work is about the DRAM cache!
HMPMG: Multi-Granular Hit Miss Predictor
DRAM Cache Organization - Loh and Hill [MICRO’11]
• Break up memory space into coarse-grained regions (e.g. 4KB)
• Index into the HMPregion with a hash of the region’s base address
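The two steps above can be sketched as a small table of counters indexed by a hash of the region's base address. The 2-bit saturating counters, the table size, and the XOR-fold hash are assumptions made for illustration; the class and method names are hypothetical.

```python
REGION_SHIFT = 12  # 4KB regions, as in the example above


class RegionHMP:
    """Sketch of a region-based hit-miss predictor."""

    def __init__(self, num_entries=256):
        # One 2-bit saturating counter per entry: 0-1 predict miss, 2-3 predict hit.
        self.counters = [0] * num_entries

    def _index(self, addr):
        region = addr >> REGION_SHIFT              # identifies the region's base address
        hashed = region ^ (region >> 8)            # hypothetical hash function
        return hashed % len(self.counters)

    def predict(self, addr):
        """True means 'predicted hit in the DRAM cache'."""
        return self.counters[self._index(addr)] >= 2

    def update(self, addr, was_hit):
        """Move the region's counter toward the observed outcome."""
        i = self._index(addr)
        if was_hit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

All cache lines in a region share one counter, so the predictor follows the hit and miss phases of the region as a whole instead of tracking individual lines.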
HMPregion: Region-Based Hit Miss Predictor
Provides the cache line existence info!