Memory Hierarchy
Maurizio Palesi
References
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann. Chapter 5.
Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM memory gap (latency). Performance (log scale, 1 to 1000) vs. year, 1980-2000. Processor performance improves ~60%/year (2x every 1.5 years, "Moore's Law"), while DRAM performance improves only ~9%/year (2x every 10 years). The resulting processor-memory performance gap grows ~50% per year.]
Levels of the Memory Hierarchy
  Level          Capacity     Access time             Cost                        Unit transferred from level below (managed by)
  Registers      100s bytes   < 10s of ns             --                          instr. operands, 1-8 bytes (prog./compiler)
  Cache          KBytes       10-100 ns               1-0.1 cents/bit             blocks, 8-128 bytes (cache controller)
  Main memory    MBytes       200-500 ns              0.0001-0.00001 cents/bit    pages, 512 B-4 KB (OS)
  Disk           GBytes       10 ms (10,000,000 ns)   10^-5-10^-6 cents/bit       files, MBytes (user/operator)
  Tape           "infinite"   sec-min                 10^-8 cents/bit             --

Moving up the hierarchy the levels get faster; moving down they get larger (and cheaper per bit).
What is a Cache?
Small, fast storage used to improve the average access time to slow memory. It exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
- Registers: a cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
- TLB: a cache on the page table
- Branch prediction: a cache on prediction information?
The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
For the last 20 years, hardware has relied on locality for speed.
Exploit Locality
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.
General Principles
Locality:
- Temporal locality: referenced items tend to be referenced again soon.
- Spatial locality: nearby items tend to be referenced soon.
Locality + "smaller hardware is faster" = memory hierarchy.
- Levels: each level is smaller, faster, and more expensive per byte than the level below it.
- Inclusive: data found in the top level is also found in the levels below.
Definitions:
- The upper level is the one closer to the processor.
- A block is the minimum unit of data that is either present or not present in the upper level.
- Address = block frame address + block offset address.
Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: block X).
- Hit rate: the fraction of memory accesses found in the upper level.
- Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
Miss: the data must be retrieved from a block in the lower level (block Y).
- Miss rate = 1 - (hit rate).
- Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Hit time << miss penalty (on the order of 500 instructions on the Alpha 21264!)
[Figure: block X in the upper-level memory and block Y in the lower-level memory, with data moving to/from the processor.]
Cache Measures
Average memory-access time = Hit time + Miss rate x Miss penalty [ns or clocks]
Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU.
- Access time: time to reach the lower level = f(latency to lower level).
- Transfer time: time to transfer the block = f(bandwidth between upper and lower levels).
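As a minimal illustration of the formula above, the sketch below computes the average memory-access time for a single cache level; the numbers in the usage example are assumed, not measured.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate * miss penalty.
    All times must be in the same unit (ns or clock cycles)."""
    return hit_time + miss_rate * miss_penalty

# Assumed values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # -> 6.0 cycles
```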
Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill up the block.
- If the block size is too big relative to the cache size, the miss rate goes up: there are too few cache blocks.
In general, Average access time = Hit time + Miss penalty x Miss rate.
[Figure: as block size grows, the miss penalty increases; the miss rate first drops (exploiting spatial locality) and then rises (too few blocks compromises temporal locality); the average access time therefore reaches a minimum and then increases due to the larger miss penalty and miss rate.]
4 Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, or direct mapped.
- Q2: How is a block found if it is in the upper level? (Block identification) Tag/block.
- Q3: Which block should be replaced on a miss? (Block replacement) Random or LRU.
- Q4: What happens on a write? (Write strategy) Write back or write through (with a write buffer).
Q1: Where Can a Block Be Placed in the Upper Level?
- Fully associative: block 12 can go anywhere in the cache.
- Direct mapped: block 12 can go only into cache block 4 (12 mod 8).
- Set associative: block 12 can go anywhere in set 0 (12 mod 4).
[Figure: an 8-block cache shown as fully associative, direct mapped, and 2-way set associative (sets 0-3), plus a 32-block main memory (block frame addresses 0-31); memory block 12 is placed according to each policy.]
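A small sketch of the three placement policies above: it maps a memory block number to the cache set it may occupy, reproducing the block-12 example from the figure (function and parameter names are illustrative).

```python
def candidate_set(block_addr, num_cache_blocks, assoc):
    """Return the set index a memory block maps to.
    assoc = 1                 -> direct mapped (one block per set)
    assoc = num_cache_blocks  -> fully associative (a single set)
    otherwise                 -> set associative."""
    num_sets = num_cache_blocks // assoc
    return block_addr % num_sets

# 8-block cache, memory block 12 (as in the figure above)
print(candidate_set(12, 8, assoc=1))  # direct mapped: cache block 4
print(candidate_set(12, 8, assoc=2))  # 2-way set associative: set 0
print(candidate_set(12, 8, assoc=8))  # fully associative: set 0 (the only set)
```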
Q2: How Is a Block Found If It Is in the Upper Level?
- A tag is kept on each block: there is no need to check the index or the block offset.
- Increasing associativity shrinks the index and expands the tag.
- Fully associative: no index. Direct mapped: large index.
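To make the index/tag tradeoff concrete, here is a sketch that splits an address into tag, index, and block offset fields for a given cache geometry; the 4 KB cache with 16-byte blocks in the example is an assumed geometry for illustration.

```python
def split_address(addr, cache_bytes, block_bytes, assoc):
    """Split an address into (tag, index, offset) fields.
    Higher associativity -> fewer sets -> fewer index bits -> more tag bits.
    Sizes are assumed to be powers of two."""
    num_sets = cache_bytes // (block_bytes * assoc)
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Assumed geometry: 4 KB cache, 16-byte blocks
print(split_address(0x1234, 4096, 16, assoc=1))  # direct mapped: 8 index bits
print(split_address(0x1234, 4096, 16, assoc=4))  # 4-way: 6 index bits, larger tag
```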
Cache Direct Mapped
[Figure: a 32-block main memory (addresses 00000-11111, i.e. 0-31) divided into four partitions of eight blocks each; every partition maps onto the same eight blocks (0-7) of a direct-mapped cache. A tag stored with each cache block identifies which partition the block it currently holds came from.]
Q3: Which Block Should Be Replaced on a Miss?
- Easy for direct mapped: there is only one candidate.
- Set associative or fully associative:
  - Random (used with large associativities)
  - LRU (used with smaller associativities)

Miss rates, LRU vs. random replacement:

  Associativity:       2-way            4-way            8-way
  Size        LRU      Random   LRU     Random   LRU     Random
  16 KB       5.2%     5.7%     4.7%    5.3%     4.4%    5.0%
  64 KB       1.9%     2.0%     1.5%    1.7%     1.4%    1.5%
  256 KB      1.15%    1.17%    1.13%   1.13%    1.12%   1.12%
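A minimal sketch of the two replacement policies named above, choosing a victim way within one set; it is illustrative only, not how real hardware tracks LRU state.

```python
import random

def choose_victim_random(num_ways):
    """Random replacement: pick any way in the set."""
    return random.randrange(num_ways)

def choose_victim_lru(last_used):
    """LRU replacement: evict the way whose last access is oldest.
    last_used[i] holds the time of the most recent access to way i."""
    return min(range(len(last_used)), key=lambda i: last_used[i])

# Example: a 4-way set where way 2 was touched longest ago (assumed timestamps)
last_used = [10, 7, 3, 9]
print(choose_victim_lru(last_used))      # -> 2
print(choose_victim_random(len(last_used)))  # -> any of 0..3
```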
Q4: What Happens on a Write?
- Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. Is the block clean or dirty?
Pros and cons of each:
- WT: read misses cannot result in writes (caused by replacements).
- WB: repeated writes to the same block do not all reach memory.
WT is always combined with a write buffer so that the processor does not wait for the lower-level memory.
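The sketch below contrasts the two policies on a store, using a dirty set for write back; it is a simplified model under assumptions (block granularity elided, no write-allocate decision shown).

```python
def store_write_through(cache, memory, addr, value):
    """Write through: a store updates the cache and the lower-level memory."""
    cache[addr] = value
    memory[addr] = value          # in practice buffered by a write buffer

def store_write_back(cache, dirty, addr, value):
    """Write back: a store updates only the cache and marks the block dirty."""
    cache[addr] = value
    dirty.add(addr)               # memory is updated only when the block is evicted

def evict_write_back(cache, dirty, memory, addr):
    """On replacement, a dirty block is written back to memory exactly once."""
    if addr in dirty:
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]
```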
Write Buffer for Write Through
A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and into the write buffer.
- Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO:
- Typical number of entries: 4.
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
The memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle.
- Write buffer saturation.
[Figure: the processor writes into the cache and the write buffer; the write buffer drains into DRAM.]
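A toy model of the FIFO write buffer described above; the 4-entry depth matches the slide, while the "stall on saturation" behavior is an assumption of this sketch.

```python
from collections import deque

class WriteBuffer:
    """FIFO write buffer between a write-through cache and DRAM."""
    def __init__(self, entries=4):
        self.entries = entries
        self.fifo = deque()

    def store(self, addr, value):
        """Processor side: returns True if the store was buffered,
        False if the buffer is saturated and the processor must stall."""
        if len(self.fifo) >= self.entries:
            return False                    # write buffer saturation
        self.fifo.append((addr, value))
        return True

    def drain_one(self, dram):
        """Memory-controller side: retire one buffered write per DRAM write cycle."""
        if self.fifo:
            addr, value = self.fifo.popleft()
            dram[addr] = value
```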
How a Block Is Found in Cache
[Figure: direct-mapped lookup. The CPU address is split into tag, index, and offset; the index selects one cache entry, the stored cache tag is compared against the address tag to produce hit/miss, and the selected cache data is driven onto the CPU data bus.]
How a Block Is Found in Cache
[Figure: 2-way set-associative lookup. Two sets of address tags and data RAM are read in parallel using the address (index) bits; the two tag comparisons drive a 2:1 multiplexer that selects the data from the matching way.]
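A software sketch of the two-way lookup just described: both ways of the selected set are probed and the matching way's data is returned (a stand-in for the 2:1 mux). The data structures and the 16-byte block size are illustrative assumptions.

```python
def lookup_2way(sets, addr, block_bytes=16):
    """2-way set-associative lookup.
    sets[index] is a list of two ways, each a dict {'valid', 'tag', 'data'}."""
    num_sets = len(sets)
    offset = addr % block_bytes
    index = (addr // block_bytes) % num_sets
    tag = addr // (block_bytes * num_sets)

    for way in sets[index]:                       # both tags compared "in parallel"
        if way['valid'] and way['tag'] == tag:    # one comparator per way
            return way['data'][offset]            # 2:1 mux selects the hit way
    return None                                   # miss
```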
Reducing Misses
Classifying misses: the 3 Cs
- Compulsory: the first access to a block cannot find it in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses.
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved.
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses.
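This classification can be approximated in software: a miss to a never-seen block is compulsory; a miss that a fully associative LRU cache of the same capacity would also take is a capacity miss; whatever a real set-associative or direct-mapped cache adds on top is conflict. The sketch below follows that assumption.

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    """Classify misses of a trace of block addresses against a fully
    associative LRU cache of `num_blocks` blocks (3 Cs model).
    Conflict misses = misses of the real cache minus the misses counted here."""
    seen = set()
    fa_cache = OrderedDict()            # fully associative, kept in LRU order
    counts = {'compulsory': 0, 'capacity': 0, 'hit': 0}

    for block in trace:
        if block in fa_cache:
            fa_cache.move_to_end(block)            # refresh LRU position
            counts['hit'] += 1
        else:
            counts['compulsory' if block not in seen else 'capacity'] += 1
            if len(fa_cache) >= num_blocks:
                fa_cache.popitem(last=False)       # evict LRU block
            fa_cache[block] = True
        seen.add(block)
    return counts

# Tiny assumed trace with a 2-block cache
print(classify_misses([0, 1, 2, 0, 1, 2], num_blocks=2))
```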
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way caches, with the capacity and compulsory components marked.]
2:1 Cache Rule
Miss rate of a 1-way (direct-mapped) cache of size X is approximately equal to the miss rate of a 2-way set-associative cache of size X/2.
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way caches, with the capacity and compulsory components marked; illustrates the 2:1 rule.]
3Cs Relative Miss Rate
[Figure: relative miss rate per type (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way caches, showing the shares of conflict, capacity, and compulsory misses.]