Chapter 5: Memory
Ngo Lam Trung
[with materials from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK, and M.J. Irwin's presentation, PSU 2008]
Content
Memory hierarchy
Cache
Virtual memory
Review: Major Components of a Computer
[Figure: the major components – processor (control and datapath), memory, and input/output devices; memory is organized as a hierarchy of cache, main memory, and secondary memory (disk)]
Memory technology
Static RAM (SRAM)
0.5ns – 2.5ns, $2000 – $5000 per GB
Dynamic RAM (DRAM)
50ns – 70ns, $20 – $75 per GB
Magnetic disk
5ms – 20ms, $0.20 – $2 per GB
Fact:
Large memories are slow
Fast memories are small (and expensive)
Ideal memory
Access time of SRAM
Capacity and cost/GB of disk
A Typical Memory Hierarchy
[Figure: on-chip components – RegFile, instruction cache and data cache, ITLB and DTLB, control, and datapath – backed by an off-chip second level cache (SRAM), main memory (DRAM), and secondary memory (disk)]
Speed (# cycles): ½'s, 1's, 10's, 100's, 10,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost: highest to lowest
How to get an ideal memory:
As fast as SRAM
As cheap as disk?
Characteristics of the Memory Hierarchy
[Figure: pyramid with the processor at the top, then L1$, L2$, main memory, and secondary memory; the (relative) size of the memory at each level grows, and access time increases with distance from the processor]
Typical transfer units between levels: 4-8 bytes (word) into the processor, 8-32 bytes (block) between L1$ and L2$, 1 to 4 blocks into main memory, 1,024+ bytes (disk sector = page) from secondary memory
Inclusive – a higher level is a subset of the lower level beneath it
The Memory Hierarchy: Locality Principle
C program:
    int x[1000], temp, i, j;
    for (i = 0; i < 999; i++)
        for (j = i + 1; j < 1000; j++)
            if (x[i] < x[j]) {
                temp = x[i];
                x[i] = x[j];
                x[j] = temp;
            }
The data memory space of x is accessed multiple times
The instruction memory space of the two for loops is used repeatedly
[Figure: CPU – cache – main memory; blocks of instructions/data are transferred between main memory and the cache, while instruction fetches and data accesses hit the cache (multiple times)]
The Memory Hierarchy: Locality Principle
Temporal Locality (locality in time)
If a memory location is referenced, then it will tend to be referenced again soon
Keep most recently accessed data items closer to the processor
Spatial Locality (locality in space)
If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
Move blocks consisting of contiguous words closer to the processor
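As an added, hedged illustration (not from the original slides): the C sketch below shows how traversal order affects spatial locality; the array name and size are arbitrary.

    #include <stdio.h>

    #define N 512
    static int a[N][N];   /* C stores this array row by row (row-major order) */

    /* Good spatial locality: consecutive iterations touch adjacent
       addresses, so most accesses fall in a block already in the cache. */
    long sum_row_major(void) {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive iterations are N*sizeof(int)
       bytes apart, so each access may land in a different cache block. */
    long sum_col_major(void) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }

Both functions compute the same sum; on typical hardware the row-major version runs noticeably faster because it exploits spatial locality.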
Terminology
Block (or line): the minimum unit of information that is present in a cache
Hit: the requested memory item is found in a level of the memory hierarchy
Miss: the requested memory item is not found in a level of the memory hierarchy
Cache
The level of the memory hierarchy between the processor and main memory
The CPU fetches instructions and data from the cache; if found (cache hit), access is fast
If not found (cache miss), the block is loaded from main memory into the cache and then accessed in the cache: the miss penalty
The cache is a small window onto the main memory
[Figure: CPU – cache – main memory; instruction fetches and data accesses go to the cache, and blocks of data move between the cache and main memory]
Cache Basics
Two questions to answer (in hardware):
Q1: How do we know if a data item is in the cache?
Q2: If it is, how do we find it?
Direct mapped
Each memory block is mapped to exactly one block in the cache
- lots of lower level blocks must share blocks in the cache
Address mapping (to answer Q2):
(block address) modulo (# of blocks in the cache)
The tag field: associated with each cache block; contains the address information (the upper portion of the address) required to identify the block (to answer Q1)
The valid bit: indicates whether the block holds valid data or not
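As a hedged sketch (added; the constants are illustrative and match the 1K-block example on a later slide), this C fragment shows how the index and tag are derived from a 32-bit byte address for a direct mapped cache with one-word blocks:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BLOCKS 1024          /* must be a power of two */
    #define INDEX_BITS 10            /* log2(NUM_BLOCKS) */

    int main(void) {
        uint32_t addr = 0x00400124;                  /* example byte address */
        uint32_t block_addr = addr >> 2;             /* drop the 2-bit byte offset */
        uint32_t index = block_addr % NUM_BLOCKS;    /* answers Q2: which cache block */
        uint32_t tag   = block_addr >> INDEX_BITS;   /* answers Q1: compared with the stored tag */
        printf("index = %u, tag = 0x%x\n", index, tag);
        return 0;
    }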
Caching: A Simple First Example
[Figure: a direct mapped cache with 4 one-word blocks (indexes 00-11, each with a valid bit, tag, and data) and a main memory of 16 words with addresses 0000xx-1111xx; the two low order bits define the byte in the word (32b words)]
(block address) modulo (# of blocks in the cache)
Q2: How do we find it? Use the next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)
Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache
Direct Mapped Cache
Consider the main memory word reference string: 0 1 2 3 4 3 4 15
Start with an empty cache – all blocks initially marked as not valid
0 miss, 1 miss, 2 miss, 3 miss (cold start fills blocks 00-11), 4 miss (replaces 0 in block 00), 3 hit, 4 hit, 15 miss (replaces 3 in block 11)
8 requests, 6 misses
What if we repeated such requests 1,000,000 times?
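To make the count concrete, here is a small added C simulation (a sketch, not from the slides) of this direct mapped cache with 4 one-word blocks; it reproduces the 6 misses for the reference string above:

    #include <stdio.h>

    #define NUM_BLOCKS 4

    int main(void) {
        int valid[NUM_BLOCKS] = {0};
        int tag[NUM_BLOCKS] = {0};
        int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};   /* word addresses */
        int n = sizeof refs / sizeof refs[0];
        int misses = 0;

        for (int i = 0; i < n; i++) {
            int index = refs[i] % NUM_BLOCKS;      /* which cache block */
            int t     = refs[i] / NUM_BLOCKS;      /* tag */
            if (!valid[index] || tag[index] != t) {
                misses++;                          /* miss: load the block */
                valid[index] = 1;
                tag[index] = t;
            }
        }
        printf("%d requests, %d misses\n", n, misses);
        return 0;
    }

Swapping refs[] for the alternating string 0 4 0 4 ... shows the ping pong effect discussed a few slides later: every request misses.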
MIPS Direct Mapped Cache Example
One word blocks, cache size = 1K words (or 4KB)
[Figure: the 32-bit address (bits 31..0) is split into a 20-bit tag (bits 31..12), a 10-bit index (bits 11..2), and a 2-bit byte offset (bits 1..0); the index selects one of 1024 (valid, tag, data) entries; the stored 20-bit tag is compared with the address tag, and Hit = valid AND tags match; the 32-bit data word is returned on a hit]
What kind of locality are we taking advantage of?
Multiword Block Direct Mapped Cache
Four words/block, cache size = 1K words
[Figure: the 32-bit address is split into a 20-bit tag (bits 31..12), an 8-bit index (bits 11..4), a 2-bit block offset (bits 3..2), and a 2-bit byte offset (bits 1..0); the index selects one of 256 (valid, tag, 4-word data) entries; the block offset selects the word within the block; Hit = valid AND tags match]
What kind of locality are we taking advantage of?
Taking Advantage of Spatial Locality
Let the cache block hold more than one word (here: two-word blocks)
Consider the same main memory word reference string: 0 1 2 3 4 3 4 15
Start with an empty cache – all blocks initially marked as not valid
0 miss (loads words 1 and 0), 1 hit, 2 miss (loads words 3 and 2), 3 hit, 4 miss (words 5 and 4 replace 1 and 0), 3 hit, 4 hit, 15 miss (words 15 and 14 replace 3 and 2)
8 requests, 4 misses
Miss Rate vs Block Size vs Cache Size
[Figure: miss rate (%) vs. block size (bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB]
Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)
Cache Field Sizes
The number of bits in a cache includes both the storage for data and for the tags
32-bit byte address
For a direct mapped cache with 2^n blocks, n bits are used for the index
For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
What is the size of the tag field? 32 − (n + m + 2)
The total number of bits in a direct mapped cache is then
2^n x (block size + tag field size + valid field size)
How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address?
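A worked answer (added; it follows directly from the formulas above): 16 KB of data = 4K words = 2^12 words; with 4-word blocks that is 2^10 = 1024 blocks, so n = 10 and m = 2. Each entry holds 4 x 32 = 128 data bits, a 32 − 10 − 2 − 2 = 18-bit tag, and 1 valid bit, so the total is 2^10 x (128 + 18 + 1) = 147 Kbits ≈ 18.4 KB – about 1.15 times the size of the data alone.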
Handling Cache Hits
Read hits (I$ and D$)
this is what we want!
Write hits (D$ only)
require the cache and memory to be consistent
- always write the data into both the cache block and the next level in the memory hierarchy (write-through)
- writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a write buffer and stall only if the write buffer is full
allow cache and memory to be inconsistent
- write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is "evicted")
- need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a write buffer to help "buffer" write-backs of dirty blocks
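As an added, hedged illustration of the two write-hit policies (the type and function names are invented for this sketch; a real cache controller is hardware, not C):

    #include <stdint.h>

    typedef struct {
        int      valid;
        int      dirty;    /* used only by the write-back policy */
        uint32_t tag;
        uint32_t data;     /* one-word block for simplicity */
    } cache_line;

    /* Stand-in for a write to the next level of the memory hierarchy
       (in practice buffered by a write buffer). */
    static void next_level_write(uint32_t addr, uint32_t value) {
        (void)addr; (void)value;
    }

    /* Write-through: cache and memory are kept consistent on every write. */
    void write_hit_through(cache_line *line, uint32_t addr, uint32_t value) {
        line->data = value;
        next_level_write(addr, value);
    }

    /* Write-back: only the cache is updated; the dirty bit defers the
       memory update until the block is evicted. */
    void write_hit_back(cache_line *line, uint32_t value) {
        line->data = value;
        line->dirty = 1;
    }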
Sources of Cache Misses
Compulsory (cold start or process migration, first reference):
First access to a block.
We cannot do much about this.
Solution: increase block size (but this also increases miss penalty).
Capacity:
Cache cannot contain all blocks accessed by the program
Solution: increase cache size (may increase access time)
Conflict (collision):
Multiple memory locations mapped to the same cache location
Solution 1: increase cache size
Solution 2: increase associativity (may increase access time)
Handling Cache Misses (Single Word Blocks)
Read misses (I$ and D$)
stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
Write misses (D$ only)
1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume
or
2. Write allocate – just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall
or
3. No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full
Multiword Block Considerations
Read misses (I$ and D$)
Processed the same as for single word blocks – a miss returns the entire block from memory
Miss penalty grows as block size grows
- Early restart – processor resumes execution as soon as the requested word of the block is returned
- Requested word first – the requested word is transferred from the memory to the cache (and processor) first
Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss
Write misses (D$)
If using write allocate, must first fetch the block from memory and then write the word to the block (or we could end up with a "garbled" block in the cache – e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block)
Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways
[Figure: on-chip CPU and cache, connected by a bus (32-bit data and 32-bit address per cycle) to off-chip DRAM memory]
One word wide organization (one word wide bus and one word wide memory)
Assume
1. 1 memory bus clock cycle to send the addr
2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, 4th words (column access time)
3. 1 memory bus clock cycle to return a word of data
Memory-Bus to Cache bandwidth
number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle
One Word Wide Bus, One Word Blocks
[Figure: on-chip CPU and cache, one word wide bus to DRAM memory]
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
1 memory bus clock cycle to send the address
15 memory bus clock cycles to read DRAM
1 memory bus clock cycle to return the data
17 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
4/17 = 0.235 bytes per memory bus clock cycle
One Word Wide Bus, Four Word Blocks
[Figure: on-chip CPU and cache, one word wide bus to DRAM memory; each of the four reads takes 15 cycles]
What if the block size is four words and each word is in a different DRAM row?
1 cycle to send the 1st address
4 x 15 = 60 cycles to read DRAM
1 cycle to return the last data word
62 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/62 = 0.258 bytes per clock
One Word Wide Bus, Four Word Blocks
[Figure: on-chip CPU and cache, one word wide bus to DRAM memory; the first read takes 15 cycles, the next three take 5 cycles each]
What if the block size is four words and all words are in the same DRAM row?
1 cycle to send the 1st address
15 + 3*5 = 30 cycles to read DRAM
1 cycle to return the last data word
32 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/32 = 0.5 bytes per clock
Interleaved Memory, One Word Wide Bus
[Figure: on-chip CPU and cache, one word wide bus to four interleaved DRAM memory banks (bank 0 to bank 3); the four 15-cycle reads overlap]
For a block size of four words:
1 cycle to send the 1st address
15 cycles to read the DRAM banks (in parallel)
4*1 = 4 cycles to return the last data word
20 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/20 = 0.8 bytes per clock
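To tie the three four-word-block organizations together, here is an added C sketch that recomputes the miss penalties and bandwidths above from the stated timing assumptions:

    #include <stdio.h>

    /* Timing assumptions from the slides, in memory bus clock cycles. */
    #define SEND_ADDR    1
    #define ROW_ACCESS  15    /* first word in a DRAM row */
    #define COL_ACCESS   5    /* later words in the same row */
    #define RETURN_WORD  1
    #define WORDS        4    /* words per block */
    #define BYTES       (4 * WORDS)

    static void report(const char *name, int cycles) {
        printf("%-18s %2d cycles, %.3f bytes/cycle\n",
               name, cycles, (double)BYTES / cycles);
    }

    int main(void) {
        /* One word wide bus, each word in a different DRAM row. */
        report("different rows", SEND_ADDR + WORDS * ROW_ACCESS + RETURN_WORD);
        /* One word wide bus, all words in the same DRAM row. */
        report("same row", SEND_ADDR + ROW_ACCESS + (WORDS - 1) * COL_ACCESS + RETURN_WORD);
        /* Four interleaved banks: the reads overlap, the returns do not. */
        report("interleaved banks", SEND_ADDR + ROW_ACCESS + WORDS * RETURN_WORD);
        return 0;
    }

This prints 62, 32, and 20 cycles (0.258, 0.5, and 0.8 bytes/cycle), matching the three slides.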
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
Direct mapped cache: a memory block maps to exactly one cache block
Fully associative cache: allows a memory block to be mapped to any cache block
A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)
(block address) modulo (# sets in the cache)
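A hedged sketch of the set-associative mapping (added; the sizes match the four-way, one-word-block example a few slides later):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SETS 256   /* e.g., 1024 one-word blocks, 4 ways per set */
    #define SET_BITS   8   /* log2(NUM_SETS) */

    int main(void) {
        uint32_t addr = 0x00400128;               /* example byte address */
        uint32_t block_addr = addr >> 2;          /* drop the 2-bit byte offset */
        uint32_t set = block_addr % NUM_SETS;     /* the unique set to search */
        uint32_t tag = block_addr >> SET_BITS;    /* compared against every way in the set */
        printf("set = %u, tag = 0x%x\n", set, tag);
        return 0;
    }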
Another Reference String Mapping
Consider the main memory word reference string: 0 4 0 4 0 4 0 4
Start with an empty cache – all blocks initially marked as not valid
0 miss (loads Mem(0) into block 00), 4 miss (Mem(4) replaces Mem(0)), 0 miss (Mem(0) replaces Mem(4)), 4 miss, 0 miss, 4 miss, 0 miss, 4 miss
8 requests, 8 misses
Ping pong effect due to conflict misses – two memory locations that map into the same cache block
Set Associative Cache Example
[Figure: a 2-way set associative cache with 2 sets (set 0 and set 1), each holding two ways (each with a valid bit, tag, and one-word data block), and a main memory of 16 words with addresses 0000xx-1111xx; the two low order bits define the byte in the word (32b words)]
Q2: How do we find it? Use the next 1 low order memory address bit – the set index – to determine which cache set (i.e., modulo the number of sets in the cache)
Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache
Four-Way Set Associative Cache
2^8 = 256 sets, each with four ways (each with one block)
[Figure: the 32-bit address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset; the index selects one of 256 entries in each of the four ways (Way 0 to Way 3, each an array of (valid, tag, data) entries); all four stored tags are compared with the address tag in parallel, Hit is asserted if any way matches, and a 4x1 select steers the matching way's 32-bit data word to the output]
Range of Set Associative Caches
For a fixed size cache, an increase in the number of blocks per set results in a decrease in the number of sets
[Figure: the address is divided into Tag | Index | Block offset | Byte offset; the tag is used for the tag compare, the index selects the set, and the block offset selects the word in the block. Increasing associativity shrinks the index and grows the tag, up to fully associative (only one set), where the tag is all the bits except the block and byte offset; decreasing associativity grows the index and shrinks the tag, down to direct mapped (only one way), with smaller tags and only a single comparator]
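A worked instance (added for concreteness): with 1024 one-word blocks and 32-bit byte addresses, direct mapped gives 1024 sets, a 10-bit index, and a 20-bit tag; 4-way set associative gives 256 sets, an 8-bit index, and a 22-bit tag (as in the four-way cache above); fully associative gives one set, no index, and a 30-bit tag.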
Reducing Cache Miss Rates #2
2. Use multiple levels of caches
Normally a unified L2 cache (holding both instructions and data) and in some cases even a unified L3 cache
Multilevel Cache Design Considerations
Design considerations for L1 and L2 caches are very different
Primary cache should focus on minimizing hit time in support of a shorter clock cycle
- Smaller, with smaller block sizes
Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
- Larger, with larger block sizes
- Higher levels of associativity
The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so it can be smaller but have a higher miss rate
For the L2 cache, hit time is less important than miss rate
The L2$ hit time determines the L1$'s miss penalty
The L2$ local miss rate is much larger (>>) than the global miss rate
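An added note on local vs. global rates: the global miss rate is the product of the miss rates along the path, global = miss rate(L1$) x local miss rate(L2$). For example, if the L1$ misses on 5% of accesses and the L2$ misses on 25% of the accesses it actually sees, the global miss rate is 0.05 x 0.25 = 1.25% – which is why the L2$'s local miss rate can look large while the hierarchy still performs well.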
Summary
Memory hierarchy and the locality principle
Cache design
- Direct mapped
- Set associative
Memory access on cache hit and miss
Review: The Memory Hierarchy
[Figure: the same pyramid as before – processor, L1$, L2$, main memory, secondary memory – with increasing distance from the processor in access time and growing (relative) size at each level; transfer units grow from 4-8 bytes (word) to 8-32 bytes (block) to 1 to 4 blocks to 1,024+ bytes (disk sector = page)]
Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology
Virtual Memory
Use main memory as a "cache" for secondary memory
Allows efficient and safe sharing of memory among multiple programs
Provides the ability to easily run programs larger than the size of physical memory
Simplifies loading a program for execution by providing for code relocation (i.e., the code can be loaded anywhere in main memory)
Virtual Memory
What makes it work? – again the Principle of Locality
A program is likely to access a relatively small portion of its address space during any period of time
Each program is compiled into its own address space – a "virtual" address space
During run-time each virtual address must be translated to a physical address (an address in main memory)
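A hedged sketch of the translation step (added; the page size, table layout, and names are illustrative, not from the slides):

    #include <stdint.h>

    #define PAGE_BITS 12                    /* assume 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)
    #define NUM_PAGES 1024                  /* assume a small virtual address space */

    /* Hypothetical page table: virtual page number -> physical page number.
       A real entry would also hold valid, dirty, and protection bits. */
    static uint32_t page_table[NUM_PAGES];

    uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;        /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* unchanged by translation */
        uint32_t ppn    = page_table[vpn];           /* page table lookup */
        return (ppn << PAGE_BITS) | offset;          /* physical address */
    }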
Virtual Memory
A program's address space is divided into pages (all one fixed size) or segments (variable sizes)
The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table
[Figure: the virtual address spaces of Program 1 and Program 2 are mapped onto pages scattered through main memory]
Virtual Memory
Some topics:
Address translation from virtual to physical memory addresses
Handling page faults (misses)
Cache design with virtual memory