1 Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 2
Memory Hierarchy Design
Computer Architecture: A Quantitative Approach, Fifth Edition
Introduction
- Programmers want unlimited amounts of memory with low latency.
- Fast memory technology is more expensive per bit than slower memory.
- Solution: organize the memory system into a hierarchy:
  - The entire addressable memory space is available in the largest, slowest memory.
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
- Temporal and spatial locality ensures that nearly all references can be found in the smaller memories.
- This gives the illusion of a large, fast memory being presented to the processor.
Memory Performance Gap
[Figure: growth in processor memory requests/sec vs. growth in memory accesses/sec over time; the gap between them keeps widening.]
Memory Hierarchy Design
- Memory hierarchy design becomes more crucial with recent multi-core processors: aggregate peak bandwidth grows with the number of cores.
  - An Intel Core i7 can generate two references per core per clock.
  - With four cores and a 3.2 GHz clock: 25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
  - DRAM bandwidth is only 6% of this (25 GB/s).
- This requires:
  - Multi-port, pipelined caches
  - Two levels of cache per core
  - A shared third-level cache on chip
Performance and Power
- High-end microprocessors have >10 MB of on-chip cache, which consumes a large fraction of the area and power budget.
- Static power is consumed through leakage even when the cache is not operating; dynamic power is consumed when performing reads and writes.
- In personal mobile devices (PMDs), caches can account for between 25% and 50% of total power consumption.
Memory Hierarchy Basics
- When a word is not found in the cache, a miss occurs:
  - The word is fetched from a lower level of the hierarchy, requiring a higher-latency reference. The lower level may be another cache or main memory.
  - The other words contained in the block are also fetched, taking advantage of spatial locality.
  - The block is placed into the cache in any location within its set, determined by the block address: set = block address MOD number of sets in the cache.
Copyright © 2013, Daniel A. Menasce. All rights reserved.
Memory Hierarchy Basics – Example
[Figure sequence: a CPU, a four-block cache, and a main memory holding words 000-024 in blocks of four words.]
- Read 007: CACHE MISS; the block containing words 005-008 is brought into the cache and word 007 is delivered to the CPU.
- Read 005: CACHE HIT; word 005 is already in the cached block 005-008 and is delivered immediately.
- Read 018: CACHE MISS; the block containing words 017-020 is brought into the cache and word 018 is delivered to the CPU.
Memory Hierarchy Basics – Block Placement
- Set associative: a block is mapped to a set and may be placed anywhere within that set.
- Finding a block: map the block address to a set, then search the set (usually in parallel) to find the block.
- n blocks per set: n-way set-associative cache.
  - One block per set (n = 1): direct-mapped cache.
  - One set per cache: fully associative cache.
Fully Associative Cache
[Figure: memory blocks 0-5 and a four-block cache. Memory block 3 can go anywhere in the cache.]
2-Way Set-Associative Cache
[Figure: memory blocks 0-5 and a four-block cache organized as two sets (SET 0 and SET 1) of two blocks each. Memory block 3 can only go into cache set (3 mod 2) = 1.]
Direct-Mapped Cache
[Figure: memory blocks 0-5 and a four-block cache with one block per set (SET 0 through SET 3). Memory block 5 can only go into cache block (5 mod 4) = 1.]
Processor/Cache Addressing
[Figure: a 2-way set-associative cache. A 4-bit address is split into TAG <2 bits>, Index <1 bit>, and Block Offset <1 bit>. Memory holds 8 two-word blocks (B0-B7); the block number mod 2 selects the set. Each cache entry holds a valid bit (VB), a tag, and <8>-bit cache data.]
- INDEX: selects the set.
- TAG: compared against the stored tag to check for a cache hit, provided the valid bit (VB) is 1.
Processor/Cache Addressing – Cache Hit
[Figure: the processor issues address 1100. Index = 0 selects the set, the stored tag matches TAG = 11 (=), and the valid bit is 1, so the access is a CACHE HIT and the cache data is returned.]
Processor/Cache Addressing – Cache Miss (WHY?)
[Figure: the processor issues the same address 1100, but the valid bit of the selected entry is 0, so the access is a CACHE MISS even though a tag is stored.]
Processor/Cache Addressing – Cache Miss
[Figure: the processor issues address 0100. Index = 0 selects the set, but the stored tag does not match TAG = 01 (≠), so the access is a CACHE MISS.]
Cache Miss: Which Block to Replace?
- Direct-mapped cache: there is only one option: replace the block in the location where the incoming block has to go.
- Fully associative or set associative:
  - Random: spreads allocation uniformly.
  - Least Recently Used (LRU): replace the block that has been unused for the longest time. This can be expensive to implement in hardware; pseudo-LRU provides an approximation, using bits associated with each set to record when blocks in the set were accessed.
  - First In First Out (FIFO): selects the oldest block rather than the least recently used one.
Memory Hierarchy Basics – Writing to the Cache
Two strategies:
- Write-through: immediately update the lower levels of the hierarchy.
- Write-back: update the lower levels of the hierarchy only when an updated block is replaced.
Both strategies can use a write buffer to make writes asynchronous.
Write-Through
[Figure: Write 018 is a CACHE HIT; the word is updated both in the cached block 017-020 and, immediately, in main memory.]
Write-Back – Part I
[Figure: Write 018 is a CACHE HIT; only the cached copy of block 017-020 is updated. Main memory is not updated yet.]
Write-Back – Part II
[Figure: Read 024 is a CACHE MISS; the block holding the modified word 018 must be replaced, so its contents (017-020) are first written back to main memory.]
Write-Back – Part III
[Figure: BLOCK 1 REPLACEMENT: block 021-024 is then brought into the cache, replacing block 1, and word 024 is delivered to the CPU.]
Memory Hierarchy Basics – Miss Rate
- Miss rate: the fraction of cache accesses that result in a miss.
- Causes of misses (the 3 C's):
  - Compulsory: the first reference to a block. Would happen even with an infinite cache.
  - Capacity: blocks discarded because the cache cannot hold them all, and later retrieved.
  - Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache.
Note that speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.
Memory Hierarchy Basics – Memory Stall Cycles

MemoryStallCycles = NumberOfMisses × MissPenalty
                  = IC × (Misses / Instruction) × MissPenalty
                  = IC × (MemoryAccesses / Instruction) × MissRate × MissPenalty
Memory Performance
Example: Assume that the CPI of a computer is 1.0 when all memory accesses hit in the cache. Data access instructions (loads and stores) account for 50% of all instructions. The miss penalty is 25 cycles, the data-cache miss rate is 4%, and 70% of instruction fetches hit in the cache. What is the execution time in terms of clock cycles (CC) and instruction count (IC)?

Instruction fetch | Data access        | Probability                | No. of CCs
hit  (0.7)        | memory (0.5), miss | 0.7 × 0.5 × 0.04 = 0.014   | 1 + 25 = 26
hit  (0.7)        | memory (0.5), hit  | 0.7 × 0.5 × 0.96 = 0.336   | 1
hit  (0.7)        | no memory (0.5)    | 0.7 × 0.5 = 0.35           | 1
miss (0.3)        | memory (0.5), miss | 0.3 × 0.5 × 0.04 = 0.006   | 25 + 25 + 1 = 51
miss (0.3)        | memory (0.5), hit  | 0.3 × 0.5 × 0.96 = 0.144   | 25 + 1 = 26
miss (0.3)        | no memory (0.5)    | 0.3 × 0.5 = 0.15           | 25 + 1 = 26

ExecutionTime = IC × (0.014×26 + (0.336 + 0.35)×1 + 0.006×51 + (0.144 + 0.15)×26) × CC = 9 × IC × CC
Memory Performance
Example: How would the execution time in the previous example change if the data-cache miss rate were reduced to 2%?

Instruction fetch | Data access        | Probability                | No. of CCs
hit  (0.7)        | memory (0.5), miss | 0.7 × 0.5 × 0.02 = 0.007   | 1 + 25 = 26
hit  (0.7)        | memory (0.5), hit  | 0.7 × 0.5 × 0.98 = 0.343   | 1
hit  (0.7)        | no memory (0.5)    | 0.35                       | 1
miss (0.3)        | memory (0.5), miss | 0.3 × 0.5 × 0.02 = 0.003   | 25 + 25 + 1 = 51
miss (0.3)        | memory (0.5), hit  | 0.3 × 0.5 × 0.98 = 0.147   | 25 + 1 = 26
miss (0.3)        | no memory (0.5)    | 0.15                       | 25 + 1 = 26

ExecutionTime = IC × (0.007×26 + (0.343 + 0.35)×1 + 0.003×51 + (0.147 + 0.15)×26) × CC = 8.75 × IC × CC
Memory Performance
Example: What if the miss rate were kept at 4% but memory access time were improved so that the miss penalty were reduced to 10 cycles?

Instruction fetch | Data access        | Probability | No. of CCs
hit  (0.7)        | memory (0.5), miss | 0.014       | 1 + 10 = 11
hit  (0.7)        | memory (0.5), hit  | 0.336       | 1
hit  (0.7)        | no memory (0.5)    | 0.35        | 1
miss (0.3)        | memory (0.5), miss | 0.006       | 10 + 10 + 1 = 21
miss (0.3)        | memory (0.5), hit  | 0.144       | 10 + 1 = 11
miss (0.3)        | no memory (0.5)    | 0.15        | 10 + 1 = 11

ExecutionTime = IC × (0.014×11 + (0.336 + 0.35)×1 + 0.006×21 + (0.144 + 0.15)×11) × CC = 4.2 × IC × CC
That is an average of 3.2 stall cycles per instruction (CPI = 1.0 + 3.2 = 4.2).
Memory Hierarchy Basics – Six Basic Cache Optimizations
- Larger block size: reduces compulsory misses, but increases capacity and conflict misses and increases the miss penalty.
- Larger total cache capacity: reduces the miss rate, but increases hit time and power consumption.
- Higher associativity: reduces conflict misses, but increases hit time and power consumption.
- More cache levels (multi-level caches): reduces overall memory access time:
  AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2)
- Giving priority to read misses over writes: reduces the miss penalty.
- Avoiding address translation during cache indexing: reduces hit time.
Ten Advanced Optimizations
Based on improvements in:
- Hit time
- Miss rate
- Miss penalty
- Cache bandwidth
- Power consumption
(1) Small and Simple First-Level Caches
- Critical timing path: addressing the tag memory, then comparing tags, then selecting the correct set.
- Direct-mapped caches can overlap tag comparison and transmission of data.
- Lower associativity reduces power because fewer cache lines are accessed.
L1 Size and Associativity
[Figure: access time vs. cache size and associativity.]
L1 Size and Associativity
[Figure: energy per read vs. cache size and associativity.]
Memory Performance Example
Determine whether a 32-KB four-way set-associative L1 cache has a faster memory access time than a 32-KB two-way set-associative L1 cache. Assume that the miss penalty to L2 is 15 clock cycles and that the hit time for the faster L1 cache is 1 clock cycle. Ignore misses beyond the L2 cache and assume the L1 miss rates below.
- Miss rate for 2-way associative = 0.038
- Miss rate for 4-way associative = 0.037
Memory Performance Example – Solution
- Miss rate for 2-way associative = 0.038; miss rate for 4-way associative = 0.037.
- Hit time for 2-way associative = 1.
  Avg. access time (2-way) = HitTime + MissRate × MissPenalty = 1 + 0.038 × 15 = 1.57
- The access-time figure above shows that the access time of the 4-way L1 cache is approximately 1.4 times that of the 2-way cache.
  Avg. access time (4-way) = HitTime + MissRate × MissPenalty = 1.4 × 1 + 0.037 × 15 = 1.955
- So the 2-way set-associative cache has the faster average access time.
(2) Way Prediction
- To improve hit time, predict the way (i.e., the block within the set) of the next cache access and pre-set the multiplexer accordingly.
- A mis-prediction results in a longer hit time.
- Prediction accuracy: > 90% for two-way, > 80% for four-way; the I-cache has better accuracy than the D-cache.
- First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8.
(3) Pipelined Cache Access
- Pipeline cache access to improve bandwidth.
- Examples: Pentium: 1 cycle; Pentium Pro through Pentium III: 2 cycles; Pentium 4 through Core i7: 4 cycles.
- Increases the branch mis-prediction penalty.
- Makes it easier to increase associativity.
(4) Nonblocking Caches
- Allow hits before previous misses complete: "hit under miss" and "hit under multiple misses".
- The L2 cache must support this.
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty.
(5) Multibanked Caches
- Organize the cache as independent banks to support simultaneous accesses.
  - The ARM Cortex-A8 supports 1-4 banks for L2; the Intel i7 supports 4 banks for L1 and 8 banks for L2.
- Interleave banks according to block address.
(6) Critical Word First, Early Restart
- Critical word first: request the missed word from memory first and send it to the processor as soon as it arrives.
- Early restart: request words in normal order, but send the missed word to the processor as soon as it arrives.
- The effectiveness of these strategies depends on the block size and on the likelihood of another access to the portion of the block that has not yet been fetched.
(7) Merging Write Buffer
- When storing to a block that is already pending in the write buffer, update the existing write-buffer entry instead of allocating a new one.
- Reduces stalls due to a full write buffer.
- Does not apply to I/O addresses.
[Figure: write-buffer contents without and with write merging.]
(8) Compiler Optimizations
- Loop interchange: swap nested loops to access memory in sequential order.
- Blocking (see examples on pages 89-90): instead of accessing entire rows or columns, subdivide matrices into blocks. Requires more memory accesses but improves the locality of those accesses.
Loop Interchange
Assume a [5000][100] array stored in row-major order (i.e., laid out by rows).

Before (strided access: x[0][0], x[1][0], x[2][0], ...):
  for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
      x[i][j] = 2 * x[i][j];

After (sequential access: x[0][0], x[0][1], x[0][2], ...):
  for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
      x[i][j] = 2 * x[i][j];

Which one produces fewer misses?
(9) Hardware Prefetching
- Fetch two blocks on a miss: the requested block and the next sequential block.
[Figure: performance improvement from Pentium 4 prefetching on a subset of the SPEC 2000 programs.]
(10) Compiler Prefetching
- Insert prefetch instructions before the data is needed.
- Non-faulting: a prefetch does not cause exceptions.
- Register prefetch: loads data into a register.
- Cache prefetch: loads data into the cache.
- Combine with loop unrolling and software pipelining.
Main Memory Technology
- Performance metrics:
  - Latency is the concern of the cache.
  - Bandwidth is the concern of multiprocessors and I/O.
  - Access time: the time between a read request and when the desired word arrives.
  - Cycle time: the minimum time between unrelated requests to memory.
- DRAM is used for main memory; SRAM is used for caches.
Memory Technology
- SRAM:
  - Does not require refresh.
  - Requires low power to retain its bits.
  - Requires 6 transistors per bit.
- DRAM (used for main memory):
  - Must be re-written after being read.
  - Must also be periodically refreshed (every ~8 ms); each row can be refreshed simultaneously by reading the row.
  - One transistor per bit.
  - Address lines are multiplexed: the upper half of the address is the row access strobe (RAS); the lower half is the column access strobe (CAS).
Memory Technology – Timing
- Access time: the time between a memory request and when the desired word arrives.
- Cycle time: the minimum time between unrelated requests to memory.
- SRAMs do not need to refresh, so access time and cycle time are very close.
Memory Technology – Optimizations
- Amdahl's rule of thumb: memory capacity should grow linearly with processor speed. Unfortunately, memory capacity and speed have not kept pace with processors.
- Some optimizations:
  - Multiple accesses to the same row (using the row buffer).
  - Synchronous DRAM (SDRAM): adds a clock to the DRAM interface; burst mode with critical word first.
  - Wider interfaces (DDR2 and DDR3).
  - Double data rate (DDR): transfers data on both the rising and falling edges of the clock.
  - Multiple independent banks on each DRAM device, so an address becomes (bank #, row #, column #).
Memory Optimizations – DDR Generations
- DDR2: lower power (2.5 V -> 1.8 V); higher clock rates (266 MHz, 333 MHz, 400 MHz).
- DDR3: 1.5 V; maximum clock frequency 800 MHz.
- DDR4: 1-1.2 V; maximum clock frequency 1600 MHz.
- GDDR5 is graphics memory based on DDR3, used in GPUs.
Memory Optimizations – Graphics Memory
- Achieves 2-5x the bandwidth per DRAM of DDR3:
  - Wider interfaces (32 bits vs. 16 bits).
  - Higher clock rates, possible because the chips are attached by soldering instead of via socketed DIMM modules.
- Reducing power in SDRAMs:
  - Lower voltage (1.35 V or 1.5 V).
  - A low-power mode (ignores the clock but continues to refresh).
Memory Power Consumption
[Figure: power consumption of a Micron 1.5 V 2 Gb DDR3-1066 SDRAM.]
Flash Memory
- A type of EEPROM (Electrically Erasable Programmable Read-Only Memory).
- Must be erased (in blocks) before being overwritten.
- Non-volatile; limited number of write cycles.
- Cheaper than SDRAM, more expensive than disk.
- Slower than SDRAM, faster than disk.
Memory Dependability
- Memory is susceptible to cosmic rays.
- Soft errors: dynamic errors; detected and fixed by error-correcting codes (ECC).
- Hard errors: permanent errors; use spare rows to replace defective rows.
- Chipkill: a RAID-like error-recovery technique in which data and ECC are distributed so that the loss of a single memory chip can be tolerated and its contents reconstructed from the others.
Virtual Memory Basics
- The address space of a process is divided into equal-sized pages (e.g., 4 KB in size).
- Main memory is divided into page frames that can each hold one page.
- The size of a process's address space can be larger than main memory.
- At any time, some pages of a process may not be in main memory; all pages are kept on the paging disk.
- When a page that is not in memory is referenced, it must be brought from disk to memory. This is a page fault.
Virtual Memory Basics (continued)
- If all page frames are occupied when a page fault occurs, a page has to be evicted from memory using a page replacement algorithm (PRA); Least Recently Used (LRU) is an example of a PRA.
- There is a page table per process, maintained by the operating system.
- Translation from a virtual address to a physical address is done by the hardware.
- A page fault invokes the operating system.
Virtual Memory Basics – Translation Example
[Figure: a 16-page process address space (pages 0-15, holding A-P) on the paging disk, a four-frame main memory, and the process's page table. The page table maps VPN -> frame number: page 2 (C) -> frame 0, page 15 (P) -> frame 1, page 13 (N) -> frame 2, page 1 (B) -> frame 3. A virtual address (VPN, page offset) is translated to a physical address (PFN, page offset).]
Virtual Memory Basics – Translation Hit
[Figure: the processor references VPN 2; the page table shows that page 2 is in frame 0, so the physical address is (0, page offset).]
Virtual Memory Basics – Page Fault
[Figure: the processor references VPN 14; the page table has no frame for page 14, so a Page Fault occurs.]
Virtual Memory Basics – After the Page Fault
[Figure: the OS brings page 14 (O) into frame 2, replacing page 13 (N), and updates the page table; the reference to VPN 14 now translates to frame 2.]
Virtual Memory Example
A computer system has 4 GB of main memory and a virtual address space of 64 GB. Each page is 8 KB long.
- How many pages are there in the virtual address space?
- How many page frames are there in memory?
- How many bits are needed for the VPN field of an address?
- How many bits are needed for the page frame field of an address?
- How many bits are needed for the page offset field?
How many pages are there in the virtual address space?
64 × 2^30 / (8 × 2^10) = 2^36 / 2^13 = 2^23 pages
How many page frames are there in memory?
4 × 2^30 / (8 × 2^10) = 2^32 / 2^13 = 2^19 page frames
How many bits are needed for the VPN field of an address?
2^23 pages => 23 bits
How many bits are needed for the page frame field of an address?
2^19 page frames => 19 bits
How many bits are needed for the page offset field?
Each page holds 8 KB = 2^13 bytes. If a word is 8 bytes (2^3 bytes) long, each page has 2^13 / 2^3 = 2^10 words, so 10 bits are needed for a word offset (13 bits for a byte offset).
Virtual Memory – Protection
- Protection via virtual memory keeps processes in their own memory space.
- Role of the architecture:
  - Provide user mode and supervisor mode.
  - Protect certain aspects of the CPU state.
  - Provide mechanisms for switching between user mode and supervisor mode.
  - Provide mechanisms to limit memory accesses.
  - Provide a TLB to translate addresses.
Virtual Machines
- Support isolation and security when sharing a computer among many unrelated users.
- Enabled by the raw speed of modern processors, which makes the overhead more acceptable.
- Allow different ISAs and operating systems to be presented to user programs ("system virtual machines").
- The SVM software is called a "virtual machine monitor" (VMM) or "hypervisor"; the individual virtual machines running under the monitor are called "guest VMs".
Impact of VMs on Virtual Memory
- Each guest OS maintains its own set of page tables.
- The VMM adds a level of memory between physical and virtual memory called "real memory".
- The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses.
  - This requires the VMM to detect the guest's changes to its own page table, which occurs naturally if accessing the page table pointer is a privileged operation.