
1 Copyright © 2012, Elsevier Inc. All rights reserved.

Chapter 2

Memory Hierarchy Design

Computer Architecture A Quantitative Approach, Fifth Edition

2 Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction

- Programmers want unlimited amounts of memory with low latency
- Fast memory technology is more expensive per bit than slower memory
- Solution: organize the memory system into a hierarchy
  - Entire addressable memory space available in the largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  - Gives the processor the illusion of a large, fast memory

3 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy

[Figure: the levels of the memory hierarchy.]

4 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Performance Gap

[Figure: the processor-memory performance gap over time; the processor curve tracks the increase in memory requests/sec, the memory curve the increase in memory accesses/sec.]

5 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Design

- Memory hierarchy design becomes more crucial with recent multi-core processors
- Aggregate peak bandwidth grows with the number of cores:
  - The Intel Core i7 can generate two data references per core per clock
  - With four cores and a 3.2 GHz clock:
    - 25.6 billion 64-bit data references/second, plus
    - 12.8 billion 128-bit instruction references/second
    - = 409.6 GB/s of peak demand! (see the arithmetic sketch below)
  - DRAM bandwidth is only 6% of this (25 GB/s)
- Requires:
  - Multi-port, pipelined caches
  - Two levels of cache per core
  - A shared third-level cache on chip
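The peak-demand arithmetic can be checked directly; a short C sketch using the slide's constants (the program itself is not from the slides):

#include <stdio.h>

int main(void)
{
    double clock_hz = 3.2e9;
    double cores    = 4.0;
    double data_refs = cores * clock_hz * 2.0;  /* 25.6e9 64-bit refs/s  */
    double inst_refs = cores * clock_hz;        /* 12.8e9 128-bit refs/s */
    double bytes_per_s = data_refs * 8.0 + inst_refs * 16.0;
    printf("peak demand = %.1f GB/s\n", bytes_per_s / 1e9);  /* 409.6 */
    printf("25 GB/s DRAM covers %.0f%% of it\n", 25e9 / bytes_per_s * 100.0);
    return 0;
}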

6 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance and Power

- High-end microprocessors have >10 MB of on-chip cache
  - Consumes a large amount of the area and power budget
  - Static power due to leakage even when not operating, and dynamic power when performing reads/writes
- Power consumption due to caches in personal mobile devices (PMDs) can be between 25% and 50% of the total power consumption

7 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics

- When a word is not found in the cache, a miss occurs:
  - Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  - The lower level may be another cache or the main memory
  - Also fetch the other words contained within the block
    - Takes advantage of spatial locality
  - Place the block anywhere within its set in the cache; the set is determined by the address:
    - block address MOD number of sets in the cache (see the sketch below)
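A minimal C sketch of the set-mapping rule above; the block size and set count are illustrative, not from the slides:

#include <stdio.h>

#define BLOCK_SIZE 64   /* illustrative geometry */
#define NUM_SETS   4

/* Map a byte address to its cache set: (block address) MOD (number of sets). */
unsigned set_index(unsigned addr)
{
    unsigned block_addr = addr / BLOCK_SIZE;  /* strip the block offset */
    return block_addr % NUM_SETS;
}

int main(void)
{
    printf("address 0x%x -> set %u\n", 0x1234, set_index(0x1234));
    return 0;
}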

8 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics

[Figure: the CPU issues Read 007; the word is not in the cache, so the access is a CACHE MISS. Memory holds words 000-024; the cache is empty.]

9 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics

[Figure: the block containing words 005-008 is brought from memory into the cache, and word 007 is delivered to the CPU.]

10 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics

[Figure: the CPU issues Read 005; block 005-008 is already in the cache, so the access is a CACHE HIT.]

11 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics

[Figure: word 005 is delivered to the CPU directly from the cache.]

12 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics

[Figure: the CPU issues Read 018; the block containing it is not in the cache, so the access is a CACHE MISS.]

13 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Memory Hierarchy Basics

[Figure: the block containing words 017-020 is brought from memory into the cache, and word 018 is delivered to the CPU.]

14 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics

- Placement of blocks in a cache:
  - Set associative: a block is mapped to a set and may be placed anywhere within that set
  - Finding a block: map the block address to a set, then search the set (usually in parallel) to find the block
  - n blocks per set: n-way set associative cache
  - One block per set (n = 1): direct-mapped cache
  - One set per cache: fully associative cache

15 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Fully Associative Cache

[Figure: memory blocks 0-5 and a four-block cache. Memory block 3 can go anywhere in the cache.]

16 Copyright © 2013, Daniel A. Menasce. All rights reserved.

2-Way Associative Cache

[Figure: a four-block cache organized as two sets of two blocks. Memory block 3 can only go into cache set (3 mod 2) = 1.]

17 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Direct-mapped Cache

[Figure: a four-block cache organized as four one-block sets. Memory block 5 can only go into cache block (5 mod 4) = 1.]

18 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Processor/Cache Addressing

[Figure: a 2-way associative cache in front of a 16-word memory of 8-bit words. The processor's 4-bit address is split into a TAG <2>, an Index <1>, and a Block Offset <1>: the index selects the set, the tag is checked for a cache hit only if the valid bit (VB) is 1, and the offset selects the word within the block. Memory blocks B0-B7 map alternately to Set 0 and Set 1. The address-decomposition sketch below mirrors this format.]
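As a concrete check of the field widths, a minimal C sketch (not from the slides; the geometry is the figure's 2-bit tag, 1-bit index, and 1-bit offset):

#include <stdio.h>

/* Decompose the figure's 4-bit address into tag, set index, and offset. */
typedef struct { unsigned tag, index, offset; } Fields;

Fields split(unsigned addr4)
{
    Fields f;
    f.offset = addr4 & 0x1;         /* low bit: word within the block */
    f.index  = (addr4 >> 1) & 0x1;  /* next bit: selects the set      */
    f.tag    = (addr4 >> 2) & 0x3;  /* top two bits: compared on hits */
    return f;
}

int main(void)
{
    Fields f = split(0xC);          /* address 1100 from the figure */
    printf("tag=%u index=%u offset=%u\n", f.tag, f.index, f.offset);
    /* prints tag=3 (binary 11), index=0, offset=0 */
    return 0;
}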

19 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Processor/Cache Addressing

[Figure: the processor issues address 1100 (tag 11, index 0, offset 0). The indexed set holds a line with valid bit 1 and a matching tag, so the comparison succeeds: CACHE HIT, and the cached data is returned.]

20 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Processor/Cache Addressing

[Figure: the processor issues address 1100 again, but the valid bit of the candidate line is now 0, so the access is a CACHE MISS. WHY? A tag match counts as a hit only when the valid bit is 1.]

21 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Processor/Cache Addressing

[Figure: the processor issues address 0100 (tag 01, index 0, offset 0); no valid line in the indexed set has tag 01, so the access is a CACHE MISS.]

22 Copyright © 2012, Elsevier Inc. All rights reserved.

Cache Miss: Block Replacement

- Which block to replace?
  - Direct-mapped cache:
    - Only one option: replace the block in the location where the incoming block has to go
  - Fully associative or set associative:
    - Random: spreads allocation uniformly
    - Least Recently Used (LRU): replace the block that has been unused for the longest time. Can be expensive to implement in hardware; pseudo-LRU provides an approximation using bits associated with each set that record when the blocks in the set were accessed (see the sketch below)
    - First In First Out (FIFO): selects the oldest block rather than the LRU block
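A minimal C sketch of LRU victim selection within one set, using per-line age counters; the 4-way geometry and field names are illustrative, and real hardware typically uses the pseudo-LRU bits mentioned above:

#include <stdio.h>
#include <limits.h>

#define WAYS 4

typedef struct {
    unsigned tag;
    int      valid;
    unsigned last_used;   /* timestamp of the most recent access */
} Line;

/* Return the way to evict: an invalid line if one exists, else the LRU line. */
int victim(const Line set[WAYS])
{
    int lru = 0;
    unsigned oldest = UINT_MAX;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                 /* free slot: no eviction needed */
        if (set[i].last_used < oldest) {
            oldest = set[i].last_used;
            lru = i;
        }
    }
    return lru;
}

int main(void)
{
    Line set[WAYS] = { {1, 1, 40}, {2, 1, 10}, {3, 1, 30}, {4, 1, 20} };
    printf("evict way %d\n", victim(set));   /* way 1: least recently used */
    return 0;
}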

23 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics

- Writing to the cache: two strategies
  - Write-through
    - Immediately update lower levels of the hierarchy
  - Write-back
    - Only update lower levels of the hierarchy when an updated block is replaced
- Both strategies use a write buffer to make writes asynchronous (see the sketch below)
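A minimal C sketch contrasting the two policies on a write hit; the toy backing store and mem_write() are illustrative stand-ins for the lower level of the hierarchy:

#include <stdio.h>

#define BLOCK_WORDS 4

typedef struct {
    unsigned words[BLOCK_WORDS];
    int dirty;                      /* used by write-back only */
} Block;

static unsigned memory[1024];       /* toy backing store */
static void mem_write(unsigned addr, unsigned v) { memory[addr] = v; }

/* Write-through: update the cached copy and the lower level immediately. */
void write_through(Block *b, unsigned addr, unsigned off, unsigned v)
{
    b->words[off] = v;
    mem_write(addr, v);
}

/* Write-back: update only the cached copy and mark it dirty; the lower
   level is updated later, when the block is evicted. */
void write_back(Block *b, unsigned off, unsigned v)
{
    b->words[off] = v;
    b->dirty = 1;
}

int main(void)
{
    Block b = {{0}, 0};
    write_through(&b, 18, 1, 99);   /* cache and memory both hold 99  */
    write_back(&b, 1, 100);         /* only the cache holds 100; dirty */
    printf("cache=%u memory=%u dirty=%d\n", b.words[1], memory[18], b.dirty);
    return 0;
}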

24 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Write-through

[Figure: the CPU issues Write 018; block 017-020 is in the cache, so it is a CACHE HIT. The new value of word 018 is written both to the cache block and through to memory.]

25 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Write-back – Part I

[Figure: the CPU issues Write 018; block 017-020 is in the cache, so it is a CACHE HIT. Only the cached copy is updated; memory is not.]

26 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Write-back – Part II

[Figure: the CPU issues Read 024; the block containing it is not in the cache, so it is a CACHE MISS, and the modified block 017-020 must be replaced.]

27 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Write-back – Part III

[Figure: BLOCK 1 REPLACEMENT: the modified block 017-020 is written back to memory, block 021-024 is loaded in its place, and word 024 is delivered to the CPU.]

28 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics

- Miss rate
  - Fraction of cache accesses that result in a miss
- Causes of misses (the 3 C's)
  - Compulsory
    - First reference to a block; would happen even with an infinite cache
  - Capacity
    - Blocks discarded and later retrieved
  - Conflict
    - The program makes repeated references to multiple addresses from different blocks that map to the same location in the cache

29

- Note that speculative and multithreaded processors may execute other instructions during a miss
  - Reduces the performance impact of misses

Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics Introduction

MemoryStallCycles = NumberOfMisses×MissPenalty =

= IC×Misses

Instruction×MissPenalty

= IC×MemoryAccesses

Instruction×MissRate ×MissPenalty

30 Copyright © 2013, Daniel A. Menasce All rights reserved.

Memory Performance

Example: Assume that the CPI of a computer is 1.0 when all memory accesses hit in the cache. Data access instructions (loads and stores) account for 50% of all instructions. The miss penalty is 25 cycles, the data-cache miss rate is 4%, and 70% of instruction fetches hit in the cache. What is the execution time in terms of instruction count (IC) and clock cycle time (CC)?

I-cache     Data ref?        D-cache      Probability   No. CCs
hit (0.7)   memory (0.5)     miss (0.04)  0.014         1 + 25 = 26
hit (0.7)   memory (0.5)     hit (0.96)   0.336         1
hit (0.7)   no memory (0.5)  -            0.35          1
miss (0.3)  memory (0.5)     miss (0.04)  0.006         25 + 25 + 1 = 51
miss (0.3)  memory (0.5)     hit (0.96)   0.144         25 + 1 = 26
miss (0.3)  no memory (0.5)  -            0.15          25 + 1 = 26

Execution time = IC × (0.014×26 + (0.336+0.35)×1 + 0.006×51 + (0.144+0.15)×26) × CC = 9 × IC × CC

31 Copyright © 2013, Daniel A. Menasce All rights reserved.

Memory Performance

Example: How would the execution time in the previous example change if the data-cache miss rate were reduced to 2%?

I-cache     Data ref?        D-cache      Probability   No. CCs
hit (0.7)   memory (0.5)     miss (0.02)  0.007         1 + 25 = 26
hit (0.7)   memory (0.5)     hit (0.98)   0.343         1
hit (0.7)   no memory (0.5)  -            0.35          1
miss (0.3)  memory (0.5)     miss (0.02)  0.003         25 + 25 + 1 = 51
miss (0.3)  memory (0.5)     hit (0.98)   0.147         25 + 1 = 26
miss (0.3)  no memory (0.5)  -            0.15          25 + 1 = 26

Execution time = IC × (0.007×26 + (0.343+0.35)×1 + 0.003×51 + (0.147+0.15)×26) × CC = 8.75 × IC × CC

32 Copyright © 2013, Daniel A. Menasce All rights reserved.

Memory Performance

Example: What if the miss rate were kept at 4% but memory access time were improved so that the miss penalty were reduced to 10 cycles?

I-cache     Data ref?        D-cache      Probability   No. CCs
hit (0.7)   memory (0.5)     miss (0.04)  0.014         1 + 10 = 11
hit (0.7)   memory (0.5)     hit (0.96)   0.336         1
hit (0.7)   no memory (0.5)  -            0.35          1
miss (0.3)  memory (0.5)     miss (0.04)  0.006         10 + 10 + 1 = 21
miss (0.3)  memory (0.5)     hit (0.96)   0.144         10 + 1 = 11
miss (0.3)  no memory (0.5)  -            0.15          10 + 1 = 11

Execution time = IC × (0.014×11 + (0.336+0.35)×1 + 0.006×21 + (0.144+0.15)×11) × CC = 4.2 × IC × CC

That is an average of 3.2 stall cycles per instruction. A sketch reproducing all three computations follows.
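The three worked examples can be reproduced with a short C program; a sketch assuming the slides' event tree (70% i-cache hit rate, half of all instructions access data):

#include <stdio.h>

/* Expected cycles per instruction for a given d-cache miss rate and
   miss penalty, following the probability tree on the slides. */
double cpi(double d_miss, double penalty)
{
    double i_hit = 0.7, i_miss = 0.3, mem = 0.5;
    return i_hit  * mem * d_miss       * (1 + penalty)
         + i_hit  * mem * (1 - d_miss) * 1
         + i_hit  * (1 - mem)          * 1
         + i_miss * mem * d_miss       * (penalty + penalty + 1)
         + i_miss * mem * (1 - d_miss) * (penalty + 1)
         + i_miss * (1 - mem)          * (penalty + 1);
}

int main(void)
{
    printf("4%% miss, 25-cycle penalty: %.3f\n", cpi(0.04, 25)); /* 9.000 */
    printf("2%% miss, 25-cycle penalty: %.3f\n", cpi(0.02, 25)); /* 8.750 */
    printf("4%% miss, 10-cycle penalty: %.3f\n", cpi(0.04, 10)); /* 4.200 */
    return 0;
}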

33 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics

- Six basic cache optimizations:
  - Larger block size
    - Reduces compulsory misses
    - Increases capacity and conflict misses, increases miss penalty
  - Larger total cache capacity to reduce miss rate
    - Increases hit time, increases power consumption
  - Higher associativity
    - Reduces conflict misses
    - Increases hit time, increases power consumption
  - Higher number of cache levels (multi-level caches)
    - Reduces overall memory access time:
      HitTime(L1) + MissRate(L1) × (HitTime(L2) + MissRate(L2) × MissPenalty(L2))
  - Giving priority to read misses over writes
    - Reduces miss penalty
  - Avoiding address translation during cache indexing
    - Reduces hit time

34 Copyright © 2012, Elsevier Inc. All rights reserved.

Ten Advanced Optimizations

- Based on improvements in:
  - Hit time
  - Miss rate
  - Miss penalty
  - Cache bandwidth
  - Power consumption

35 Copyright © 2012, Elsevier Inc. All rights reserved.

Ten Advanced Optimizations (1)

- Small and simple first-level caches
  - Critical timing path:
    - addressing the tag memory, then
    - comparing tags, then
    - selecting the correct data item
  - Direct-mapped caches can overlap the tag compare and the transmission of data
  - Lower associativity reduces power because fewer cache lines are accessed

36 Copyright © 2012, Elsevier Inc. All rights reserved.

L1 Size and Associativity

[Figure: access time vs. cache size and associativity.]

37 Copyright © 2012, Elsevier Inc. All rights reserved.

L1 Size and Associativity

[Figure: energy per read vs. cache size and associativity.]

38 Copyright © 2013, Daniel A. Menasce All rights reserved.

Memory Performance Example

Determine whether a 32-KB 4-way set associative L1 cache has a faster memory access time than a 32-KB 2-way set associative L1 cache. Assume that the miss penalty to L2 is 15 clock cycles and that the hit time for the faster L1 cache is 1 clock cycle. Ignore misses beyond the L2 cache and assume the L1 miss rates below.

Miss rate for 2-way associative = 0.038
Miss rate for 4-way associative = 0.037

39 Copyright © 2013, Daniel A. Menasce All rights reserved.

Memory Performance Example

Miss rate for 2-way associative = 0.038
Miss rate for 4-way associative = 0.037

Hit time for 2-way associative = 1.
Avg. access time for 2-way associative = HitTime + MissRate × MissPenalty = 1 + 0.038 × 15 = 1.57

The figure on slide 36 shows that the access time of the 4-way L1 cache is approximately 1.4 times that of the 2-way cache.
Avg. access time for 4-way associative = HitTime + MissRate × MissPenalty = 1.4 × 1 + 0.037 × 15 = 1.955

So the 2-way cache has the faster average access time. A sketch of the computation follows.
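A short C sketch of the two average-access-time computations above (the 1.4x hit-time ratio read off the figure on slide 36 is taken as given):

#include <stdio.h>

/* Average memory access time = hit time + miss rate * miss penalty. */
int main(void)
{
    double penalty = 15.0;
    double amat2 = 1.0 + 0.038 * penalty;        /* 2-way: hit time 1.0  */
    double amat4 = 1.4 + 0.037 * penalty;        /* 4-way: hit time 1.4x */
    printf("2-way AMAT = %.3f cycles\n", amat2); /* 1.570 */
    printf("4-way AMAT = %.3f cycles\n", amat4); /* 1.955 */
    return 0;
}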

40 Copyright © 2012, Elsevier Inc. All rights reserved.

Way Prediction (2)

- To improve hit time, predict the way (i.e., the block within the set) of the next cache access, and pre-set the multiplexor
  - A mis-prediction gives a longer hit time
  - Prediction accuracy:
    - > 90% for two-way
    - > 80% for four-way
    - The I-cache has better accuracy than the D-cache
  - First used on the MIPS R10000 in the mid-90s
  - Used on the ARM Cortex-A8

41 Copyright © 2012, Elsevier Inc. All rights reserved.

Pipelining Cache (3)

- Pipeline cache access to improve bandwidth
  - Examples:
    - Pentium: 1 cycle
    - Pentium Pro – Pentium III: 2 cycles
    - Pentium 4 – Core i7: 4 cycles
  - Increases the branch mis-prediction penalty
  - Makes it easier to increase associativity

42 Copyright © 2012, Elsevier Inc. All rights reserved.

Nonblocking Caches (4)

- Allow hits before previous misses complete
  - "Hit under miss"
  - "Hit under multiple miss"
- L2 must support this
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty

43 Copyright © 2012, Elsevier Inc. All rights reserved.

Multibanked Caches (5)

- Organize the cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2
- Interleave banks according to block address (see the sketch below)
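A one-function C sketch of sequential interleaving (bank = block address MOD number of banks); the bank count is illustrative:

#include <stdio.h>

#define NUM_BANKS 4   /* illustrative */

/* Consecutive block addresses rotate across the banks. */
unsigned bank_of(unsigned block_addr) { return block_addr % NUM_BANKS; }

int main(void)
{
    for (unsigned blk = 0; blk < 8; blk++)
        printf("block %u -> bank %u\n", blk, bank_of(blk));
    return 0;
}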

44 Copyright © 2012, Elsevier Inc. All rights reserved.

Critical Word First, Early Restart (6)

- Critical word first
  - Request the missed word from memory first
  - Send it to the processor as soon as it arrives
- Early restart
  - Request words in normal order
  - Send the missed word to the processor as soon as it arrives
- The effectiveness of these strategies depends on block size and on the likelihood of another access to the portion of the block that has not yet been fetched

45 Copyright © 2012, Elsevier Inc. All rights reserved.

Merging Write Buffer (7)

- When storing to a block that is already pending in the write buffer, update the write buffer entry
  - Reduces stalls due to a full write buffer
  - Does not apply to I/O addresses

[Figure: write-buffer entries without and with write merging.]

46 Copyright © 2012, Elsevier Inc. All rights reserved.

Compiler Optimizations (8)

- Loop interchange
  - Swap nested loops to access memory in sequential order
- Blocking (see the examples on pages 89-90, and the sketch below)
  - Instead of accessing entire rows or columns, subdivide matrices into blocks
  - Requires more memory accesses but improves the locality of those accesses
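A minimal C sketch of blocking for matrix multiply, in the spirit of the examples on pages 89-90; the matrix and tile sizes are illustrative:

#include <stdio.h>

#define N 128
#define B 32            /* illustrative tile size */

static double x[N][N], y[N][N], z[N][N];

/* Blocked (tiled) matrix multiply: each B x B tile is kept hot in the
   cache before moving on. */
void block_mm(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { y[i][j] = 1.0; z[i][j] = 1.0; }
    block_mm();
    printf("x[0][0] = %.0f\n", x[0][0]);   /* prints 128, i.e., N */
    return 0;
}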

47 Copyright © 2012, Elsevier Inc. All rights reserved.

Loop Interchange

- Assume a [5000][100] array stored in row-major order (i.e., laid out by rows)
- Before: accesses x[0][0], x[1][0], x[2][0], ...

for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

- After: accesses x[0][0], x[0][1], x[0][2], ...

for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];

- Which one produces fewer misses?

48 Copyright © 2012, Elsevier Inc. All rights reserved.

Hardware Prefetching (9)

- Fetch two blocks on a miss (the requested block and the next sequential block)

[Figure: speedup from Pentium 4 pre-fetching on a subset of the SPEC 2000 programs.]

49 Copyright © 2012, Elsevier Inc. All rights reserved.

Compiler Prefetching (10)

- Insert prefetch instructions before the data is needed
- Non-faulting: a prefetch doesn't cause exceptions
  - Register prefetch
    - Loads data into a register
  - Cache prefetch
    - Loads data into the cache
- Combine with loop unrolling and software pipelining

50 Copyright © 2012, Elsevier Inc. All rights reserved.

Summary

[Figure: summary table of the ten advanced cache optimizations and their impact.]

51 Copyright © 2012, Elsevier Inc. All rights reserved.

Main Memory Technology

- Performance metrics
  - Latency is the concern of the cache
  - Bandwidth is the concern of multiprocessors and I/O
  - Access time
    - Time between a read request and when the desired word arrives
  - Cycle time
    - Minimum time between unrelated requests to memory
- DRAM is used for main memory, SRAM for caches

52 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Technology

- SRAM
  - Does not require refresh
  - Requires low power to retain its bits
  - Requires 6 transistors/bit
- DRAM (used for main memory)
  - Must be re-written after being read
  - Must also be periodically refreshed
    - Every ~8 ms
    - All bits in a row are refreshed simultaneously by reading the row
  - One transistor/bit
  - Address lines are multiplexed (see the sketch below):
    - Upper half of address: row access strobe (RAS)
    - Lower half of address: column access strobe (CAS)
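A minimal C sketch of the RAS/CAS address split; the 16-bit address width is illustrative:

#include <stdio.h>

int main(void)
{
    unsigned addr = 0xBEEF;         /* illustrative 16-bit address */
    unsigned row = addr >> 8;       /* upper half: sent with RAS */
    unsigned col = addr & 0xFF;     /* lower half: sent with CAS */
    printf("row=0x%02X col=0x%02X\n", row, col);
    return 0;
}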

53 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Technology

- Access time
  - Time between a memory request and when the desired word arrives
- Cycle time
  - Minimum time between unrelated requests to memory
- SRAMs don't need to refresh, so cycle time and access time are very close

54 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Technology

- Amdahl:
  - Memory capacity should grow linearly with processor speed
  - Unfortunately, memory capacity and speed have not kept pace with processors
- Some optimizations:
  - Multiple accesses to the same row (using the row buffer)
  - Synchronous DRAM (SDRAM)
    - Added a clock to the DRAM interface
    - Burst mode with critical word first
  - Wider interfaces (DDR2 and DDR3)
  - Double data rate (DDR): transfer data on both the rising and falling edges of the clock
  - Multiple independent banks on each DRAM device, so an address = (bank #, row #, column #)

55 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Optimizations

[Figure/table omitted.]

56 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Optimizations

[Figure/table omitted.]

57 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Optimizations

- DDR:
  - DDR2
    - Lower power (2.5 V -> 1.8 V)
    - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
  - DDR3
    - 1.5 V
    - Maximum clock frequency: 800 MHz
  - DDR4
    - 1-1.2 V
    - Maximum clock frequency: 1600 MHz
- GDDR5 is graphics memory based on DDR3
  - Used in GPUs

58 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Optimizations

- Graphics memory:
  - Achieves 2-5x the bandwidth per DRAM of DDR3
    - Wider interfaces (32 bits vs. 16 bits)
    - Higher clock rate
      - Possible because the chips are attached by soldering instead of via socketed DIMM modules
- Reducing power in SDRAMs:
  - Lower voltage (1.35 V or 1.5 V)
  - Low-power mode (ignores the clock, but continues to refresh)

59 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Power Consumption

[Figure: power consumption of a Micron 1.5 V 2 Gb DDR3-1066 SDRAM.]

60 Copyright © 2012, Elsevier Inc. All rights reserved.

Flash Memory

- A type of EEPROM (Electronically Erasable Programmable Read-Only Memory)
- Must be erased (in blocks) before being overwritten
- Nonvolatile
- Limited number of write cycles
- Cheaper than SDRAM, more expensive than disk
- Slower than SDRAM, faster than disk

61 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Dependability

- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected and fixed by error correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique
  - Data and ECC are distributed so that the loss of a single memory chip can be tolerated and its contents reconstructed from the others

62 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics

- The address space of a process is divided into equal-sized pages (e.g., 4 KB in size)
- Main memory is divided into page frames that can hold one page each
- The size of a process' address space can be larger than main memory
- At any time, some pages of a process may not be in main memory; all pages are kept on the paging disk
- When a page that is not in memory is referenced, it must be brought from disk into memory. This is a page fault

63 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics

- If all page frames are occupied when a page fault occurs, a page has to be evicted from memory using a page replacement algorithm (PRA)
  - Least Recently Used (LRU) is an example of a PRA
- There is a page table per process, maintained by the operating system
- Translation from a virtual address to a physical address is done by the hardware (see the sketch below)
- A page fault invokes the operating system
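A minimal C sketch of a one-level translation, with a toy 16-entry page table and 8 KB pages as in the example later in this chapter; a clear valid bit stands in for a page fault:

#include <stdio.h>

#define PAGES 16                   /* toy address space */
#define OFFSET_BITS 13             /* 8 KB pages */

typedef struct { int valid; unsigned frame; } PTE;

/* Return the physical address, or -1 to signal a page fault. */
long translate(const PTE pt[PAGES], unsigned long va)
{
    unsigned vpn    = va >> OFFSET_BITS;
    unsigned offset = va & ((1u << OFFSET_BITS) - 1);
    if (!pt[vpn].valid)
        return -1;                 /* page fault: OS must load the page */
    return ((long)pt[vpn].frame << OFFSET_BITS) | offset;
}

int main(void)
{
    PTE pt[PAGES] = {{0}};
    pt[2].valid = 1;
    pt[2].frame = 3;               /* page 2 -> frame 3 */
    long pa = translate(pt, (2ul << OFFSET_BITS) | 0x34);
    printf("pa = %ld\n", pa);      /* offset 0x34 inside frame 3 */
    return 0;
}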

64 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics

[Figure: a 16-page process address space (pages A-P) on the paging disk, a 4-frame main memory, and the per-process page table. Pages C, P, N, and B currently occupy frames 0, 1, 2, and 3. A virtual address (VPN, page offset) is translated into a physical address (PFN, page offset) through the page table.]

65 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics

[Figure: translation example. The virtual address has VPN 2; the page table maps page 2 to frame 0, so the physical address is (frame 0, same page offset).]

66 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics

[Figure: translation example. The virtual address has VPN 14; page 14 has no valid page-table entry (it is not in main memory), so the reference causes a Page Fault!]

67 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Basics

[Figure: after the page fault is serviced, page 14 (O) is brought into frame 2, replacing N, and the page table is updated, so VPN 14 now translates to frame 2.]

68 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Example

A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.
- How many pages are there in the virtual address space?
- How many page frames are there in memory?
- How many bits are needed for the VPN field of an address?
- How many bits are needed for the page frame field of an address?
- How many bits are needed for the page offset field?

69 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Example

A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.

How many pages are there in the virtual address space?

64 × 2^30 / (8 × 2^10) = 8 × 2^20 = 2^23 pages

70 Copyright © 2013, D.A. Menasce All rights reserved.

Virtual Memory Example

A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.

How many page frames are there in memory?

4 × 2^30 / (8 × 2^10) = 2^32 / 2^13 = 2^19 page frames

71 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Example

A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.

How many bits are needed for the VPN field of an address?

2^23 pages => 23 bits

72 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Example

A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.

How many bits are needed for the page frame field of an address?

2^19 page frames => 19 bits

73 Copyright © 2013, D.A. Menasce. All rights reserved.

Virtual Memory Example

A computer system has 4 GB of main memory and a virtual address space equal to 64 GB. Each page is 8 KB long.

How many bits are needed for the page offset field?

If a word is 8 bytes long, then each page has (2^3 × 2^10 bytes) / (2^3 bytes/word) = 2^10 words
=> 10 bits are needed for the offset (the offset here addresses words; byte addressing would need 13 bits)

A sketch reproducing this example's arithmetic follows.
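A short C sketch reproducing the example's arithmetic:

#include <stdio.h>

int main(void)
{
    unsigned long long vspace = 64ULL << 30;   /* 64 GB virtual space */
    unsigned long long pmem   = 4ULL << 30;    /* 4 GB main memory    */
    unsigned long long page   = 8ULL << 10;    /* 8 KB pages          */
    unsigned long long pages  = vspace / page; /* 2^23 */
    unsigned long long frames = pmem / page;   /* 2^19 */
    int vpn_bits = 0, pfn_bits = 0;
    while ((1ULL << vpn_bits) < pages)  vpn_bits++;   /* 23 */
    while ((1ULL << pfn_bits) < frames) pfn_bits++;   /* 19 */
    printf("pages=%llu frames=%llu vpn=%d bits pfn=%d bits\n",
           pages, frames, vpn_bits, pfn_bits);
    return 0;
}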

74 Copyright © 2012, Elsevier Inc. All rights reserved.

Virtual Memory

- Protection via virtual memory
  - Keeps processes in their own memory space
- Role of the architecture:
  - Provide user mode and supervisor mode
  - Protect certain aspects of the CPU state
  - Provide mechanisms for switching between user mode and supervisor mode
  - Provide mechanisms to limit memory accesses
  - Provide a TLB to translate addresses

75 Copyright © 2012, Elsevier Inc. All rights reserved.

Virtual Machines

- Support isolation and security
- Allow sharing a computer among many unrelated users
- Enabled by the raw speed of processors, which makes the overhead more acceptable
- Allow different ISAs and operating systems to be presented to user programs
  - "System Virtual Machines"
  - SVM software is called a "virtual machine monitor" or "hypervisor"
  - Individual virtual machines run under the monitor are called "guest VMs"

76 Copyright © 2012, Elsevier Inc. All rights reserved.

Impact of VMs on Virtual Memory

- Each guest OS maintains its own set of page tables
- The VMM adds a level of memory between physical and virtual memory, called "real memory"
- The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
  - Requires the VMM to detect the guest's changes to its own page table
  - Occurs naturally if accessing the page table pointer is a privileged operation

