Source: harmanani.github.io/classes/csc320/Notes/ch05.pdf

COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface — ARM Edition

Chapter 5: Large and Fast: Exploiting Memory Hierarchy

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 2

§5.1 Introduction

Principle of Locality
- Programs access a small proportion of their address space at any time
- Temporal locality: items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality: items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data

Taking Advantage of Locality
- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory: main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory: cache memory attached to the CPU

Memory Hierarchy Levels
- Block (aka line): unit of copying; may be multiple words
- If accessed data is present in the upper level
  - Hit: access satisfied by the upper level
  - Hit ratio: hits/accesses
- If accessed data is absent
  - Miss: block copied from the lower level
    - Time taken: miss penalty
    - Miss ratio: misses/accesses = 1 − hit ratio
  - Then accessed data is supplied from the upper level

§5.2 Memory Technologies

Memory Technology
- Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB
- Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB
- Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB
- Ideal memory: access time of SRAM with the capacity and cost/GB of disk

DRAM Technology
- Data stored as a charge in a capacitor
- A single transistor is used to access the charge
- Must be refreshed periodically
  - Read contents and write back
  - Performed on a DRAM "row"

Advanced DRAM Organization
- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM: transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM: separate DDR inputs and outputs

DRAM Generations

[Chart: DRAM access times (Trac, Tcac, in ns) falling from '80 through '07]

Year | Capacity | $/GB
1980 | 64Kbit   | $1,500,000
1983 | 256Kbit  | $500,000
1985 | 1Mbit    | $200,000
1989 | 4Mbit    | $50,000
1992 | 16Mbit   | $15,000
1996 | 64Mbit   | $10,000
1998 | 128Mbit  | $4,000
2000 | 256Mbit  | $1,000
2004 | 512Mbit  | $250
2007 | 1Gbit    | $50

DRAM Performance Factors
- Row buffer: allows several words to be read and refreshed in parallel
- Synchronous DRAM: allows consecutive accesses in bursts without needing to send each address; improves bandwidth
- DRAM banking: allows simultaneous access to multiple DRAMs; improves bandwidth

Increasing Memory Bandwidth
- 4-word-wide memory
  - Miss penalty = 1 + 15 + 1 = 17 bus cycles
  - Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
- 4-bank interleaved memory
  - Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  - Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle

§6.4 Flash Storage

Flash Storage
- Nonvolatile semiconductor storage
- 100× – 1000× faster than disk
- Smaller, lower power, more robust
- But higher $/GB (between disk and DRAM)

Flash Types
- NOR flash: bit cell like a NOR gate
  - Random read/write access
  - Used for instruction memory in embedded systems
- NAND flash: bit cell like a NAND gate
  - Denser (bits/area), but block-at-a-time access
  - Cheaper per GB
  - Used for USB keys, media storage, …
- Flash bits wear out after 1000s of accesses
  - Not suitable as a direct RAM or disk replacement
  - Wear leveling: remap data to less-used blocks

§6.3 Disk Storage

Disk Storage
- Nonvolatile, rotating magnetic storage

Disk Sectors and Access
- Each sector records
  - Sector ID
  - Data (512 bytes; 4096 bytes proposed)
  - Error correcting code (ECC), used to hide defects and recording errors
  - Synchronization fields and gaps
- Access to a sector involves
  - Queuing delay if other accesses are pending
  - Seek: move the heads
  - Rotational latency
  - Data transfer
  - Controller overhead

Disk Access Example
- Given: 512B sector, 15,000 rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
- Average read time:
  4ms seek time
  + ½ / (15,000/60) = 2ms rotational latency
  + 512B / 100MB/s = 0.005ms transfer time
  + 0.2ms controller delay
  = 6.2ms
- If the actual average seek time is 1ms, average read time = 3.2ms

Disk Performance Issues
- Manufacturers quote average seek time based on all possible seeks
  - Locality and OS scheduling lead to smaller actual average seek times
- Smart disk controllers allocate physical sectors on disk
  - Present a logical sector interface to the host
  - SCSI, ATA, SATA
- Disk drives include caches
  - Prefetch sectors in anticipation of access
  - Avoid seek and rotational delay

§5.3 The Basics of Caches

Cache Memory
- Cache memory: the level of the memory hierarchy closest to the CPU
- Given accesses X1, …, Xn–1, Xn:
  - How do we know if the data is present?
  - Where do we look?

Direct Mapped Cache
- Location determined by address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)
  - #Blocks is a power of 2
  - Use the low-order address bits

Tags and Valid Bits
- How do we know which particular block is stored in a cache location?
  - Store the block address as well as the data
  - Actually, only the high-order bits are needed: called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0

Cache Example
- 8 blocks, 1 word/block, direct mapped
- Initial state:

Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | N |     |
011   | N |     |
100   | N |     |
101   | N |     |
110   | N |     |
111   | N |     |

Cache Example

Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | N |     |
011   | N |     |
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | Miss     | 110

Cache Example

Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | N |     |
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
26        | 11 010      | Miss     | 010

Cache Example

Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | N |     |
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | Hit      | 110
26        | 11 010      | Hit      | 010

Cache Example

Index | V | Tag | Data
000   | Y | 10  | Mem[10000]
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | Y | 00  | Mem[00011]
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
16        | 10 000      | Miss     | 000
3         | 00 011      | Miss     | 011
16        | 10 000      | Hit      | 000

Cache Example

Index | V | Tag | Data
000   | Y | 10  | Mem[10000]
001   | N |     |
010   | Y | 10  | Mem[10010]
011   | Y | 00  | Mem[00011]
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
18        | 10 010      | Miss     | 010

Address Subdivision (figure)

Example: Larger Block Size
- 64 blocks, 16 bytes/block
- To what block number does address 1200 map?
  - Block address = ⌊1200/16⌋ = 75
  - Block number = 75 modulo 64 = 11

Address fields: Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits), Offset = bits 3–0 (4 bits)

Block Size Considerations
- Larger blocks should reduce miss rate, due to spatial locality
- But in a fixed-size cache
  - Larger blocks ⇒ fewer of them: more competition ⇒ increased miss rate
  - Larger blocks ⇒ pollution
- Larger miss penalty
  - Can override the benefit of reduced miss rate
  - Early restart and critical-word-first can help

Cache Misses
- On a cache hit, the CPU proceeds normally
- On a cache miss
  - Stall the CPU pipeline
  - Fetch the block from the next level of the hierarchy
  - Instruction cache miss: restart instruction fetch
  - Data cache miss: complete the data access

Write-Through
- On a data-write hit, we could just update the block in the cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But this makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles: effective CPI = 1 + 0.1×100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately; only stalls on a write if the write buffer is already full

Write-Back
- Alternative: on a data-write hit, just update the block in the cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow the replacing block to be read first

Write Allocation
- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block

Example: Intrinsity FastMATH
- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

Example: Intrinsity FastMATH (cache organization figure)

Main Memory Supporting Caches
- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by a fixed-width clocked bus; bus clock is typically slower than the CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For a 4-word block and 1-word-wide DRAM
  - Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

§5.4 Measuring and Improving Cache Performance

Measuring Cache Performance
- Components of CPU time
  - Program execution cycles: includes cache hit time
  - Memory stall cycles: mainly from cache misses
- With simplifying assumptions:

  Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                      = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Loads & stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - The ideal-cache CPU is 5.44/2 = 2.72 times faster

Average Access Time
- Hit time is also important for performance
- Average memory access time (AMAT)
  - AMAT = Hit time + Miss rate × Miss penalty
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - AMAT = 1 + 0.05 × 20 = 2ns (2 cycles per instruction)

Performance Summary
- When CPU performance increases, the miss penalty becomes more significant
- Decreasing base CPI: a greater proportion of time is spent on memory stalls
- Increasing clock rate: memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

Associative Caches
- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines the set: (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)

Associative Cache Example (figure)

Spectrum of Associativity
- For a cache with 8 entries (figure)

Associativity Example
- Compare 4-block caches: direct mapped, 2-way set associative, fully associative
- Block access sequence: 0, 8, 0, 6, 8

Direct mapped:

Block addr | Cache index | Hit/miss | Cache content after access (blocks 0–3)
0          | 0           | miss     | Mem[0]
8          | 0           | miss     | Mem[8]
0          | 0           | miss     | Mem[0]
6          | 2           | miss     | Mem[0], Mem[6]
8          | 0           | miss     | Mem[8], Mem[6]

Associativity Example
2-way set associative:

Block addr | Cache index | Hit/miss | Cache content after access (Set 0)
0          | 0           | miss     | Mem[0]
8          | 0           | miss     | Mem[0], Mem[8]
0          | 0           | hit      | Mem[0], Mem[8]
6          | 0           | miss     | Mem[0], Mem[6]
8          | 0           | miss     | Mem[8], Mem[6]

Fully associative:

Block addr | Hit/miss | Cache content after access
0          | miss     | Mem[0]
8          | miss     | Mem[0], Mem[8]
0          | hit      | Mem[0], Mem[8]
6          | miss     | Mem[0], Mem[8], Mem[6]
8          | hit      | Mem[0], Mem[8], Mem[6]

How Much Associativity
- Increased associativity decreases miss rate, but with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000:
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

Set Associative Cache Organization (figure)

Replacement Policy
- Direct mapped: no choice
- Set associative
  - Prefer a non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
  - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity

Multilevel Caches
- Primary cache attached to CPU: small, but fast
- Level-2 cache services misses from the primary cache: larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include an L-3 cache

Multilevel Cache Example
- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns / 0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
- Now add an L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit: penalty = 5ns / 0.25ns = 20 cycles
- Primary miss with L-2 miss: extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9 / 3.4 = 2.6

Multilevel Cache Considerations
- Primary cache: focus on minimal hit time
- L-2 cache: focus on low miss rate to avoid main memory access; hit time has less overall impact
- Results
  - L-1 cache is usually smaller than a single cache would be
  - L-1 block size is smaller than L-2 block size

Interactions with Advanced CPUs
- Out-of-order CPUs can execute instructions during a cache miss
  - Pending store stays in the load/store unit
  - Dependent instructions wait in reservation stations
  - Independent instructions continue
- The effect of a miss depends on program data flow
  - Much harder to analyse
  - Use system simulation

Interactions with Software
- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access

Software Optimization via Blocking
- Goal: maximize accesses to data before it is replaced
- Consider the inner loops of DGEMM:

for (int j = 0; j < n; ++j)
{
    double cij = C[i+j*n];
    for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n];
    C[i+j*n] = cij;
}

DGEMM Access Pattern
- C, A, and B arrays (figure: older accesses vs. new accesses)

Cache Blocked DGEMM

#define BLOCKSIZE 32
void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C)
{
    for (int i = si; i < si+BLOCKSIZE; ++i)
        for (int j = sj; j < sj+BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];  /* cij = C[i][j] */
            for (int k = sk; k < sk+BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];  /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;  /* C[i][j] = cij */
        }
}

void dgemm (int n, double* A, double* B, double* C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

Blocked DGEMM Access Pattern
- (figure: unoptimized vs. blocked access patterns)

§5.5 Dependable Memory Hierarchy

Dependability
- Fault: failure of a component
  - May or may not lead to system failure
- Service accomplishment: service delivered as specified
- Service interruption: deviation from specified service
- Transitions between the two states: failure (accomplishment → interruption) and restoration (interruption → accomplishment)

Dependability Measures
- Reliability: mean time to failure (MTTF)
- Service interruption: mean time to repair (MTTR)
- Mean time between failures: MTBF = MTTF + MTTR
- Availability = MTTF / (MTTF + MTTR)
- Improving availability
  - Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  - Reduce MTTR: improved tools and processes for diagnosis and repair

The Hamming SEC Code
- Hamming distance: number of bits that are different between two bit patterns
- Minimum distance = 2 provides single-bit error detection
  - e.g., parity code
- Minimum distance = 3 provides single error correction, 2-bit error detection

Encoding SEC
- To calculate the Hamming code:
  - Number bits from 1 on the left
  - All bit positions that are a power of 2 are parity bits
  - Each parity bit checks certain data bits: parity bit p covers every bit position whose number has bit p set

Decoding SEC
- The value of the parity bits indicates which bit is in error
  - Use the numbering from the encoding procedure
  - Parity bits = 0000 indicates no error
  - Parity bits = 1010 indicates bit 10 was flipped

SEC/DED Code
- Add an additional parity bit for the whole word (pn)
- Makes the Hamming distance = 4
- Decoding:
  - Let H = the SEC parity bits
  - H even, pn even: no error
  - H odd, pn odd: correctable single-bit error
  - H even, pn odd: error in the pn bit itself
  - H odd, pn even: double error occurred
- Note: ECC DRAM uses SEC/DED with 8 bits protecting each 64 bits

§5.6 Virtual Machines

Virtual Machines
- Host computer emulates guest operating system and machine resources
  - Improved isolation of multiple guests
  - Avoids security and reliability problems
  - Aids sharing of resources
- Virtualization has some performance impact
  - Feasible with modern high-performance computers
- Examples
  - IBM VM/370 (1970s technology!)
  - VMWare
  - Microsoft Virtual PC

Virtual Machine Monitor
- Maps virtual resources to physical resources: memory, I/O devices, CPUs
- Guest code runs on the native machine in user mode
  - Traps to the VMM on privileged instructions and accesses to protected resources
- Guest OS may be different from the host OS
- VMM handles real I/O devices
  - Emulates generic virtual I/O devices for the guest

Example: Timer Virtualization
- In a native machine, on a timer interrupt
  - OS suspends the current process, handles the interrupt, selects and resumes the next process
- With a Virtual Machine Monitor
  - VMM suspends the current VM, handles the interrupt, selects and resumes the next VM
- If a VM requires timer interrupts
  - VMM emulates a virtual timer
  - Emulates an interrupt for the VM when a physical timer interrupt occurs

Instruction Set Support
- User and system modes
- Privileged instructions are available only in system mode
  - Trap to system if executed in user mode
- All physical resources are accessible only via privileged instructions
  - Including page tables, interrupt controls, I/O registers
- Renaissance of virtualization support
  - Current ISAs (e.g., x86) adapting

§5.7 Virtual Memory

Virtual Memory
- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault

Address Translation
- Fixed-size pages (e.g., 4KB)

Page Fault Penalty
- On a page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize the page fault rate
  - Fully associative placement
  - Smart replacement algorithms

Page Tables
- Stores placement information
  - An array of page table entries (PTEs), indexed by virtual page number
  - A page table register in the CPU points to the page table in physical memory
- If the page is present in memory
  - The PTE stores the physical page number
  - Plus other status bits (referenced, dirty, …)
- If the page is not present
  - The PTE can refer to a location in swap space on disk

Translation Using a Page Table (figure)

Mapping Pages to Storage (figure)

Replacement and Writes
- To reduce the page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in the PTE is set to 1 on access to the page
  - Periodically cleared to 0 by the OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Write a block at once, not individual locations
  - Write-through is impractical; use write-back
  - Dirty bit in the PTE is set when the page is written


Fast Translation Using a TLB
- Address translation would appear to require extra memory references
  - One to access the PTE
  - Then the actual memory access
- But access to page tables has good locality
  - So use a fast cache of PTEs within the CPU
  - Called a Translation Look-aside Buffer (TLB)
  - Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
  - Misses could be handled by hardware or software
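A toy software model of a small direct-mapped TLB in front of a flat page table can make the hit/miss behavior concrete. The sizes, names, and hit/miss counters below are assumptions for illustration, not a description of any particular hardware:

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

typedef struct { int valid; uint32_t vpn; uint32_t ppn; } TLBEntry;

static TLBEntry tlb[TLB_ENTRIES];
static unsigned tlb_hits, tlb_misses;

/* Look up a virtual address; on a TLB miss, fetch the PTE from the
   page table (here just an array of PPNs) and fill the TLB entry. */
uint32_t tlb_translate(const uint32_t *page_table, uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    TLBEntry *e = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped placement */
    if (e->valid && e->vpn == vpn) {
        tlb_hits++;                          /* hit: ~0.5-1 cycle in hardware */
    } else {
        tlb_misses++;                        /* miss: extra access for the PTE */
        e->valid = 1;
        e->vpn   = vpn;
        e->ppn   = page_table[vpn];
    }
    return (e->ppn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}
```

Because page accesses have good locality, repeated references to the same page hit in the TLB and avoid the page-table access entirely.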


Fast Translation Using a TLB


TLB Misses
- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction


TLB Miss Handler
- TLB miss indicates either
  - Page present, but PTE not in TLB
  - Page not present
- Must recognize TLB miss before destination register is overwritten
  - Raise exception
- Handler copies PTE from memory to TLB
  - Then restarts instruction
  - If page not present, page fault will occur


Page Fault Handler
- Use faulting virtual address to find PTE
- Locate page on disk
- Choose page to replace
  - If dirty, write to disk first
- Read page into memory and update page table
- Make process runnable again
  - Restart from faulting instruction
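The steps above can be sketched as a toy simulation: pick a victim whose reference bit is 0, write it back if dirty, then install the new page. All structures and sizes here are illustrative assumptions, not OS code:

```c
#include <assert.h>

#define NPAGES  8
#define NFRAMES 2

typedef struct { int present, dirty, ref, frame; } PTE2;

static PTE2 pt2[NPAGES];
static int  frame_owner[NFRAMES] = { -1, -1 };  /* which VPN holds each frame */
static int  disk_writes;                        /* counts write-backs */

static int choose_victim(void) {
    for (int f = 0; f < NFRAMES; f++)           /* prefer reference bit == 0 */
        if (!pt2[frame_owner[f]].ref)
            return f;
    return 0;                                   /* all referenced: fall back */
}

void page_fault(int vpn) {
    int f = -1;
    for (int i = 0; i < NFRAMES; i++)           /* any free frame? */
        if (frame_owner[i] < 0) { f = i; break; }
    if (f < 0) {                                /* must replace a page */
        f = choose_victim();
        if (pt2[frame_owner[f]].dirty)
            disk_writes++;                      /* write dirty victim to disk first */
        pt2[frame_owner[f]].present = 0;        /* evict the victim */
    }
    frame_owner[f] = vpn;                       /* "read page from disk" into frame f */
    pt2[vpn] = (PTE2){ .present = 1, .frame = f };
}
```

The dirty bit is what makes write-back work: a clean victim can simply be discarded, saving the multi-million-cycle disk write.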


TLB and Cache Interaction
- If cache tag uses physical address
  - Need to translate before cache lookup
- Alternative: use virtual address tag
  - Complications due to aliasing
    - Different virtual addresses for shared physical address


Memory Protection
- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)


§5.8 A Common Framework for Memory Hierarchies

The Memory Hierarchy

The BIG Picture
- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy


Block Placement
- Determined by associativity
  - Direct mapped (1-way associative): one choice for placement
  - n-way set associative: n choices within a set
  - Fully associative: any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time
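The three placement schemes differ only in how many sets the cache has, so the set index for a block address is one modulo operation. A minimal sketch (parameters are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Which set a block address maps to. For a cache with B blocks:
   direct mapped       -> num_sets == B  (one candidate location)
   n-way set assoc.    -> num_sets == B/n (n candidates within the set)
   fully associative   -> num_sets == 1  (any location is a candidate) */
static inline uint32_t set_index(uint32_t block_addr, uint32_t num_sets) {
    return block_addr % num_sets;
}
```

For an 8-block cache, block address 12 maps to set 4 when direct mapped, set 0 of four 2-way sets, and the single set of a fully associative cache.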


Finding a Block

Associativity         | Location method                         | Tag comparisons
----------------------|-----------------------------------------|----------------
Direct mapped         | Index                                   | 1
n-way set associative | Set index, then search entries in set   | n
Fully associative     | Search all entries                      | #entries
Fully associative     | Full lookup table                       | 0

- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate


Replacement
- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Random
    - Close to LRU, easier to implement
- Virtual memory
  - LRU approximation with hardware support


Write Policy
- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require a write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
- Virtual memory
  - Only write-back is feasible, given disk write latency


Sources of Misses
- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size


Cache Design Trade-offs

Design change          | Effect on miss rate          | Negative performance effect
-----------------------|------------------------------|----------------------------
Increase cache size    | Decreases capacity misses    | May increase access time
Increase associativity | Decreases conflict misses    | May increase access time
Increase block size    | Decreases compulsory misses  | Increases miss penalty; for very large block size, may increase miss rate due to pollution


§5.9 Using a Finite State Machine to Control a Simple Cache

Cache Control
- Example cache characteristics
  - Direct-mapped, write-back, write allocate
  - Block size: 4 words (16 bytes)
  - Cache size: 16 KB (1024 blocks)
  - 32-bit byte addresses
  - Valid bit and dirty bit per block
  - Blocking cache
    - CPU waits until access is complete
- Address fields: Tag = bits 31–14 (18 bits), Index = bits 13–4 (10 bits), Offset = bits 3–0 (4 bits)
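Extracting the three fields for this example cache is a pair of shifts and masks: 16-byte blocks give a 4-bit offset, 1024 blocks give a 10-bit index, and the remaining 18 bits are the tag. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for the example cache above:
   offset = bits 3-0, index = bits 13-4, tag = bits 31-14. */
static inline uint32_t cache_offset(uint32_t addr) { return addr & 0xF; }
static inline uint32_t cache_index (uint32_t addr) { return (addr >> 4) & 0x3FF; }
static inline uint32_t cache_tag   (uint32_t addr) { return addr >> 14; }
```

On a lookup, the index selects one of the 1024 blocks, the stored tag is compared against `cache_tag(addr)`, and the offset selects the word (and byte) within the 16-byte block.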


Interface Signals

CPU ↔ Cache: Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
Cache ↔ Memory: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready

Memory takes multiple cycles per access.


Finite State Machines
- Use an FSM to sequence control steps
- Set of states, transition on each clock edge
  - State values are binary encoded
  - Current state stored in a register
  - Next state = fn(current state, current inputs)
- Control output signals = fo(current state)


Cache Controller FSM
- Could partition into separate states to reduce clock cycle time
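A Moore-style skeleton of such a controller can be sketched in C, using the four states commonly used for a simple blocking cache (Idle, Compare Tag, Write-Back, Allocate); the exact states, input signals, and transition conditions below are illustrative assumptions, not the book's figure:

```c
#include <assert.h>

typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } State;

typedef struct { int cpu_request, hit, dirty, mem_ready; } Inputs;

/* fn: next state is a pure function of current state and inputs. */
State next_state(State s, Inputs in) {
    switch (s) {
    case IDLE:        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG: return in.hit ? IDLE                 /* hit: done */
                           : (in.dirty ? WRITE_BACK        /* miss, dirty victim */
                                       : ALLOCATE);        /* miss, clean victim */
    case WRITE_BACK:  return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:    return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}

/* fo: outputs depend only on the current state (Moore machine). */
int asserts_ready(State s) { return s == IDLE; }
```

In hardware the current state sits in a register and both functions are combinational logic; partitioning a slow state (such as Compare Tag) into several states shortens the critical path and thus the clock cycle.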


§5.10 Parallelism and Memory Hierarchies: Cache Coherence

Cache Coherence Problem
- Suppose two CPU cores share a physical address space
  - Write-through caches

Time step | Event               | CPU A's cache | CPU B's cache | Memory
----------|---------------------|---------------|---------------|-------
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1


Coherence Defined
- Informally: reads return the most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes)
    ⇒ read returns written value
  - P1 writes X; P2 reads X (sufficiently later)
    ⇒ read returns written value
    - c.f. CPU B reading X after step 3 in the example
  - P1 writes X, P2 writes X
    ⇒ all processors see the writes in the same order
    - End up with the same final value for X


Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth for shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory


Invalidating Snooping Protocols
- Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
    - Owning cache supplies updated value

CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory
--------------------|------------------|---------------|---------------|-------
                    |                  |               |               | 0
CPU A reads X       | Cache miss for X | 0             |               | 0
CPU B reads X       | Cache miss for X | 0             | 0             | 0
CPU A writes 1 to X | Invalidate for X | 1             |               | 0
CPU B reads X       | Cache miss for X | 1             | 1             | 1


Memory Consistency
- When are writes seen by other processors?
  - "Seen" means a read returns the written value
  - Cannot be instantaneous
- Assumptions
  - A write completes only when all processors have seen it
  - A processor does not reorder writes with other accesses
- Consequence
  - P writes X then writes Y
    ⇒ all processors that see new Y also see new X
  - Processors can reorder reads, but not writes


§5.13 The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

Multilevel On-Chip Caches


2-Level TLB Organization


Supporting Multiple Issue
- Both have multi-banked caches that allow multiple accesses per cycle, assuming no bank conflicts
- Core i7 cache optimizations
  - Return requested word first
  - Non-blocking cache
    - Hit under miss
    - Miss under miss
  - Data prefetching


DGEMM
- Combine cache blocking and subword parallelism


§5.14 Going Faster: Cache Blocking and Matrix Multiply
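A hedged sketch of the cache-blocking half of this optimization for DGEMM (C = C + A×B, column-major, n×n) is below, in the spirit of the book's Going Faster example. The block size and helper name are assumptions, and the subword (SIMD) parallelism half is omitted for clarity:

```c
#include <assert.h>

#define BLOCKSIZE 32

/* Multiply one BLOCKSIZE x BLOCKSIZE sub-block; the three sub-blocks
   fit in cache together, so each element is reused before eviction. */
static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE && i < n; ++i)
        for (int j = sj; j < sj + BLOCKSIZE && j < n; ++j) {
            double cij = C[i + j * n];
            for (int k = sk; k < sk + BLOCKSIZE && k < n; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```

Blocking turns a capacity-miss-bound kernel into one that streams through cache-sized tiles; adding subword parallelism then vectorizes the inner loop over those tiles.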


Pitfalls
- Byte vs. word addressing
  - Example: 32-byte direct-mapped cache, 4-byte blocks
    - Byte 36 maps to block 1
    - Word 36 maps to block 4
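The example above is just block-number arithmetic: the cache has 32/4 = 8 blocks, and a byte address must be divided by the 4-byte block size before taking the modulus, while a word address (with 4-byte words) is already a block address:

```c
#include <assert.h>

/* 32-byte direct-mapped cache, 4-byte blocks -> 8 blocks total. */
static inline int block_of_byte_addr(int byte_addr) {
    return (byte_addr / 4) % 8;    /* byte 36 is in block 36/4 = 9 -> block 1 */
}
static inline int block_of_word_addr(int word_addr) {
    return word_addr % 8;          /* word 36 -> block 4 */
}
```

Confusing the two address kinds silently maps accesses to the wrong block when reasoning about conflicts.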

- Ignoring memory system effects when writing or generating code
  - Example: iterating over rows vs. columns of arrays
  - Large strides result in poor locality
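The row-vs-column pitfall in C (row-major layout): both loops below compute the same sum, but the first touches consecutive addresses (unit stride, good spatial locality) while the second jumps a whole row per access (stride of COLS elements, one cache miss per element for large arrays). Array sizes are illustrative:

```c
#include <assert.h>

enum { ROWS = 64, COLS = 64 };

double sum_row_major(double a[ROWS][COLS]) {    /* good locality */
    double s = 0.0;
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            s += a[i][j];                       /* unit stride */
    return s;
}

double sum_col_major(double a[ROWS][COLS]) {    /* poor locality */
    double s = 0.0;
    for (int j = 0; j < COLS; ++j)
        for (int i = 0; i < ROWS; ++i)
            s += a[i][j];                       /* stride = COLS doubles */
    return s;
}
```

The results are identical; only the miss rate (and hence the running time on large arrays) differs.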

§5.15 Fallacies and Pitfalls


Pitfalls
- In a multiprocessor with shared L2 or L3 cache
  - Less associativity than cores results in conflict misses
  - More cores ⇒ need to increase associativity
- Using AMAT to evaluate performance of out-of-order processors
  - Ignores effect of non-blocked accesses
  - Instead, evaluate performance by simulation


Pitfalls
- Extending address range using segments
  - E.g., Intel 80286
  - But a segment is not always big enough
  - Makes address arithmetic complicated
- Implementing a VMM on an ISA not designed for virtualization
  - E.g., non-privileged instructions accessing hardware resources
  - Either extend the ISA, or require guest OS not to use problematic instructions


§5.16 Concluding Remarks

Concluding Remarks
- Fast memories are small, large memories are slow
  - We really want fast, large memories :(
  - Caching gives this illusion :)
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
- Memory system design is critical for multiprocessors

