
Large and Fast: Exploiting Memory Hierarchy

Presentation H

CSE 2431: Introduction to Operating Systems

Gojko Babić

04/06/2020

Memory Hierarchy

• A computer system contains a hierarchy of storage devices with different costs, capacities, and access times.

• With a memory hierarchy, a faster storage device at one level of the hierarchy acts as a staging area for a slower storage device at the next lower level.

• Well-written software takes advantage of the hierarchy, accessing the faster storage device at a particular level more frequently than the storage at the next level.

• A programmer who understands the memory hierarchy can write applications that perform better.


An Example of Memory Hierarchy

[Pyramid figure: smaller, faster, costlier per byte toward the top; larger, slower, cheaper per byte toward the bottom.]

L0: Registers: CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM): holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM): holds cache lines retrieved from main memory.
L3: Main memory (DRAM): holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks): local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).

The Levels in Memory Hierarchy

Technology:    (registers)   SRAM          DRAM        SSD           magnetic disk
Access time:   < 0.25 nsec   0.5-2.5 nsec  50-70 nsec  5-50 μsec     5-20 msec
Capacity:      10^3 bytes    10^6 bytes    10^9 bytes  10^11 bytes   10^12 bytes
Cost per GB:                 $500-$1,000   $10-$20     $0.75-$1.00   $0.05-$0.10
(in 2012)

• The higher the level, the smaller and faster the memory.
• Try to keep most of the action in the higher levels.
• Locality of reference is the most important program property that is exploited in many parts of the memory hierarchy.


Random Access Memory (RAM)

Features:
– Basic storage unit is a cell (one bit per cell).
– RAM is traditionally packaged as a chip; multiple chips form a memory.

Static RAM (SRAM):
– Each cell implemented with a six-transistor circuit.
– Relatively insensitive to disturbances such as electrical noise, radiation, etc.
– Faster and more expensive than DRAM.

Dynamic RAM (DRAM):
– Each bit stored as charge on a capacitor.
– Value must be refreshed every 10-100 ms.
– Sensitive to disturbances.
– Slower and cheaper than SRAM.

SRAM vs DRAM

       Transistors  Access  Needs     Sensitive?  Cost  Power         Power        Chip
       per bit      time    refresh?                    requirements  dissipation  density
SRAM   4 or 6       1X      No        No          100X  high          high         low
DRAM   1            10X     Yes       Yes         1X    low           low          high


Conventional DRAM Organization

• A (d × w)-bit DRAM chip is organized as d supercells of size w bits.

• Usually w = 8, i.e. one byte.

[Figure: a 16 x 8 DRAM chip: 4 rows x 4 columns of supercells, with supercell (2,1) highlighted; an internal row buffer; and a memory controller connected by 2 address bits and 8 data bits, carrying data and addresses to/from the CPU.]

Reading 16x8 DRAM Supercell (2,1)

• A read from address 9 corresponds to a read from supercell (2,1).

• Step 1: The row access strobe (RAS = 9 / 4 = 2, where 4 is the number of columns) selects row 2. Row 2 is copied from the DRAM array to the internal row buffer.

[Figure: the 16 x 8 DRAM chip with row 2 copied into the internal row buffer; the memory controller drives RAS over the 2-bit address lines.]


Reading 16x8 DRAM Supercell (2,1) (cont.)

• Step 2: The column access strobe (CAS = 9 % 4 = 1) selects column 1. Supercell (2,1) is copied from the row buffer to the data lines and sent to the CPU as the byte with address 9.

[Figure: the 16 x 8 DRAM chip with supercell (2,1) leaving the internal row buffer over the 8-bit data lines, through the memory controller, to the CPU.]

• Step 3: Since a read is destructive, the internal row buffer has to be written back into the corresponding row of the DRAM array after each read.
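The RAS/CAS arithmetic above is easy to check in code. A minimal sketch (not from the slides) for the 16 x 8 example chip, assuming the controller derives the row and column from a linear supercell address exactly as in the example:

```c
#include <stdio.h>

#define NCOLS 4   /* the 16 x 8 example chip has 4 columns of supercells */

/* Split a linear supercell address into the RAS/CAS pair the
   memory controller drives, as in "read address 9" above. */
static void split_address(int addr, int *ras, int *cas)
{
    *ras = addr / NCOLS;   /* row access strobe:    9 / 4 = 2 */
    *cas = addr % NCOLS;   /* column access strobe: 9 % 4 = 1 */
}

int main(void)
{
    int ras, cas;
    split_address(9, &ras, &cas);
    printf("address 9 -> supercell (%d,%d)\n", ras, cas);   /* (2,1) */
    return 0;
}
```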

Conventional 4Gbit DRAM Organization

• A 4Gbit DRAM can be organized as 512M supercells of size 8 bits.

• 512M = 2^9 × 2^20 = 2^14 × 2^15 = 16,384 × 32,768

[Figure: a 512M x 8 DRAM chip: 16,384 rows x 32,768 columns of supercells, an internal row buffer, and a memory controller connected by 15 address bits and 8 data bits to/from the CPU.]


Writing DRAM

• The CPU provides the address and the data to be written.

• Step 1 of a write is identical to step 1 of a read.

• Step 2 of a write updates the appropriate cell in the internal row buffer with the data received from the CPU.

• Step 3 of a write is identical to step 3 of a read.

4GB Memory out of Eight 512Mx8 DRAMs

[Figure: the memory controller broadcasts addr (row = i, col = j) to eight chips, DRAM 0 through DRAM 7; each chip supplies its supercell (i,j). The eight bytes form the 64-bit doubleword at main memory address A: DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63.]
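To make the byte-to-chip mapping concrete, here is a small illustrative sketch (not from the slides): a stand-in dram_read() models one chip access, and read_doubleword() assembles the 64-bit doubleword with chip k supplying bits 8k to 8k+7:

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the eight 512M x 8 chips: chip k holds byte k of
   every doubleword (a tiny array here, just for illustration). */
static uint8_t chips[8][16];

static uint8_t dram_read(int chip, uint32_t supercell_addr)
{
    return chips[chip][supercell_addr];   /* one RAS/CAS access */
}

/* Assemble the 64-bit doubleword at supercell address A:
   chip k supplies bits 8k to 8k+7, as in the figure above. */
static uint64_t read_doubleword(uint32_t addr)
{
    uint64_t dw = 0;
    for (int k = 0; k < 8; k++)
        dw |= (uint64_t)dram_read(k, addr) << (8 * k);
    return dw;
}

int main(void)
{
    for (int k = 0; k < 8; k++)
        chips[k][0] = (uint8_t)(0x10 + k);
    printf("%016llx\n", (unsigned long long)read_doubleword(0));
    return 0;
}
```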


Enhanced DRAMs

• Enhanced DRAMs have optimizations that improve the speed with which the basic DRAM cells can be accessed.

• Examples:
– Fast page mode DRAM (FPM DRAM); up to 1996
– Extended data out DRAM (EDO DRAM); 1996-99
– Synchronous DRAM (SDRAM)
– Double Data-Rate Synchronous DRAM (DDR SDRAM)
– Rambus DRAM (RDRAM)

Growth of Capacity per DRAM chip


Prices of Six Generations of DRAMs

• DRAM size increased by multiples of four approximately once every three years until 1996, and thereafter considerably more slowly.

• The improvements in access time have been slower but continuous, and cost roughly tracks density improvements, although cost is often affected by other issues, such as availability and demand.

• The cost per gigabyte (10^9 bytes) is not adjusted for inflation.

• Source: Patterson & Hennessy, "Computer Organization & Design", 5th Edition.


Nonvolatile Memory

• Information is retained if the supply voltage is turned off.

• Referred to as read-only memories (ROM), although some may be written to as well as read.

• Read-only memory (ROM): programmed during production.

• Programmable ROM (PROM): a fuse associated with each cell is blown once by zapping it with current; can be programmed once.

• Erasable PROM (EPROM): cells are cleared by shining ultraviolet light, and a special device is used to write 1's; can be erased and reprogrammed about 1,000 times.

• Electrically erasable PROM (EEPROM): similar to EPROM, but does not require a physically separate programming device and can be reprogrammed in place on printed circuit cards; can be reprogrammed about 100,000 times.

• Flash memory: based on EEPROM technology.

Solid State Disk (SSD)

• An SSD package plugs into a standard disk slot on the I/O bus (typically USB or SATA) and behaves like a disk, reading and writing logical blocks.

• It consists of one or more flash memory chips and a flash translation layer (a hardware/firmware device) that plays the same role as a disk controller.

• Advantages of SSDs over rotating disks:
– no moving parts: semiconductor memory is more rugged,
– much faster random access times,
– use less power.

• Disadvantages of SSDs over rotating disks:
– SSDs wear out with usage,
– 10-20 times more expensive than disks.


Basic (Single CPU) Computer Structure

• The CPU and device controllers connect through a common bus providing access to shared memory.

Solid State Disk (SSD) (cont.)

• Pages: 512B to 4KB; blocks: 32 to 128 pages.
• Data is read in units of pages.
• A page can be written only after its block has been erased.
• A block wears out after 100,000 repeated writes.

[Figure: requests to read and write logical disk blocks arrive over the I/O bus at the flash translation layer, which maps them onto flash memory organized as blocks 0 to B-1, each containing pages 0 to P-1.]


Moving-Head Disk Mechanism

• A sector (usually 512 bytes) is the basic unit of transfer (read/write).

SSD Characteristics

Sequential read throughput   250 MB/s     Sequential write throughput   170 MB/s
Random read throughput       140 MB/s     Random write throughput        14 MB/s
Random read access            30 μs       Random write access           300 μs


Growth in Microprocessor Performance

Growth in Performance of RAM & CPU

• The mismatch between CPU performance growth and memory performance growth creates the "memory wall".

• Hence the importance of caches.


CPU Trends: "Power Wall"

                           1980   1990   1995     2000   2003   2005    2010     2010:1980
CPU                        8080   386    Pentium  P-III  P-4    Core 2  Core i7  ---
Clock rate (MHz)           1      20     150      600    3300   2000    2500     2500
Cycle time (ns)            1000   50     6        1.6    0.3    0.50    0.4      2500
Cores                      1      1      1        1      1      2       4        4
Effective cycle time (ns)  1000   50     6        1.6    0.3    0.25    0.1      10,000

This was the inflection point in computer history when designers hit the "Power Wall".

• At about the same time, besides the "memory wall" and the "power wall", processor designers also reached limits in exploiting instruction-level parallelism (ILP) in (sequential) programs.

• Since the early 2000s processors have not been getting (significantly) faster. Instead: multi-core processors.

Principle of Locality of Reference

• Programs access a small proportion of their address space atany time and they tend to reuse instructions and data they haveused recently.

– temporal locality – recently accessed items are likely to be accessed soon,

– spatial locality – items near those accessed recently are likely to be accessed soon, i.e. items near one another tend to bereferenced close together in time.

• An implication of principle of locality is that we can predict withreasonable accuracy what instructions and data a program willuse in near future based on its accesses in the recent past.

• Principle of locality applies more strongly to code accesses thandata accesses.

g. babic
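A classic way to see locality in practice (an illustration, not from the slides): both loops below compute the same sum, but the row-major walk has far better spatial locality than the column-major one, because C stores arrays row by row:

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Good spatial locality: C stores rows contiguously, so this walk
   touches consecutive bytes and each cache block is fully used
   before the next one is fetched. */
static double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: a stride of N * sizeof(double) bytes, so
   each access typically lands in a different cache block. */
static double sum_col_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```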


Taking Advantage of Locality

• Use a memory hierarchy.

• Store everything on disk.

• Copy recently accessed (and nearby) items from disk to the smaller DRAM memory:
– main memory (virtual memory).

• Copy recently accessed (and nearby) items from DRAM memory to the smaller SRAM memory:
– cache attached to the CPU.

Typical System Organization Without Cache

[Figure: on the CPU chip, a register file and ALU connect through a bus interface to the system bus; an I/O bridge links the system bus to the memory bus (main memory) and to the I/O bus, which carries a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]


Typical Processor Organization with Cache

[Figure: the same organization, but cache memories now sit on the CPU chip between the register file/ALU and the bus interface to the system bus, memory bus and main memory.]

• Cache is fast, but because of that it has to be small. Why?

Model of Memory + Cache + CPU System

[Figure: the CPU exchanges data with the cache, and the cache exchanges blocks with main memory.]


Basics of Cache Operation

• We will first (and mostly) consider a cache read operation, since it is more important. Why?

• When the CPU provides an address requesting the content of a given main memory location:
– first check the cache for the content; caches include a tag in each cache entry to identify the memory address of a block,
– if the block is present, this is a hit: get the content (fast) and the CPU proceeds normally, without (or with small) delay,
– if the block is not present, this is a miss: stall the CPU and read the block with the required content from main memory,
– long CPU slowdown: the miss penalty is the time to access main memory and to place the block into the cache,
– and now (after the miss penalty) the CPU gets the required content.

Tags and Valid Bits

• How do we know what block is stored in a given cache location?
– store the block address as well as the data in a cache entry,
– actually, only the high-order bits of the block address are needed; these are called the tag.

• What if there is no data in a location?
– introduce a valid bit in each entry,
– valid bit: 1 = present, 0 = not present,
– initially 0.


Cache Example

• The CPU generates a 4-byte word read at address 100 (16-bit addresses assumed), i.e. a read for the contents of the bytes at addresses 100-103.

• Since the cache is empty: cache miss.

[Figure: CPU, cache (SRAM) and main memory in GB (DRAM) connected by a bus; addresses flow from the CPU to the cache to memory, 4B data returns to the CPU, and 16B blocks move between memory and cache. Each of the four cache entries holds a v/i bit, a 10-bit tag, and 4 x 4B = 16B of data; all entries start invalid (0).]

Cache after 4B Read at Address 100

1. A block of 16 bytes is read from RAM and stored in the cache.
2. The cache entry is chosen according to the cache placement algorithm (a direct mapped cache is assumed).

[Figure: the cache now holds one valid entry (at index 2): v = 1, tag = 0000000001, data = [108-111] [104-107] [100-103] [96-99]; the other entries remain invalid.]


Cache after 4B Read at Address 204

1. A block of 16 bytes is read from RAM and stored in the cache.
2. The cache entry is chosen according to the cache placement algorithm (a direct mapped cache is assumed).

[Figure: the cache now holds two valid entries: v = 1, tag = 0000000011, data = [204-207] [200-203] [196-199] [192-195], and v = 1, tag = 0000000001, data = [108-111] [104-107] [100-103] [96-99].]

Cache & DRAM Memory: Performance

[Figure: average access time falls from t1 + t2 at hit ratio 0 to t1 at hit ratio 1.]

• t2: main memory access time (e.g. 50 nsec)
• t1: cache access time (e.g. 2 nsec)
• The hit ratio is the ratio of the number of hits to the total number of memory accesses; miss ratio = 1 - hit ratio.
• Because of locality of reference, hit ratios are normally well over 90%.

Average access time = cache access time + (1 - hit ratio) × miss penalty
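The formula translates directly into code. A minimal sketch (not from the slides) using the example numbers above, t1 = 2 nsec and a 50 nsec miss penalty, plus an assumed 95% hit ratio:

```c
#include <stdio.h>

/* Average access time = cache access time + (1 - hit ratio) x miss penalty */
static double amat(double t_cache, double hit_ratio, double miss_penalty)
{
    return t_cache + (1.0 - hit_ratio) * miss_penalty;
}

int main(void)
{
    /* t1 = 2 ns cache, 50 ns miss penalty, assumed 95% hit ratio */
    printf("average access time = %.2f ns\n", amat(2.0, 0.95, 50.0)); /* 4.50 */
    return 0;
}
```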


Direct Mapped Cache

• In a direct mapped cache, a block from memory has only one cache entry where it can be stored.

• The location of the cache entry is determined by the block address and the number of entries in the cache:
– location = (block address) mod (number of entries in cache)

• The block address includes all address bits excluding the block offset bits, i.e. the rightmost n bits, where 2^n = the number of bytes in a block.

• Example: 16-bit addresses and block size = 8 bytes:
– address 0x1234 = 0001 0010 0011 0100₂ has block address 0x0246 = 0 0010 0100 0110₂, since the block offset is 3 bits (2^3 = 8);
– if the cache has 16 entries, the location is 0110, i.e. the 4 rightmost bits of the block address (since 2^4 = 16); that is the index.
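The tag/index/offset split from the example can be computed with shifts and masks. A sketch (not from the slides) for 8-byte blocks and a 16-entry cache, applied to address 0x1234:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 3   /* 2^3 = 8-byte blocks */
#define INDEX_BITS  4   /* 2^4 = 16 cache entries */

int main(void)
{
    uint16_t addr  = 0x1234;
    uint16_t block = addr >> OFFSET_BITS;              /* block address */
    uint16_t index = block & ((1u << INDEX_BITS) - 1); /* low 4 bits */
    uint16_t tag   = block >> INDEX_BITS;              /* remaining bits */

    printf("block address = 0x%04x\n", block);   /* 0x0246 */
    printf("index         = %u\n", index);       /* 6 = 0110 */
    printf("tag           = 0x%03x\n", tag);
    return 0;
}
```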

Direct Mapped Cache Example

• Assume a processor with 10-bit addresses and a direct mapped cache with 8 entries and 4-byte blocks.
• 4-byte blocks = 2^2 = 2^(block offset) ⇒ block offset = 2 bits
• block address = 10 - 2 = 8 bits
• 8 entries = 2^3 = 2^index ⇒ index = 3 bits
• Address format = 5 bits (tag) + 3 bits (index) + 2 bits (block offset)

Each entry: V | tag = 5 bits | data = 32 bits = 4 bytes. Initially all 8 entries are invalid (V = N).


Cache Example: Access 1

Word address   Binary address   Hit/miss   Cache entry
88             00010 110 00     Miss       110

Index   V   Tag     Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y   00010   Mem[88-91]
111     N

Cache Example: Access 2

Word address   Binary address   Hit/miss   Cache entry
104            00011 010 00     Miss       010

Index   V   Tag     Data
000     N
001     N
010     Y   00011   Mem[104-107]
011     N
100     N
101     N
110     Y   00010   Mem[88-91]
111     N


Cache Example: Accesses 3, 4

Word address   Binary address   Hit/miss   Cache entry
88             00010 110 00     Hit        110
104            00011 010 00     Hit        010

Index   V   Tag     Data
000     N
001     N
010     Y   00011   Mem[104-107]
011     N
100     N
101     N
110     Y   00010   Mem[88-91]
111     N

Cache Example: Accesses 5, 6, 7

Word address   Binary address   Hit/miss   Cache entry
64             00010 000 00     Miss       000
12             00000 011 00     Miss       011
64             00010 000 00     Hit        000

Index   V   Tag     Data
000     Y   00010   Mem[64-67]
001     N
010     Y   00011   Mem[104-107]
011     Y   00000   Mem[12-15]
100     N
101     N
110     Y   00010   Mem[88-91]
111     N


Cache Example: Access 8

Word address   Binary address   Hit/miss   Cache entry
72             00010 010 00     Miss       010

Index   V   Tag     Data
000     Y   00010   Mem[64-67]
001     N
010     Y   00010   Mem[72-75]
011     Y   00000   Mem[12-15]
100     N
101     N
110     Y   00010   Mem[88-91]
111     N
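The whole eight-access example can be replayed with a few lines of code. A tag-only simulator sketch (not from the slides) of the 8-entry direct mapped cache with 4-byte blocks; it reproduces the hit/miss pattern of accesses 1-8:

```c
#include <stdio.h>

/* The cache from the example: 8 entries, 4-byte blocks, 10-bit
   addresses -> 2-bit offset, 3-bit index, 5-bit tag. Tags only,
   no data, since we only want the hit/miss behavior. */
struct entry { int valid; unsigned tag; };

int main(void)
{
    static struct entry cache[8];
    unsigned trace[] = {88, 104, 88, 104, 64, 12, 64, 72};

    for (int i = 0; i < 8; i++) {
        unsigned block = trace[i] >> 2;   /* drop the 2-bit block offset */
        unsigned index = block & 0x7;     /* low 3 bits of block address */
        unsigned tag   = block >> 3;      /* remaining 5 bits */
        int hit = cache[index].valid && cache[index].tag == tag;

        printf("%3u -> index %u: %s\n", trace[i], index, hit ? "hit" : "miss");
        if (!hit) {                       /* place (or replace) the block */
            cache[index].valid = 1;
            cache[index].tag   = tag;
        }
    }
    return 0;
}
```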

Direct Mapped Cache

• Direct mapped cache: only one choice for the cache entry.

• The location of the cache entry is determined by the block address:
– location = (block address) mod (number of entries in cache)

• Use the low-order address bits as the location for the cache entry.

[Figure: blocks of main memory mapping onto the (smaller) cache.]


Direct Mapping Cache: 1 × 4-byte Blocks

[Figure: the address (bit positions 31-0) splits into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2) and a 2-bit byte offset (bits 1-0); the index selects one of 16K entries, each holding a valid bit, a 16-bit tag and 32 bits of data; a tag comparator produces Hit and the entry supplies Data.]

• Block offset = 2 bits, since 2^2 = 4 bytes. Here block = word (data), so block offset = byte offset and word offset = 0.
• Index = 14 bits, since 2^14 = 16K cache entries.

Cache after Addresses 100 and 204

• address 100₁₀ = 0000000000000000 00000000011001 00₂
block address = 0000000000000000 00000000011001₂
The 4-byte block [100-103] will be stored in cache entry 25 = 00000000011001₂.
Note: the bytes with addresses 100-103 have the same block address.

• address 204₁₀ = 0000000000000000 00000000110011 00₂
block address = 0000000000000000 00000000110011₂
The 4-byte block [204-207] will be stored in cache entry 51 = 00000000110011₂.
Note: the bytes with addresses 204-207 have the same block address.


Direct Mapping Cache: 4 × 4-byte Blocks

[Figure: the address (bit positions 31-0) splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit word offset (bits 3-2) and a 2-bit byte offset (bits 1-0); the index selects one of 4K entries, each holding a valid bit, a 16-bit tag and 128 bits of data; a multiplexer driven by the word offset selects one of the four 32-bit words, and a tag comparator produces Hit.]

• Block offset = 4 bits (2^4 = 16 bytes); index = 12 bits (2^12 = 4K entries).
• Since data is 4-byte words, the 4-bit block offset = 2 bits of word offset + 2 bits of byte offset (4 = 2^2 ⇒ 2-bit byte offset).

Cache after Addresses 100 and 204

• address 100₁₀ = 0000000000000000 000000000110 0100₂
block address = 0000000000000000 000000000110₂
The 16-byte block [96-111] will be stored in cache entry 6 = 000000000110₂.
Note: the bytes with addresses 96-111 have the same block address.

• address 204₁₀ = 0000000000000000 000000001100 1100₂
block address = 0000000000000000 000000001100₂
The 16-byte block [192-207] will be stored in cache entry 12 = 000000001100₂.
Note: the bytes with addresses 192-207 have the same block address.


Block Size Consideration

Program   Block size (bytes)   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       4                    6.1%                    2.1%             5.4%
gcc       16                   2.0%                    1.7%             1.9%
spice     4                    1.2%                    1.3%             1.2%
spice     16                   0.3%                    0.6%             0.4%

• The cache with 4-byte blocks had 16K entries and the cache with 16-byte blocks had 4K entries, 64KB of data in each case. Thus both caches can accommodate the same amount of data.

• Larger blocks reduce the miss rate due to spatial locality.
• But for a fixed-sized cache: larger blocks ⇒ fewer of them ⇒ more competition ⇒ possibly an increased miss rate.

• A larger miss penalty may override the benefit of a reduced miss rate.
• Thus keep in mind that the miss rate is not the only parameter:
Average access time = cache access time + miss rate × miss penalty

Main Memory Supporting Caches

• DRAM memory has the width of its read/write operations determined by the width of its bus data lines.

• Numbers used in the examples that follow for reading from main memory:
– 2 nsec for address transfer from CPU to memory controller (mostly propagation delay),
– 50 nsec per DRAM access,
– 2 nsec per data transfer from memory controller to cache (mostly propagation delay).


4-Byte Main Memory Bus

[Figure: one-word-wide memory organization: CPU, cache and memory connected by a 4-byte-wide bus.]

• For a 16-byte block and a 4-byte-wide DRAM bus:
Miss penalty = 2 + 4×50 + 4×2 = 210 nsec
Bandwidth = 16 bytes / 210 nsec ≈ 0.08 bytes/nsec

• Although caches want low-latency memory, it is generally easier to improve memory bandwidth with a new memory organization than it is to reduce memory latency.

16-Byte (Wider) Main Memory Bus

[Figure: CPU, multiplexer, cache and memory connected by a 16-byte-wide bus.]

• The CPU accesses a word at a time, so a multiplexer is needed.

• For a 16-byte block and a 16-byte-wide DRAM bus:
Miss penalty = 2 + 50 + 2 = 54 nsec
Bandwidth = 16 bytes / 54 nsec ≈ 0.30 bytes/nsec


Wider Main Memory Bus & Level-2 Cache

[Figure: CPU, cache, bus, multiplexer, memory.]

• But the multiplexer is on the critical timing path.

• A level-2 cache helps, since the multiplexing is now between the level-1 and level-2 caches and not on the critical timing path.

Interleaved Memory Organization

[Figure: CPU and cache connected by a 4-byte bus to four memory banks, bank 0 through bank 3.]

• This is four-way interleaved memory with a 4-byte memory bus; it saves wires in the bus.
• Miss penalty = 2 + 50 + 4×2 = 60 nsec
• Throughput = 16 bytes / 60 nsec ≈ 0.27 bytes/nsec
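The three miss-penalty calculations side by side, as a small sketch (not from the slides) using the assumed timing parameters (2 nsec address transfer, 50 nsec per DRAM access, 2 nsec per bus transfer, 16-byte blocks):

```c
#include <stdio.h>

int main(void)
{
    /* 16-byte block; 2 ns address transfer, 50 ns per DRAM access,
       2 ns per 4-byte bus transfer (the 16-byte bus moves the whole
       block in one transfer). */
    double one_word    = 2 + 4*50 + 4*2;   /* 4-byte bus:          210 ns */
    double wide        = 2 +   50 +   2;   /* 16-byte bus:          54 ns */
    double interleaved = 2 +   50 + 4*2;   /* 4 banks, 4-byte bus:  60 ns */

    printf("one-word:    %3.0f ns  %.2f bytes/ns\n", one_word,    16 / one_word);
    printf("wide:        %3.0f ns  %.2f bytes/ns\n", wide,        16 / wide);
    printf("interleaved: %3.0f ns  %.2f bytes/ns\n", interleaved, 16 / interleaved);
    return 0;
}
```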


Multilevel Caches

• Level-1 (primary) cache attached to the CPU:
– small, but very fast.

• Level-2 (secondary) cache services misses from the level-1 cache:
– larger, slower, but still faster than main memory.

• Main memory services level-2 cache misses.

• Some high-end systems include a level-3 cache.

• Level-1 cache: focus on minimal hit time.

• Level-2 cache:
– focus on a low miss rate to avoid main memory accesses,
– hit time has less overall impact.
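The level-1/level-2 focus split falls out of the two-level version of the average-access-time formula. A sketch with assumed example numbers (none of these values are from the slides):

```c
#include <stdio.h>

/* Two-level version of the average-access-time formula:
   AMAT = L1 hit time + L1 miss rate x (L2 hit time + L2 miss rate x memory penalty)
   All numbers below are assumed, for illustration only. */
int main(void)
{
    double l1_hit  = 1.0;    /* ns */
    double l1_miss = 0.05;   /* fraction of all accesses */
    double l2_hit  = 5.0;    /* ns */
    double l2_miss = 0.20;   /* fraction of L1 misses */
    double mem     = 60.0;   /* main-memory penalty, ns */

    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);
    printf("AMAT = %.2f ns\n", amat);   /* 1 + 0.05 * (5 + 12) = 1.85 */
    return 0;
}
```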

Multilevel Caches (cont.)

• Increased logic density ⇒ on-chip cache with the processor:
– internal cache: level 1,
– internal cache: level 2,
– external or internal cache: level 3.

• Unified cache, i.e. one cache for both instructions and data:
– balances the load between instruction and data fetches,
– only one cache needs to be designed and implemented.

• Split cache:
– data cache (d-cache) and instruction cache (i-cache),
– essential for pipelined and other advanced architectures.

• Level-1 caches are normally split caches.


Associative Caches

• In addition to direct mapped, there are two more cache organizations.

• Fully associative caches:
– allow a given block to go in any cache entry,
– require all entries to be searched at once,
– tag comparator per entry (expensive),
– note: no index, thus the tag equals the block address.

• n-way set associative caches:
– each set contains n entries (blocks),
– the block number determines which set in the cache:
set number = (block number) mod (number of sets)
– search only the n entries in a given set (at once),
– n comparators (less expensive).

Illustration of Cache Organizations

[Figure: block placement under different cache organizations, including 2-way set associative.]


4-Way Set Associative Cache with 4-byte Blocks

[Figure: the address (bits 31-0) splits into a 22-bit tag, an 8-bit index (bits 9-2) and a 2-bit block offset; the index selects one of 256 sets; each set holds four (V, tag, data) entries whose tags are compared in parallel, and a 4-to-1 multiplexor delivers the hit data.]

• A 4-way set associative cache is built out of 4 direct-mapped caches.
• There are 256 sets; the index selects a set.

Set Associative Cache: Example

• Consider a 2-way set associative cache with 4 sets and 8-byte blocks. Assume 16-bit addresses.

a. Provide the address format:
– 8-byte blocks ⇒ 3-bit block offset; 4 sets ⇒ 2-bit index;
– 4-byte data (words) ⇒ 2-bit byte offset;
– 3-bit block offset = 1-bit word offset + 2-bit byte offset;
– 16-bit address = 11-bit tag + 2-bit index + 1-bit word offset + 2-bit byte offset.

b. Provide the cache layout; each of the 4 sets has two ways:

v   tag (11 bits)   data 4B   data 4B   |   v   tag (11 bits)   data 4B   data 4B

c. For the address sequence 8, 0, 52, 20, 56, 16, 24, 116, 20, 8, 16, indicate hit/miss and the content of the cache, initially empty.


Set Associative Cache: Example (cont.)

For each access: block address = address div 8; set = (block address) mod 4; tag = (block address) div 4. Each way holds v, tag, and two 4-byte words (word offset 1, word offset 0).

 8    miss:  set 1 ← v=1  tag=0  [12-15] [8-11]
 0    miss:  set 0 ← v=1  tag=0  [4-7] [0-3]
52    miss:  set 2 ← v=1  tag=1  [52-55] [48-51]
20    miss:  set 2, second way ← v=1  tag=0  [20-23] [16-19]
56    miss:  set 3 ← v=1  tag=1  [60-63] [56-59]
16    hit:   set 2, tag 0
24    miss:  set 3, second way ← v=1  tag=0  [28-31] [24-27]
116   miss:  set 2 is full; LRU replacement evicts tag 1 ([52-55] [48-51]); set 2 ← v=1  tag=3  [116-119] [112-115]
20    hit  = 00000000000 10 1 00 (tag, index, word offset, byte offset)
 8    hit  = 00000000000 01 0 00
16    hit  = 00000000000 10 0 00

Final cache content (way 0 | way 1, each as v, tag, word 1, word 0):

set 0:   1  0  [4-7] [0-3]           |  invalid
set 1:   1  0  [12-15] [8-11]        |  invalid
set 2:   1  3  [116-119] [112-115]   |  1  0  [20-23] [16-19]
set 3:   1  1  [60-63] [56-59]       |  1  0  [28-31] [24-27]
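The same trace can be replayed in code. A tag-only simulator sketch (not from the slides) of the 2-way set associative cache with 4 sets, 8-byte blocks and LRU replacement; it reproduces the hit/miss pattern above, including the LRU replacement at address 116:

```c
#include <stdio.h>

/* The cache from the example: 2-way set associative, 4 sets,
   8-byte blocks, LRU replacement. Tags only, no data. */
struct way { int valid; unsigned tag; unsigned last; };

int main(void)
{
    static struct way cache[4][2];
    unsigned trace[] = {8, 0, 52, 20, 56, 16, 24, 116, 20, 8, 16};

    for (unsigned t = 0; t < 11; t++) {
        unsigned block = trace[t] >> 3;   /* drop the 3-bit block offset */
        unsigned set   = block & 0x3;     /* 2-bit index */
        unsigned tag   = block >> 2;      /* 11-bit tag in the example */
        struct way *s  = cache[set];

        int hit = -1;
        for (int w = 0; w < 2; w++)
            if (s[w].valid && s[w].tag == tag)
                hit = w;

        if (hit >= 0) {
            s[hit].last = t;              /* refresh the LRU timestamp */
            printf("%3u -> set %u: hit\n", trace[t], set);
        } else {
            /* fill an invalid way if any, otherwise evict the LRU way */
            int v = !s[0].valid ? 0 : !s[1].valid ? 1
                  : (s[0].last < s[1].last ? 0 : 1);
            s[v] = (struct way){ 1, tag, t };
            printf("%3u -> set %u: miss\n", trace[t], set);
        }
    }
    return 0;
}
```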


Comparing Methods for Finding a Block

• Reduce the number of comparisons to reduce cost.

Associativity             Location method                             Tag comparisons
Direct mapped             Index the entry                             1
n-way set associative     Index the set, then search the n entries    n
                          within the set
Fully associative         Search all entries                          Number of entries

Replacement Policy

• Refers to the situation when, on a miss, there is no room for the new block and one of the existing blocks has to be removed from the cache.

• Direct mapped cache: no choice.

• N-way set associative cache:
– choose among the n entries in the set.

• Fully associative cache:
– choose among all entries in the cache.

• Least-recently used (LRU) replacement policy:
– choose the entry unused for the longest time,
– simple for 2-way, manageable for up to 16-way, probably too inefficient beyond that.

• A random replacement policy is also a possibility.


Cache Writing on Hit

• A write is more complex and takes longer than a read, since:
– the write and the tag comparison cannot proceed simultaneously, as the read and tag comparison can in the case of a cache read,
– only a portion of the block has to be updated.

• Two approaches on a cache hit: write through and write back.

• Write through technique:
– updates main memory and the block in the cache,
– but it makes writes take longer and does not take advantage of the cache, although it is simple to implement.

• Improvement to this technique: a special write buffer:
– it holds data waiting to be written to memory,
– the CPU continues immediately, and stalls on a write only if the write buffer is already full.

Cache Writing on Hit (cont.)

• Write back technique:
– just update the block in the cache,
– keep track of whether or not a block has been updated; each entry has a dirty bit for that purpose,
– update memory when the cache entry has to be replaced.

• Cache and memory are inconsistent; a problem in multi-processor (multi-core) systems. Why?

• When a dirty block is to be replaced:
– write it back to memory,
– can also use a write buffer.

• Write through is usually found in level-1 data caches backed by a level-2 cache that uses write back.
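A sketch of the write-back bookkeeping described above (an illustration, not from the slides; write_block_to_memory() is a hypothetical helper): each cache line carries a dirty bit, a write hit updates only the cache, and memory is updated only when a dirty line is evicted:

```c
#include <stdbool.h>

/* Each entry carries a dirty bit alongside the valid bit and tag. */
struct line { bool valid, dirty; unsigned tag; unsigned char data[16]; };

/* Hypothetical helper: writes one block back to main memory. */
extern void write_block_to_memory(unsigned tag, unsigned index,
                                  const unsigned char *data);

/* Write hit: update only the cache and mark the line dirty. */
void write_hit(struct line *l, unsigned offset, unsigned char byte)
{
    l->data[offset] = byte;
    l->dirty = true;                 /* main memory is now stale */
}

/* Eviction: write the block back only if it was modified. */
void evict(struct line *l, unsigned index)
{
    if (l->valid && l->dirty)
        write_block_to_memory(l->tag, index, l->data);
    l->valid = false;
    l->dirty = false;
}
```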


Cache Writing on Miss

• What should happen on a write miss?

• Write-allocate: load the block into the cache, then update it in the cache (good if more writes to the location follow):
– write-back usually uses write-allocate.

• No-write-allocate: write immediately to main memory only:
– write-through usually uses no-write-allocate.

Intel Core i7 Cache Hierarchy

[Figure: processor package with four cores (Core 0 through Core 3); each core has its own registers, L1 d-cache, L1 i-cache and L2 unified cache; an L3 unified cache is shared by all cores and connects to main memory.]

Intel Core i7 (4 cores):
• L1 i-cache and d-cache: 32 KB each, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 11 cycles
• L3 unified cache: 8 MB, 16-way, access: 30-40 cycles
• Block size: 64 bytes for all caches.

