Computer Organization: A Programmer's Perspective

The Memory Hierarchy

Gal A. [email protected]

Based on class notes by Bryant and O'Hallaron

The CPU-Memory Gap

The gap widens between DRAM, disk, and CPU speeds.

[Figure: time in ns (logarithmic axis, ticks up to 100,000,000) versus year (1985 to 2015) for disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time.]


Key Question: How to have fast, cheap memory?


Principle of Locality

Programs tend to reuse data and instructions that are near those they have used recently, or that are the same as those they have used recently.

Temporal locality: Recently referenced items are likely to be referenced in the near future.

Spatial locality: Items with nearby addresses tend to be referenced close together in time.


Locality Example

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data
  Reference array elements in succession (stride-1 reference pattern): spatial locality
  Reference sum each iteration: temporal locality

Instructions
  Reference instructions in sequence: spatial locality
  Cycle through loop repeatedly: temporal locality


Locality Qualitative Estimation

Question: Does this function have good locality?

int sumarray(int a[M][M])
{
    int i, j, sum = 0;

    /* Row by row: the inner loop walks along one row. */
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            sum += a[i][j];
    return sum;
}


Locality Example

Question: Does this function have good locality?

int sumarray(int a[M][M])
{
    int i, j, sum = 0;

    /* Column by column: the inner loop walks down one column. */
    for (j = 0; j < M; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Important skill for a professional programmer: be able to look at code and get a qualitative sense of its locality.
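The two traversal orders above can be compared side by side. The following is a minimal sketch, not part of the original notes: the function names sum_rows, sum_cols, and fill, and the dimension M = 256, are hypothetical choices for illustration. Both functions compute the same sum; they differ only in the order in which memory is touched.

```c
#include <stddef.h>

#define M 256  /* hypothetical matrix dimension for this sketch */

/* Row-major traversal: the inner loop visits consecutive addresses (stride 1). */
long sum_rows(int a[M][M])
{
    long sum = 0;
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < M; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: the inner loop jumps M ints between accesses (stride M). */
long sum_cols(int a[M][M])
{
    long sum = 0;
    for (size_t j = 0; j < M; j++)
        for (size_t i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Fill the matrix with a known pattern, a[i][j] = i + j, so the sums can be checked. */
void fill(int a[M][M])
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < M; j++)
            a[i][j] = (int)(i + j);
}
```

Because C stores a[M][M] row by row, sum_rows has a stride-1 reference pattern and sum_cols has a stride-M pattern. Timing the two on a large matrix is one way to see the effect, though the size of the difference depends on the machine.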


Memory Hierarchies

Some fundamental and enduring properties of hardware and software:
  Fast storage technologies cost more per byte and have less capacity.
  The gap between CPU and main memory speed is widening.
  Well-written programs tend to exhibit good locality.

These properties complement each other beautifully.

They suggest an approach for organizing memory and storage systems known as a memory hierarchy.


Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices are at the top; larger, slower, and cheaper (per byte) storage devices are at the bottom.

L0: Regs. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). L2 cache holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM). L3 cache holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers).


Caches

Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.

Fundamental idea of a memory hierarchy:
  For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.

Why do memory hierarchies work?
  Programs tend to access the data at level k more often than they access the data at level k+1.
  Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
  Net effect: A large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.


Caching in a Memory Hierarchy

Level k: The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (in the figure, blocks 8, 9, 14, and 3).

Data is copied between levels in block-sized transfer units (in the figure, blocks 4 and 10 are copied up).

Level k+1: The larger, slower, cheaper storage device at level k+1 is partitioned into blocks (in the figure, blocks 0 to 15).


General Caching Concepts

Program needs object d, which is stored in some block b.

Cache hit
  Program finds b in the cache at level k. E.g., block 14.

Cache miss
  b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
  If the level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
    Placement policy: where can the new block go? E.g., b mod 4.
    Replacement policy: which block should be evicted? E.g., LRU.

[Figure: level k+1 holds blocks 0 to 15. A request for block 14 hits in the level k cache. A request for block 12 misses, so block 12 is fetched from level k+1 and replaces block 4* at level k.]


General Caching Concepts

Types of cache misses:

Cold (compulsory) miss
  Cold misses occur because the cache is empty.

Conflict miss
  Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
  Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.

Capacity miss
  Occurs when the set of active cache blocks (working set) is larger than the cache.
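The conflict-miss example can be checked with a small model. This is a minimal sketch under stated assumptions: a level k cache with 4 block positions and the placement rule "block i goes to position i mod 4". The names tiny_cache, access_block, and count_misses are hypothetical and do not come from the notes.

```c
#include <stdbool.h>

#define NUM_SLOTS 4  /* assumed: the level k cache has 4 block positions */

/* Hypothetical model: block b may only be placed in position b mod 4. */
struct tiny_cache {
    bool valid[NUM_SLOTS];
    int  block[NUM_SLOTS];
};

/* Returns true on a hit. On a miss, the block is fetched and
   replaces whatever occupied its position. */
bool access_block(struct tiny_cache *c, int b)
{
    int slot = b % NUM_SLOTS;           /* placement policy: b mod 4 */
    if (c->valid[slot] && c->block[slot] == b)
        return true;
    c->valid[slot] = true;
    c->block[slot] = b;
    return false;
}

/* Counts misses over a trace of block numbers, starting from an empty cache. */
int count_misses(const int *trace, int n)
{
    struct tiny_cache c = {0};
    int misses = 0;
    for (int i = 0; i < n; i++)
        if (!access_block(&c, trace[i]))
            misses++;
    return misses;
}
```

With the trace 0, 8, 0, 8, 0, 8 every reference misses, because blocks 0 and 8 both map to position 0 even though three positions stay unused. With the trace 0, 1, 0, 1, 0, 1 only the first two references miss.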


Cache Memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  Hold frequently accessed blocks of main memory.

CPU looks first for data in L1, then in L2, then in main memory.

Typical bus structure:

[Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface. Outside the chip are the L2 cache, the I/O bridge, and main memory, connected by the cache bus, system bus, and memory bus.]



General Organization of a Cache Memory

Cache is an array of sets.
Each set contains one or more lines.
Each line holds a block of data.

  S = 2^s sets (set 0, set 1, ..., set S-1)
  E lines per set
  B = 2^b bytes per cache block (bytes 0, 1, ..., B-1)
  t tag bits per line
  1 valid bit per line

Cache size: C = B x E x S data bytes


Addressing Caches

Address A (m bits, from bit m-1 down to bit 0):

  <tag>     <set index>   <block offset>
  t bits    s bits        b bits

The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.

The word contents begin at offset <block offset> bytes from the beginning of the block.
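The address split can be written with shifts and masks. The following is a minimal sketch; the struct addr_fields and the function split_address are hypothetical names, and the code assumes s + b is smaller than 64 so the shifts are well defined.

```c
#include <stdint.h>

/* Fields of an address for a cache with S = 2^s sets and B = 2^b bytes per block. */
struct addr_fields {
    uint64_t tag;
    uint64_t set_index;
    uint64_t block_offset;
};

/* Split an address into <tag> <set index> <block offset>.
   Assumes s + b < 64. */
struct addr_fields split_address(uint64_t addr, unsigned s, unsigned b)
{
    struct addr_fields f;
    f.block_offset = addr & ((UINT64_C(1) << b) - 1);         /* low b bits */
    f.set_index    = (addr >> b) & ((UINT64_C(1) << s) - 1);  /* next s bits */
    f.tag          = addr >> (s + b);                         /* remaining high bits */
    return f;
}
```

For example, with s = 2 and b = 1, address 13 (binary 1101) splits into tag 1, set index 2 (binary 10), and block offset 1.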


Direct-Mapped Cache

Simplest kind of cache.

Characterized by exactly one line per set (E=1).

  set 0:     valid   tag   cache block
  set 1:     valid   tag   cache block
  ...
  set S-1:   valid   tag   cache block


Accessing Direct-Mapped Caches

Set selection
  Use the set index bits to determine the set of interest.

[Figure: an m-bit address with t tag bits, s set index bits (here 00001), and b block offset bits. The set index selects one set (the selected set) out of sets 0 to S-1, each holding a valid bit, a tag, and a cache block.]


Accessing Direct-Mapped Caches

Line matching and word selection
  Line matching: Find a valid line in the selected set with a matching tag.
  Word selection: Then extract the word (here 32 bits, made from the 4 bytes B0..B3).

  (1) The valid bit must be set.
  (2) The tag bits in the cache line must match the tag bits in the address.
  (3) If (1) and (2), then cache hit, and block offset selects starting byte.

[Figure: selected set (i) holds a line with valid bit 1, tag 0110, and an 8-byte block (bytes 0 to 7) with B0..B3 in bytes 4 to 7. The address has tag 0110, set index i, and block offset 100.]


Direct-Mapped Cache Simulation

M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set

Address bits: t=1 tag bit, s=2 set index bits, b=1 block offset bit

Address trace (reads, binary in brackets): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

(1) 0 [0000] (cold miss): set 0 loads v=1, tag=0, data=M[0], M[1]
(2) 1 [0001] (hit!): found in set 0
(3) 13 [1101] (cold miss): set 2 loads v=1, tag=1, data=M[12], M[13]
(4) 8 [1000] (conflict miss): set 0 is replaced with v=1, tag=1, data=M[8], M[9]
(5) 0 [0000] (conflict miss): set 0 is replaced with v=1, tag=0, data=M[0], M[1]

Final cache contents:

  set   v   tag   data
  0     1   0     M[0], M[1]
  1     0
  2     1   1     M[12], M[13]
  3     0
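The trace above can be replayed in code. This is a minimal sketch that assumes the parameters of this slide (S=4, B=2, E=1) and tracks only valid bits and tags, not the data. The names struct line and simulate are hypothetical.

```c
#include <stdbool.h>

#define S_SETS 4   /* S = 4 sets, so s = 2 set index bits */
#define S_BITS 2
#define B_BITS 1   /* B = 2 bytes per block, so b = 1 offset bit */

/* One line per set (E = 1); the data bytes are not modeled. */
struct line { bool valid; unsigned tag; };

/* Replays a read trace on an initially empty cache.
   Stores 1 in hits[i] for a hit and 0 for a miss; returns the miss count. */
int simulate(const unsigned *trace, int n, int *hits)
{
    struct line sets[S_SETS] = {{false, 0}};
    int misses = 0;
    for (int i = 0; i < n; i++) {
        unsigned set = (trace[i] >> B_BITS) & (S_SETS - 1);
        unsigned tag = trace[i] >> (B_BITS + S_BITS);
        if (sets[set].valid && sets[set].tag == tag) {
            hits[i] = 1;
        } else {
            hits[i] = 0;
            sets[set].valid = true;   /* E = 1: the new block replaces the old one */
            sets[set].tag = tag;
            misses++;
        }
    }
    return misses;
}
```

Replaying the addresses 0, 1, 13, 8, 0 gives miss, hit, miss, miss, miss, which matches steps (1) to (5). The sketch reports only hit or miss; it does not classify a miss as cold or conflict.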


Why Use Middle Bits as Index?

High-Order Bit Indexing
  Adjacent memory lines would map to the same cache entry.
  Poor use of spatial locality.

Middle-Order Bit Indexing
  Consecutive memory lines map to different cache lines.
  Can hold a C-byte region of the address space in the cache at one time.

[Figure: a 4-line cache (lines 00, 01, 10, 11) and 16 memory lines (0000 to 1111), shown once with high-order bit indexing and once with middle-order bit indexing.]
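The two indexing schemes can be compared directly on the figure's 16 memory lines and 4-line cache. The following is a minimal sketch; the function names are hypothetical. It treats each 4-bit number as a memory line number, so the block offset bits are already stripped and the "middle" bits of the full address are the low two bits here.

```c
/* 4-bit memory line numbers (0..15) mapped onto a 4-line cache. */

/* High-order indexing: use the top two bits of the line number. */
unsigned high_order_index(unsigned line) { return (line >> 2) & 0x3; }

/* Middle-order indexing: use the bits just above the block offset,
   which are the low two bits of the line number. */
unsigned middle_order_index(unsigned line) { return line & 0x3; }

/* Counts how many distinct cache lines a run of consecutive memory
   lines occupies under the given mapping function. */
int distinct_lines_used(unsigned first, unsigned count, unsigned (*map)(unsigned))
{
    int used[4] = {0, 0, 0, 0};
    int distinct = 0;
    for (unsigned i = 0; i < count; i++) {
        unsigned idx = map(first + i);
        if (!used[idx]) {
            used[idx] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Memory lines 0000 to 0011 all land in cache line 00 under high-order indexing, so only one of them can be cached at a time. Under middle-order indexing the same four lines occupy four different cache lines.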


Set Associative Caches

Characterized by more than one line per set.

[Figure: sets 0 to S-1, each with E=2 lines per set; every line has a valid bit, a tag, and a cache block.]






Accessing Set Associative Caches

Set selection
  Identical to direct-mapped cache.

[Figure: an m-bit address with t tag bits, s set index bits (here 00001), and b block offset bits. The set index selects one set (the selected set) out of sets 0 to S-1; each set holds two lines, each with a valid bit, a tag, and a cache block.]


Accessing Set Associative Caches

Line matching and word selection
  Must compare the tag in each valid line in the selected set.

  (1) The valid bit must be set.
  (2) The tag bits in one of the cache lines must match the tag bits in the address.
  (3) If (1) and (2), then cache hit, and block offset selects starting byte. The four bytes here make up a full 32-bit word.

[Figure: selected set (i) holds two valid lines, one with tag 1001 and one with tag 0110 whose 8-byte block (bytes 0 to 7) holds b0..b3 in bytes 4 to 7. The address has tag 0110, set index i, and block offset 100.]
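Line matching in a set associative cache can be sketched as a loop over the lines of the selected set. This is an illustrative sketch only: the geometry (2 sets, 2 lines per set, 2-byte blocks) and the names struct way, struct cache, and cache_read are assumptions, and LRU is used as the replacement policy because it is the example policy named earlier in these notes.

```c
#include <stdbool.h>

#define E_WAYS 2   /* E = 2 lines per set */
#define S_SETS 2   /* assumed: 2 sets, so s = 1 */
#define S_BITS 1
#define B_BITS 1   /* assumed: B = 2 bytes per block, so b = 1 */

struct way { bool valid; unsigned tag; unsigned last_used; };

struct cache {
    struct way sets[S_SETS][E_WAYS];
    unsigned clock;   /* counts accesses; used as an LRU timestamp */
};

/* Returns true on a hit. On a miss, fills an invalid line if one exists,
   otherwise evicts the least recently used line in the selected set. */
bool cache_read(struct cache *c, unsigned addr)
{
    unsigned set = (addr >> B_BITS) & (S_SETS - 1);
    unsigned tag = addr >> (B_BITS + S_BITS);
    struct way *lines = c->sets[set];
    struct way *victim = &lines[0];

    c->clock++;
    for (int i = 0; i < E_WAYS; i++) {
        /* (1) valid bit set and (2) tag matches: cache hit */
        if (lines[i].valid && lines[i].tag == tag) {
            lines[i].last_used = c->clock;
            return true;
        }
    }
    for (int i = 0; i < E_WAYS; i++) {
        if (!lines[i].valid) {
            victim = &lines[i];
            break;
        }
        if (lines[i].last_used < victim->last_used)
            victim = &lines[i];
    }
    victim->valid = true;
    victim->tag = tag;
    victim->last_used = c->clock;
    return false;
}
```

With this geometry, addresses 0 and 8 map to the same set, yet reading 0, 8, 0, 8 gives miss, miss, hit, hit because the set has room for both blocks. A direct-mapped cache would miss on every one of those reads if the two addresses shared a set.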


What about writes?

Multiple copies of data exist: L1, L2, L3, main memory, disk.

What to do on a write-hit?
  Write-through (write immediately to one level down)
  Write-back (defer write to one level down until replacement of line)
    Need a dirty bit (line different from memory or not)

What to do on a write-miss?
  Write-allocate (load into cache, update line in cache)
    Good if more writes to the location follow
  No-write-allocate (writes straight to one level down, does not load into cache)

Typical
  Write-through + No-write-allocate
  Write-back + Write-allocate
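The difference between the two typical pairings shows up in how many writes reach the next level down. The following is a rough sketch under strong simplifying assumptions: the cache holds a single line, only writes are modeled, and the block loads caused by write-allocate are not counted. The names struct wline, next_level_writes_wt, and next_level_writes_wb are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical one-line cache, just enough to contrast the two pairings. */
struct wline { bool valid; bool dirty; unsigned tag; };

/* Write-through + no-write-allocate: every write, hit or miss,
   is sent to the next level immediately. */
int next_level_writes_wt(const unsigned *blocks, int n)
{
    (void)blocks;   /* the count does not depend on which blocks are written */
    return n;
}

/* Write-back + write-allocate: a write only updates the cached line;
   the next level is written when a dirty line is evicted. */
int next_level_writes_wb(const unsigned *blocks, int n)
{
    struct wline line = {false, false, 0};
    int writes = 0;
    for (int i = 0; i < n; i++) {
        if (!(line.valid && line.tag == blocks[i])) {
            /* write miss: evict the current line, writing it back if dirty */
            if (line.valid && line.dirty)
                writes++;
            line.valid = true;      /* write-allocate: load the block */
            line.tag = blocks[i];
        }
        line.dirty = true;          /* update only the cached copy */
    }
    if (line.valid && line.dirty)   /* final flush so the counts are comparable */
        writes++;
    return writes;
}
```

Four writes to the same block cost four next-level writes under write-through but only one under write-back. When the writes alternate between two blocks that compete for the single line, write-back loses its advantage and also pays for the block loads that this sketch leaves out.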


Intel Pentium Cache Hierarchy

Regs.
L1 Data: 16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency
L1 Instruction: 16 KB, 4-way, 32B lines
L2 Unified: 128 KB to 2 MB, 4-way assoc, write-back, write allocate, 32B lines
Main Memory: up to 4 GB

[Figure: block diagram of the components above, with a box labeled "Processor Chip".]


Intel Core i7 Cache Hierarchy

Processor package:
  Core 0 ... Core 3, each with:
    Regs
    L1 d-cache
    L1 i-cache
    L2 unified cache
  L3 unified cache (shared by all cores)

Main memory

L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
L2 unified cache: 256 KB, 8-way, Access: 10 cycles
L3 unified cache: 8 MB, 16-way, Access: 40-75 cycles
Block size: 64 bytes for all caches.
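The formula C = B x E x S from the general organization slide can be applied to these numbers to find how many sets each cache has. The following is a minimal sketch; the helper names num_sets and log2_exact are hypothetical, and log2_exact assumes its argument is an exact power of two.

```c
/* Number of sets from C = B x E x S, so S = C / (B x E). */
unsigned long num_sets(unsigned long cache_bytes, unsigned long block_bytes,
                       unsigned long ways)
{
    return cache_bytes / (block_bytes * ways);
}

/* Base-2 logarithm for exact powers of two; gives s from S and b from B. */
unsigned log2_exact(unsigned long x)
{
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}
```

Under these figures the L1 caches have 32 KB / (64 B x 8) = 64 sets, the L2 cache has 512 sets, and the L3 cache has 8192 sets. With 64-byte blocks, b = 6, and the L1 set index uses s = 6 bits.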

