Computer Organization: A Programmer's Perspective Based on class notes by Bryant and O'Hallaron 2
The CPU-Memory Gap
The gap widens between DRAM, disk, and CPU speeds.
[Figure: log-scale plot of time (ns) versus year, 1985 to 2015. Series: disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, effective CPU cycle time.]
Key Question: How to have fast, cheap memory?
Principle of Locality
Programs tend to reuse data and instructions that are the same as, or near, those they have used recently.
Temporal locality: Recently referenced items are likely to be referenced in the near future.
Spatial locality: Items with nearby addresses tend to be referenced close together in time.
Locality Example
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
Data
  Reference array elements in succession (stride-1 reference pattern): spatial locality
  Reference sum each iteration: temporal locality
Instructions
  Reference instructions in sequence: spatial locality
  Cycle through loop repeatedly: temporal locality
Locality Qualitative Estimation
Question: Does this function have good locality?
    int sumarray(int a[M][M])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < M; j++)
                sum += a[i][j];
        return sum;
    }
Locality Example
Question: Does this function have good locality?
    int sumarray(int a[M][M])
    {
        int i, j, sum = 0;

        for (j = 0; j < M; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
Important skill for a professional programmer: be able to look at code and get a qualitative sense of its locality.
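To compare the two sumarray versions, it can help to look at how far apart consecutive accesses land in memory. C stores a[M][M] in row-major order, so the inner loop of the row-wise version moves one int at a time (stride 1), while the inner loop of the column-wise version jumps M ints. The following is only a sketch: the size M = 4 and the names sumarray_rows, sumarray_cols, and stride_in_ints are assumptions made for illustration, not part of the notes.

```c
#include <assert.h>
#include <stddef.h>

#define M 4  /* assumed small size, for illustration only */

/* Row-wise traversal: the inner loop walks along a row (stride 1). */
int sumarray_rows(int a[M][M])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: the inner loop walks down a column (stride M). */
int sumarray_cols(int a[M][M])
{
    int i, j, sum = 0;

    for (j = 0; j < M; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Distance, measured in ints, from a[i0][j0] to a[i1][j1]. */
ptrdiff_t stride_in_ints(int a[M][M], int i0, int j0, int i1, int j1)
{
    assert(i0 >= 0 && i0 < M && i1 >= 0 && i1 < M);
    assert(j0 >= 0 && j0 < M && j1 >= 0 && j1 < M);
    return ((char *)&a[i1][j1] - (char *)&a[i0][j0]) / (ptrdiff_t)sizeof(int);
}
```

Both functions return the same sum. The distance from a[0][0] to a[0][1] is 1 int, while the distance from a[0][0] to a[1][0] is M ints, which is why the row-wise loop has better spatial locality.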
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
  Fast storage technologies cost more per byte and have less capacity.
  The gap between CPU and main memory speed is widening.
  Well-written programs tend to exhibit good locality.
These properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
Example Memory Hierarchy
Toward the top: smaller, faster, and costlier (per byte) storage devices.
Toward the bottom: larger, slower, and cheaper (per byte) storage devices.
L0: Regs. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). L2 cache holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM). L3 cache holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers).
Caches
Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
  For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work?
  Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Net effect: A large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
Caching in a Memory Hierarchy
Level k+1: The larger, slower, cheaper storage device at level k+1 is partitioned into blocks (blocks 0 to 15 in the figure).
Data is copied between levels in block-sized transfer units (blocks 4 and 10 in the figure).
Level k: The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (blocks 8, 9, 14, and 3 in the figure).
General Caching Concepts
Program needs object d, which is stored in some block b.
Cache hit
  Program finds b in the cache at level k. E.g., block 14.
Cache miss
  b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
  If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
  Placement policy: where can the new block go? E.g., b mod 4.
  Replacement policy: which block should be evicted? E.g., LRU.
[Figure: level k caches four of the blocks 0 to 15 held at level k+1. A request for block 14 hits at level k. A request for block 12 misses, so block 12 is fetched from level k+1 and replaces block 4 at level k.]
General Caching Concepts
Types of cache misses:
Cold (compulsory) miss
  Cold misses occur because the cache is empty.
Conflict miss
  Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
  Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
  Occurs when the set of active cache blocks (working set) is larger than the cache.
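The 0, 8, 0, 8, ... pattern can be checked with a small sketch of a 4-block level k cache that uses the (i mod 4) placement from the example. The names NSLOTS and count_misses are hypothetical, and the model tracks only which block number sits in each position, not the data.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4  /* level k holds 4 blocks, as in the (i mod 4) example */

/* Count misses for a trace of block numbers under (block mod NSLOTS) placement. */
int count_misses(const int *trace, size_t n)
{
    int slot[NSLOTS];
    int misses = 0;
    size_t i;

    for (i = 0; i < (size_t)NSLOTS; i++)
        slot[i] = -1;                 /* -1 marks an empty position */

    for (i = 0; i < n; i++) {
        int b = trace[i];
        int pos;

        assert(b >= 0);
        pos = b % NSLOTS;             /* placement policy: exactly one legal position */
        if (slot[pos] != b) {         /* cold miss or conflict miss */
            misses++;
            slot[pos] = b;            /* fetch from level k+1, evicting the old block */
        }
    }
    return misses;
}
```

With the trace 0, 8, 0, 8, 0, 8 every reference misses, even though three positions stay empty, because blocks 0 and 8 both map to position 0. A trace such as 0, 9, 0, 9, 0, 9 incurs only the two cold misses.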
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  Hold frequently accessed blocks of main memory.
CPU looks first for data in L1, then in L2, then in main memory.
Typical bus structure:
[Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface. A cache bus leads to the L2 cache, the system bus leads to the I/O bridge, and the memory bus connects the I/O bridge to main memory.]
General Organization of a Cache Memory
Cache is an array of sets.
Each set contains one or more lines.
Each line holds a block of data.
  S = 2^s sets (set 0, set 1, ..., set S-1)
  E lines per set
  B = 2^b bytes per cache block (bytes 0, 1, ..., B-1)
  t tag bits per line
  1 valid bit per line
Cache size: C = B x E x S data bytes
Addressing Caches
Address A (m bits, from bit m-1 down to bit 0):
  <tag> (t bits)   <set index> (s bits)   <block offset> (b bits)
The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
The word contents begin at offset <block offset> bytes from the beginning of the block.
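The split of an address into <tag>, <set index>, and <block offset> comes down to shifts and masks. The sketch below assumes addresses fit in 64 bits; the struct addr_fields and the function split_address are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an address for a cache with 2^s sets and 2^b bytes per block. */
struct addr_fields {
    uint64_t tag;
    uint64_t set_index;
    uint64_t block_offset;
};

/* Split address a into <tag> <set index> <block offset>. */
struct addr_fields split_address(uint64_t a, unsigned s, unsigned b)
{
    struct addr_fields f;

    assert(s + b < 64);                                  /* keep every shift well defined */
    f.block_offset = a & ((UINT64_C(1) << b) - 1);       /* low b bits */
    f.set_index = (a >> b) & ((UINT64_C(1) << s) - 1);   /* next s bits */
    f.tag = a >> (s + b);                                /* remaining high bits */
    return f;
}
```

For example, with s = 2 and b = 1, address 13 (binary 1101) splits into tag 1, set index 2, and block offset 1.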
Direct-Mapped Cache
Simplest kind of cache.
Characterized by exactly one line per set (E=1).
[Figure: sets 0 through S-1, each holding a single line made of a valid bit, a tag, and a cache block.]
Accessing Direct-Mapped Caches
Set selection: Use the set index bits to determine the set of interest.
[Figure: an m-bit address split into tag (t bits), set index (s bits), and block offset (b bits). The set index bits 00001 select set 1 among sets 0 through S-1.]
Accessing Direct-Mapped Caches
Line matching and word selection
  Line matching: Find a valid line in the selected set with a matching tag.
  Word selection: Then extract the word (here 32 bits, made from the 4 bytes B0..B3).
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and block offset selects starting byte.
[Figure: the address has tag 0110, set index i, and block offset 100. The line in selected set i has valid bit 1 and tag 0110. Its block holds bytes 0 to 7, with B0..B3 at offsets 4 to 7.]
Direct-Mapped Cache Simulation
M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set
Address bits: t=1 tag bit, s=2 set index bits, b=1 block offset bit
Address trace (reads), decimal address with its binary form in brackets:
  0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
(1) 0 [0000] (cold miss): set 0 becomes v=1, tag=0, data M[0] M[1]
(2) 1 [0001] (hit!): set 0 still holds v=1, tag=0, data M[0] M[1]
(3) 13 [1101] (cold miss): set 2 becomes v=1, tag=1, data M[12] M[13]
(4) 8 [1000] (conflict miss): set 0 becomes v=1, tag=1, data M[8] M[9]
(5) 0 [0000] (conflict miss): set 0 becomes v=1, tag=0, data M[0] M[1]
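The trace above can be replayed in a few lines of C. This is a minimal sketch under the slide's parameters (s = 2, b = 1, E = 1); the names struct line and simulate are hypothetical, and only valid bits and tags are modeled, not the data bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define S_BITS 2                 /* s = 2, so S = 4 sets */
#define B_BITS 1                 /* b = 1, so B = 2 bytes per block */
#define NSETS  (1u << S_BITS)

struct line {
    bool valid;
    unsigned tag;
};

/* Replay a read trace on an initially empty direct-mapped cache (E = 1).
   hit[i] is set to true when reference i hits. Returns the miss count. */
int simulate(const unsigned *trace, size_t n, bool *hit)
{
    struct line cache[NSETS] = {{false, 0}};
    int misses = 0;

    assert(trace != NULL && hit != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> B_BITS) & (NSETS - 1);   /* set selection */
        unsigned tag = trace[i] >> (S_BITS + B_BITS);
        struct line *ln = &cache[set];

        hit[i] = ln->valid && ln->tag == tag;                /* line matching */
        if (!hit[i]) {           /* miss: load the block, replacing the old line */
            ln->valid = true;
            ln->tag = tag;
            misses++;
        }
    }
    return misses;
}
```

Replaying 0, 1, 13, 8, 0 yields miss, hit, miss, miss, miss, which matches steps (1) through (5). The sketch does not distinguish cold misses from conflict misses.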
Why Use Middle Bits as Index?
High-Order Bit Indexing
  Adjacent memory lines would map to the same cache entry.
  Poor use of spatial locality.
Middle-Order Bit Indexing
  Consecutive memory lines map to different cache lines.
  Can hold a C-byte region of address space in cache at one time.
[Figure: a 4-line cache (lines 00, 01, 10, 11) shown beside the sixteen 4-bit addresses 0000 through 1111, once under high-order bit indexing and once under middle-order bit indexing.]
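The difference between the two indexing schemes can be seen by computing both indices for a few addresses. The sketch below assumes the same layout as the direct-mapped simulation (4-bit addresses, s = 2, b = 1); that layout and the function names are assumptions chosen for illustration.

```c
#include <assert.h>

#define ADDR_BITS 4   /* assumed: M = 16 addresses */
#define S_BITS    2   /* assumed: 4 cache sets */
#define B_BITS    1   /* assumed: 2-byte blocks */

/* Set index taken from the high-order s bits of the address. */
unsigned high_order_index(unsigned addr)
{
    assert(addr < (1u << ADDR_BITS));
    return addr >> (ADDR_BITS - S_BITS);
}

/* Set index taken from the s bits just above the block offset. */
unsigned middle_order_index(unsigned addr)
{
    assert(addr < (1u << ADDR_BITS));
    return (addr >> B_BITS) & ((1u << S_BITS) - 1);
}
```

The blocks starting at addresses 0 and 2 are adjacent in memory. Under high-order indexing both land in set 0, so they evict each other. Under middle-order indexing the four consecutive blocks at 0, 2, 4, and 6 spread across sets 0 to 3.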
Set Associative Caches
Characterized by more than one line per set.
[Figure: sets 0 through S-1 with E=2 lines per set. Each line has a valid bit, a tag, and a cache block.]
Accessing Set Associative Caches
Set selection: identical to direct-mapped cache.
[Figure: an m-bit address split into tag (t bits), set index (s bits), and block offset (b bits). The set index bits 00001 select set 1, which holds two lines.]
Accessing Set Associative Caches
Line matching and word selection: must compare the tag in each valid line in the selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and block offset selects starting byte. The four bytes here make up a full 32-bit word.
[Figure: the address has tag 0110, set index i, and block offset 100. Selected set i holds two valid lines, with tags 1001 and 0110. The line with tag 0110 matches, and bytes b0..b3 sit at offsets 4 to 7 of its block.]
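Line matching in a set associative cache can be sketched as a search over the E lines of the selected set. Hardware compares the tags in parallel; the loop below only models the result. The names struct line, struct set, and find_line are hypothetical, and E is fixed at 2 to match the figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define E 2  /* lines per set, as in the E=2 figure */

struct line {
    bool valid;
    uint64_t tag;
};

struct set {
    struct line lines[E];
};

/* Return the index of the matching line in the set, or -1 on a miss. */
int find_line(const struct set *set, uint64_t tag)
{
    assert(set != NULL);
    for (int i = 0; i < E; i++) {
        /* (1) the valid bit must be set and (2) the tag must match */
        if (set->lines[i].valid && set->lines[i].tag == tag)
            return i;
    }
    return -1;
}
```

With two valid lines tagged 1001 and 0110, a lookup for tag 0110 returns the second line. Clearing that line's valid bit turns the same lookup into a miss.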
What about writes?
Multiple copies of data exist: L1, L2, L3, main memory, disk.
What to do on a write-hit?
  Write-through (write immediately to one level down)
  Write-back (defer write to one level down until replacement of line)
    Needs a dirty bit (records whether the line differs from memory)
What to do on a write-miss?
  Write-allocate (load into cache, update line in cache)
    Good if more writes to the location follow
  No-write-allocate (writes straight to one level down, does not load into cache)
Typical combinations:
  Write-through + No-write-allocate
  Write-back + Write-allocate
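The two typical combinations can be contrasted with a toy model: a single cache line in front of a small array that stands in for "one level down". Everything here is an assumption made for illustration (one-word blocks, one line, the names memory, mem_writes, write_back_store, and write_through_store); it is a sketch of the policies, not of any real cache.

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_WORDS 16

static int memory[MEM_WORDS];   /* stands in for "one level down" */
static int mem_writes;          /* counts writes that reach memory */

/* A single cache line holding one word. */
struct line {
    bool valid;
    bool dirty;   /* true when the line differs from memory */
    int addr;
    int data;
};

/* Write-back + write-allocate. */
void write_back_store(struct line *ln, int addr, int value)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    if (!(ln->valid && ln->addr == addr)) {   /* write miss */
        if (ln->valid && ln->dirty) {         /* replacement: flush the dirty victim */
            memory[ln->addr] = ln->data;
            mem_writes++;
        }
        ln->valid = true;                     /* write-allocate: load the line */
        ln->addr = addr;
        ln->data = memory[addr];
    }
    ln->data = value;                         /* update only the cached copy */
    ln->dirty = true;
}

/* Write-through + no-write-allocate. */
void write_through_store(struct line *ln, int addr, int value)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    if (ln->valid && ln->addr == addr)
        ln->data = value;                     /* write hit: keep the cached copy in sync */
    memory[addr] = value;                     /* always write one level down */
    mem_writes++;
}
```

Three write-back stores to the same address reach memory zero times until the line is replaced, at which point one write carries the final value down. The same stores under write-through each reach memory, and a write miss leaves the cache untouched.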
Intel Pentium Cache Hierarchy (processor chip)
Regs.
L1 Data: 16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency
L1 Instruction: 16 KB, 4-way, 32B lines
L2 Unified: 128 KB to 2 MB, 4-way assoc, write-back, write allocate, 32B lines
Main Memory: up to 4 GB
Intel Core i7 Cache Hierarchy
Processor package:
  Core 0 ... Core 3, each with Regs, L1 d-cache, L1 i-cache, and L2 unified cache
  L3 unified cache (shared by all cores)
Main memory
L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
L2 unified cache: 256 KB, 8-way, Access: 10 cycles
L3 unified cache: 8 MB, 16-way, Access: 40-75 cycles
Block size: 64 bytes for all caches.
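Given the relation C = B x E x S from the general cache organization, the number of sets in each Core i7 cache follows from the sizes listed above. The helper num_sets is a hypothetical name; the sketch assumes KB and MB mean 1024 and 1024 x 1024 bytes.

```c
#include <assert.h>

/* Number of sets S, solved from C = B x E x S. */
unsigned long num_sets(unsigned long c_bytes, unsigned long e_lines,
                       unsigned long b_bytes)
{
    assert(e_lines > 0 && b_bytes > 0);
    assert(c_bytes % (e_lines * b_bytes) == 0);   /* sizes must divide evenly */
    return c_bytes / (e_lines * b_bytes);
}
```

Under these assumptions the L1 caches have 64 sets (32 KB, 8-way, 64-byte blocks), the L2 cache has 512 sets, and the L3 cache has 8192 sets.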