Large and Fast: Exploiting Memory Hierarchy
Presentation H
CSE 2431: Introduction to Operating Systems
Gojko Babić
04/06/2020
• A computer system contains a hierarchy of storage devices with different costs, capacities, and access times.
• With a memory hierarchy, a faster storage device at one level of the hierarchy acts as a staging area for a slower storage device at the next lower level.
• Software that is well-written takes advantage of the hierarchy by accessing the faster storage device at a particular level more frequently than the storage at the next lower level.
• As a programmer, understanding the memory hierarchy helps you write applications that perform better.
Memory Hierarchy
[Figure: the memory hierarchy pyramid. Toward the top: smaller, faster, costlier per byte; toward the bottom: larger, slower, cheaper per byte.]
L0: Registers (CPU registers hold words retrieved from the L1 cache)
L1: L1 cache, SRAM (holds cache lines retrieved from the L2 cache)
L2: L2 cache, SRAM (holds cache lines retrieved from main memory)
L3: Main memory, DRAM (holds disk blocks retrieved from local disks)
L4: Local secondary storage, local disks (hold files retrieved from disks on remote network servers)
L5: Remote secondary storage (tapes, distributed file systems, Web servers)
An Example of Memory Hierarchy
The Levels in Memory Hierarchy
               (registers)   SRAM           DRAM         SSD (flash)   magnetic disk
Access time:   < 0.25 nsec   0.5-2.5 nsec   50-70 nsec   5-50 μsec     5-20 msec
Capacity:      10^3 bytes    10^6 bytes     10^9 bytes   10^11 bytes   10^12 bytes
Cost per GB:   -             $500-$1,000    $10-$20      $0.75-$1.00   $0.05-$0.10
(costs in 2012)
• The higher the level, the smaller and faster the memory.
• Try to keep most of the action in the higher levels.
• Locality of reference is the most important program property that is exploited in many parts of the memory hierarchy.
Features
Basic storage unit is a cell (one bit per cell); RAM is traditionally packaged as a chip; multiple chips form a memory.
• Static RAM (SRAM)
  – Each cell implemented with a six-transistor circuit
  – Relatively insensitive to disturbances such as electrical noise, radiation, etc.
  – Faster and more expensive than DRAM
• Dynamic RAM (DRAM)
  – Each bit stored as charge on a capacitor
  – Value must be refreshed every 10-100 ms
  – Sensitive to disturbances
  – Slower and cheaper than SRAM
Random Access Memory (RAM)
       Transistors   Access   Needs      Sensitive?   Cost   Power          Power         Chip
       per bit       time     refresh?                       requirements   dissipation   density
SRAM   4 or 6        1X       No         No           100X   high           high          low
DRAM   1             10X      Yes        Yes          1X     low            low           high
SRAM vs DRAM
• A (d × w)-bit DRAM chip is organized as d supercells of size w bits each.
• Usually w = 8, i.e. one byte.
[Figure: a 16 x 8 DRAM chip organized as a 4 x 4 array of supercells, e.g. supercell (2,1). The memory controller exchanges data and addresses with the CPU, sends a 2-bit row/column address to the chip over the addr lines, and transfers one 8-bit supercell at a time over the data lines; a selected row is first copied into the internal row buffer.]
Conventional DRAM Organization
• Read from address 9 corresponds to read from supercell (2,1)
Read address 9
Step 1: Row access strobe (RAS) selects row 2.
Row 2 copied from DRAM array to row buffer.
[Figure: 16 x 8 DRAM chip. The memory controller issues RAS = 9 / 4 = 2 (9 divided by the number of columns), and row 2 of the supercell array is copied into the internal row buffer.]
Reading 16x8 DRAM Supercell [2,1]
Step 2: Column access strobe (CAS) selects column 1.
Supercell [2,1] copied from buffer to data lines to CPU.
[Figure: 16 x 8 DRAM chip. The memory controller issues CAS = 9 % 4 = 1, and supercell (2,1), the byte with address 9, is copied from the internal row buffer to the data lines and returned to the CPU.]
Reading 16x8 DRAM Supercell [2,1] (cont.)
• Step 3: Since a read is destructive, the internal row buffer has to be written back into the corresponding row of the DRAM array after each read.
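The RAS/CAS address split used in the three steps above can be sketched as follows (a minimal illustration; `supercell_coords` is a name invented here, not a real controller interface):

```python
# Sketch (not vendor-specific): how a memory controller splits a linear
# supercell address into the RAS/CAS pair for a DRAM chip organized as
# a rows x cols array of supercells.

def supercell_coords(addr, num_cols):
    """Return (row, col): RAS = addr // num_cols, CAS = addr % num_cols."""
    return divmod(addr, num_cols)

# The 16 x 8 chip above has 16 supercells in a 4 x 4 array:
row, col = supercell_coords(9, num_cols=4)
print(row, col)  # -> 2 1, i.e. supercell (2,1)
```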
• 4Gbit DRAM can be organized as 512M supercells of size 8 bits
• 512M = 2^9 × 2^20 = 2^14 × 2^15 = 16,384 × 32,768
[Figure: a 512M x 8 DRAM chip, a 16,384 x 32,768 array of supercells (rows 0 .. 16,383, columns 0 .. 32,767). The memory controller sends a 15-bit row/column address over the addr lines and transfers 8 bits at a time over the data lines through the internal row buffer.]
Conventional 4Gbit DRAM Organization
• CPU provides address and data to be written
• Step 1 write is identical to step 1 read.
• Step 2 write includes updating the appropriate cell in the
internal row buffer with data received from CPU
• Step 3 write is identical to step 3 read.
Writing DRAM
[Figure: a 64-bit doubleword at main memory address A spread across eight 512M x 8 DRAM chips. The memory controller sends the same addr (row = i, col = j) to every chip, and each chip supplies supercell [i,j]: DRAM 0 holds bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63, which together form the 64-bit doubleword.]
4GB Memory out of Eight 512Mx8 DRAMs
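The byte slicing in this eight-chip organization can be sketched as follows (the helper names are invented for illustration):

```python
# Sketch of the byte interleaving above: each of the eight 512M x 8 chips
# stores one byte of every 64-bit doubleword, all at the same (row, col)
# supercell address.

def split_doubleword(value):
    """Return [bits 0-7, bits 8-15, ..., bits 56-63], one byte per chip."""
    return [(value >> (8 * chip)) & 0xFF for chip in range(8)]

def join_doubleword(chip_bytes):
    """Reassemble the 64-bit doubleword from the eight chip bytes."""
    return sum(b << (8 * chip) for chip, b in enumerate(chip_bytes))

dw = 0x1122334455667788
parts = split_doubleword(dw)
assert parts[0] == 0x88 and parts[7] == 0x11   # DRAM 0 gets bits 0-7
assert join_doubleword(parts) == dw
```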
• Enhanced DRAMS have optimizations that improve the
speed with which the basic DRAM cells can be accessed.
• Examples:
– Fast page mode DRAM (FPM DRAM); up to 1996
– Extended data out DRAM (EDO DRAM); 1996-99
– Synchronous DRAM (SDRAM)
– Double Data-Rate Synchronous DRAM (DDR SDRAM)
– Rambus DRAM (RDRAM)
Enhanced DRAMs
Growth of Capacity per DRAM chip
Prices of Six Generations of DRAMs
• DRAM size increased by multiples of four approximately once every three years until 1996, and thereafter considerably more slowly.
• The improvements in access time have been slower but continuous, and cost roughly tracks density improvements, although cost is often affected by other issues, such as availability and demand.
• The cost per gibibyte (2^30 bytes) is not adjusted for inflation.
• Source: Patterson & Hennessy, "Computer Organization & Design", 5th Edition
• Information retained if supply voltage is turned off
• Referred to as read-only memories (ROM), although some may be written to as well as read
• Read-only memory (ROM): programmed during production
• Programmable ROM (PROM): a fuse associated with each cell is blown once by zapping with current; can be programmed once
• Erasable PROM (EPROM): cells cleared by shining ultraviolet light, special device used to write 1's; can be erased and reprogrammed about 1,000 times
• Electrically erasable PROM (EEPROM): similar to EPROM but does not require a physically separate programming device, can be reprogrammed in place on printed circuit cards; can be reprogrammed about 100,000 times
• Flash memory
  – Based on EEPROM technology
Nonvolatile Memory
• An SSD package plugs into a standard disk slot on the I/O bus (typically USB or SATA) and behaves like a disk, reading and writing logical blocks.
• Consists of one or more flash memory chips and a flash translation layer (hardware/firmware device) that plays the same role as a disk controller.
Solid State Disk - SSD
Advantages of SSD over rotating disks:
• No moving parts – semiconductor memory is more rugged
• Much faster random access times
• Use less power
Disadvantages of SSD over rotating disks:
• SSDs wear out with usage
• 10-20 times more expensive than disks
Basic (Single CPU) Computer Structure
• CPU and device controllers connect through common bus providing access to shared memory
• Pages: 512B to 4KB, Blocks: 32 to 128 pages
• Data read in units of pages.
• Page can be written only after its block has been erased
• A block wears out after 100,000 repeated writes.
[Figure: a solid state disk (SSD). Requests to read and write logical disk blocks arrive over the I/O bus; the flash translation layer maps them onto flash memory organized as blocks 0 .. B-1, each containing pages 0 .. P-1.]
Solid State Disk – SSD (cont.)
Moving-Head Disk Mechanism
• A sector (usually 512 bytes) is a basic unit of transfer (read/write)
Sequential read throughput   250 MB/s    Sequential write throughput   170 MB/s
Random read throughput       140 MB/s    Random write throughput       14 MB/s
Random read access           30 μs       Random write access           300 μs
SSD Characteristics
Growth in Microprocessor Performance
Growth in Performance of RAM & CPU
• Mismatch between CPU performance growth and memory performance growth => the "memory wall".
• Hence the importance of caches.
                       1980   1990   1995      2000    2003   2005     2010      2010:1980
CPU                    8080   386    Pentium   P-III   P-4    Core 2   Core i7   ---
Clock rate (MHz)       1      20     150       600     3300   2000     2500      2500
Cycle time (ns)        1000   50     6         1.6     0.3    0.50     0.4       2500
Cores                  1      1      1         1       1      2        4         4
Effective cycle
time (ns)              1000   50     6         1.6     0.3    0.25     0.1       10,000

Inflection point in computer history when designers hit the "Power Wall"
CPU Trends: “Power Wall”
• At about the same time, besides “memory wall” and “power wall”, processor designers also reached limits in taking advantage of instruction level parallelism – ILP in (sequential) programs.
• Since the early 2000's processors have not been (significantly) getting faster.
• Instead, multi-core processors.
Principle of Locality of Reference
• Programs access a small proportion of their address space at any time, and they tend to reuse instructions and data they have used recently.
  – temporal locality – recently accessed items are likely to be accessed again soon,
  – spatial locality – items near those accessed recently are likely to be accessed soon, i.e. items near one another tend to be referenced close together in time.
• An implication of the principle of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future, based on its accesses in the recent past.
• The principle of locality applies more strongly to code accesses than to data accesses.
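Spatial locality can be made concrete with a small sketch (the array and names are invented for illustration): a 2-D array stored flat in row-major order is traversed either along rows or along columns, and only the first order touches consecutive addresses.

```python
# Illustrative sketch: a 2-D array stored flat in row-major order, so
# element (i, j) lives at flat index i * N + j.
N = 4
row_major_order = [i * N + j for i in range(N) for j in range(N)]
col_major_order = [i * N + j for j in range(N) for i in range(N)]

# Row-major traversal visits addresses 0, 1, 2, ... (stride 1), so each
# cached block is fully used before moving on: good spatial locality.
# Column-major traversal strides by N and touches a new block every access.
assert row_major_order[:4] == [0, 1, 2, 3]
assert col_major_order[:4] == [0, 4, 8, 12]
```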
• Use memory hierarchy
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to
smaller DRAM memory
– Main memory (virtual memory)
• Copy recently accessed (and nearby) items from DRAM
memory to smaller SRAM memory
– Cache attached to CPU
Taking Advantage of Locality
Typical System Organization Without Cache
[Figure: typical system organization without cache. The CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]
[Figure: typical processor organization with cache. Cache memories sit on the CPU chip between the register file/ALU and the bus interface, which connects through the system bus, I/O bridge and memory bus to main memory.]
Typical Processor Organization with Cache
Cache is fast, but in order to be fast it has to be small. Why?
Model of Memory + Cache + CPU System
[Figure: CPU, cache and main memory in a line; the CPU exchanges data words with the cache, and the cache exchanges blocks with main memory.]
Basics of Cache Operation
• We will first (and mostly) consider a cache read operation, since it is more important. Why?
• When the CPU provides an address requesting the content of a given main memory location:
  – first check the cache for the content; caches include a tag in each cache entry to identify the memory address of a block,
  – if the block is present, this is a hit; get the content (fast) and the CPU proceeds normally, without (or with small) delay,
  – if the block is not present, this is a miss; stall the CPU and read the block with the required content from main memory,
  – long CPU slowdown: the miss penalty is the time to access main memory and to place a block into the cache,
  – and now (after the miss penalty) the CPU gets the required content.
• How do we know what block is stored in a given cache
location?
– store block address as well as the data in a cache entry
– actually, it may only need the high-order bits of block
address called the tag
• What if there is no data in a location?
– introduce valid bit in each entry
– valid bit: 1 = present, 0 = not present
– initially 0
Tags and Valid Bits
Cache Example
[Figure: CPU connected over a bus to a cache (SRAM) and to main memory (DRAM). The cache is initially empty: four entries, each with its v/i bit set to 0. The CPU issues an address and receives 4B of data; the cache issues an address and receives 16B of data from main memory.]
• CPU generates a 4-byte word read at address 100 (16-bit addresses assumed), i.e. a read of the contents of bytes at addresses 100-103.
• Since the cache is empty => cache miss.
Cache entry layout: v/i bit | tag (10 bits) | data (4 × 4B = 16B)
Cache after 4B Read at Address 100
[Figure: the cache (SRAM) now has one valid entry: v/i = 1, tag = 0000000001, data = bytes [96-99][100-103][104-107][108-111]; the other entries remain invalid.]
1. A block of 16 bytes is read from RAM and stored in the cache.
2. The cache entry is chosen according to the cache placement algorithm (direct mapped cache assumed).
Cache after 4B Read at Address 204
[Figure: the cache now has two valid entries: tag = 0000000001 with bytes [96-111], and tag = 0000000011 with bytes [192-207].]
1. A block of 16 bytes is read from RAM and stored in the cache.
2. The cache entry is chosen according to the cache placement algorithm (direct mapped cache assumed).
Cache & DRAM Memory: Performance
[Graph: average access time vs. hit ratio, decreasing from t1 + t2 at hit ratio 0 to t1 at hit ratio 1.]
• t2: main memory access time (e.g. 50 nsec)
• t1: cache access time (e.g. 2 nsec)
• Hit ratio is the ratio of the number of hits to the total number of memory accesses; Miss ratio = 1 – Hit ratio
• Because of locality of reference, hit rates are normally well over 90%

Average Access Time = Cache Access Time + (1 – Hit ratio) × Miss Penalty
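The average-access-time formula can be sketched directly, using the example numbers from this slide (t1 = 2 ns for the cache, t2 = 50 ns for main memory as the miss penalty):

```python
# Sketch of: Average Access Time = Cache Access Time + (1 - hit ratio) * Miss Penalty

def average_access_time(cache_time_ns, miss_penalty_ns, hit_ratio):
    return cache_time_ns + (1.0 - hit_ratio) * miss_penalty_ns

# With a 95% hit ratio: 2 + 0.05 * 50, i.e. about 4.5 ns.
amat = average_access_time(2.0, 50.0, 0.95)
print(amat)
```

Note how strongly the hit ratio dominates: at 100% hits the average is just t1, while at 0% every access pays the full miss penalty on top of the cache lookup.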
• In direct mapped caches, a block from memory has only one cache entry where it has to be stored.
• Location of cache entry is determined by block address and number of entries in the cache:
– location = (block address) mod (number of entries in cache)
• Block address includes all address bits excluding the block offset bits, i.e. the rightmost n bits, where 2^n = the number of bytes in a block.
• Example: 16-bit addresses and block size = 8 bytes
  – then address 0x1234 = 0001 0010 0011 0100 (binary) has block address 0x0246 = 0 0010 0100 0110, since block offset = 3 bits (2^3 = 8),
  – then if the cache has 16 entries, the location is 0110, i.e. the 4 rightmost bits of the block address (since 2^4 = 16); that is the index.
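The tag/index/offset split from this example can be sketched as code (a minimal illustration using the slide's parameters: 16-bit addresses, 8-byte blocks, 16-entry direct-mapped cache; `split_address` is a name invented here):

```python
# Sketch of the address split above: 16-bit addresses, 8-byte blocks
# (3 offset bits), 16-entry direct-mapped cache (4 index bits).

OFFSET_BITS = 3   # 2**3 = 8 bytes per block
INDEX_BITS = 4    # 2**4 = 16 cache entries

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    block_address = addr >> OFFSET_BITS
    index = block_address & ((1 << INDEX_BITS) - 1)
    tag = block_address >> INDEX_BITS
    return tag, index, offset

tag, index, offset = split_address(0x1234)
assert (0x1234 >> OFFSET_BITS) == 0x0246   # block address, as on the slide
assert index == 0b0110                     # the 4 rightmost block-address bits
```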
Direct Mapped Cache
• Assume a processor with 10-bit addresses and a direct mapped cache with 8 entries and 4-byte blocks;
• 4-byte blocks: 4 = 2^2 = 2^(block offset bits) => block offset = 2 bits
• block address = 10 – 2 = 8 bits
• 8 entries: 8 = 2^3 = 2^(index bits) => index = 3 bits
• Address format = 5 bits (tag) + 3 bits (index) + 2 bits (block offset)
V | Tag (5 bits) | Data (32 bits = 4 bytes)
N
N
N
N
N
N
N
N
Direct Mapped Cache Example
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 Y 00010 Mem[88-91]
111 N
Word address Binary address Hit/miss Cache entry
88 00010 110 00 Miss 110
Cache Example: Access 1
Index V Tag Data
000 N
001 N
010 Y 00011 Mem[104-107]
011 N
100 N
101 N
110 Y 00010 Mem[88-91]
111 N
Word address Binary address Hit/miss Cache entry
104 00011 010 00 Miss 010
Cache Example: Access 2
Index V Tag Data
000 N
001 N
010 Y 00011 Mem[104-107]
011 N
100 N
101 N
110 Y 00010 Mem[88-91]
111 N
Word address Binary address Hit/miss Cache entry
88 00010 110 00 Hit 110
104 00011 010 00 Hit 010
Cache Example: Accesses 3, 4
Index V Tag Data
000 Y 00010 Mem[64-67]
001 N
010 Y 00011 Mem[104-107]
011 Y 00000 Mem[12-15]
100 N
101 N
110 Y 00010 Mem[88-91]
111 N
Word address Binary address Hit/miss Cache entry
64 00010 000 00 Miss 000
12 00000 011 00 Miss 011
64 00010 000 00 Hit 000
Cache Example: Accesses 5, 6, 7
Index V Tag Data
000 Y 00010 Mem[64-67]
001 N
010 Y 00010 Mem[72-75]
011 Y 00000 Mem[12-15]
100 N
101 N
110 Y 00010 Mem[88-91]
111 N
Word address Binary address Hit/miss Cache entry
72 00010 010 00 Miss 010
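The whole access sequence from these examples can be replayed with a small sketch (illustrative code, not from the textbook) of an 8-entry direct-mapped cache with 4-byte blocks:

```python
# Sketch: replaying accesses 1-8 above on an 8-entry direct-mapped cache
# with 4-byte blocks (word addresses as shown on the slides).

NUM_ENTRIES = 8
BLOCK_BYTES = 4

cache = {}   # index -> tag, for valid entries only

def access(addr):
    block = addr // BLOCK_BYTES
    index = block % NUM_ENTRIES
    tag = block // NUM_ENTRIES
    hit = cache.get(index) == tag
    cache[index] = tag          # on a miss the new block replaces the old one
    return "Hit" if hit else "Miss"

trace = [access(a) for a in [88, 104, 88, 104, 64, 12, 64, 72]]
print(trace)
# -> ['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Miss', 'Hit', 'Miss']
```

The final miss (address 72) lands on index 010 and evicts Mem[104-107], exactly as in the last table.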
Cache Example: Access 8
• Direct mapped caches: only one choice for cache entry
• Location of cache entry determined by block address
– location = (block address) mod (number of entries in cache)
• Use low-order address bits as a location for cache entry
Direct Mapped Cache
44Presentation H
[Figure: blocks of main memory mapping many-to-one onto the entries of a small cache.]
Direct Mapping Cache: 1 × 4-byte Blocks
[Figure: a 32-bit address (bit positions 31-0) split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2) and a 2-bit byte offset (bits 1-0). The index selects one of 16K entries (valid bit, 16-bit tag, 32 bits of data); the stored tag is compared with the address tag to produce Hit, and the 32-bit data is output.]
• Block offset = 2 bits since 2^2 = 4 bytes. Here block = one data word, so block offset = byte offset and word offset = 0.
• Index = 14 bits since 2^14 = 16K = number of cache entries.
• address 100 (decimal) = 0000000000000000 00000000011001 00 (binary)
  block address = 0000000000000000 00000000011001 (binary)
  The 4-byte block [100-103] will be stored in cache entry 25 = 00000000011001 (binary).
  Note: bytes with addresses 100-103 have the same block address.
• address 204 (decimal) = 0000000000000000 00000000110011 00 (binary)
  block address = 0000000000000000 00000000110011 (binary)
  The 4-byte block [204-207] will be stored in cache entry 51 = 00000000110011 (binary).
  Note: bytes with addresses 204-207 have the same block address.
Cache after Addresses 100 and 204
Direct Mapping Cache: 4 × 4-byte Blocks
[Figure: a 32-bit address (bit positions 31-0) split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit word offset (bits 3-2) and a 2-bit byte offset (bits 1-0). The index selects one of 4K entries (valid bit, 16-bit tag, 128 bits of data); the tag comparison produces Hit, and a multiplexor driven by the word offset selects one of the four 32-bit words.]
• Block offset = 4 bits (2^4 = 16 bytes); Index = 12 bits (2^12 = 4K entries)
• Since data words are 4 bytes, block offset = 2 bits of word offset + 2 bits of byte offset, since 4 = 2^2 = 2^(byte offset bits)
• address 100 (decimal) = 0000000000000000 000000000110 0100 (binary)
  block address = 0000000000000000 000000000110 (binary)
  The 16-byte block [96-111] will be stored in cache entry 6 = 000000000110 (binary).
  Note: bytes with addresses 96-111 have the same block address.
• address 204 (decimal) = 0000000000000000 000000001100 1100 (binary)
  block address = 0000000000000000 000000001100 (binary)
  The 16-byte block [192-207] will be stored in cache entry 12 = 000000001100 (binary).
  Note: bytes with addresses 192-207 have the same block address.
Cache after Addresses 100 and 204
Program   Block size (bytes)   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       4                    6.1%                    2.1%             5.4%
gcc       16                   2.0%                    1.7%             1.9%
spice     4                    1.2%                    1.3%             1.2%
spice     16                   0.3%                    0.6%             0.4%

• The cache with 4-byte blocks had 16K entries => a total of 64KB of data.
• The cache with 16-byte blocks had 4K entries => a total of 64KB of data.
• Thus, both caches can accommodate the same amount of data.
• Larger blocks reduced the miss rate due to spatial locality.
• But for a fixed-sized cache, larger blocks => fewer of them => more competition => may increase the miss rate.
• A larger miss penalty may override the benefit of a reduced miss rate.
• Thus keep in mind that the miss rate is not the only parameter:
Average Access Time = Cache Access Time + Miss Rate × Miss Penalty
Block Size Consideration
• DRAM memory has a width of its read/write operations determined by the width of its bus data lines.
• Numbers used in examples that follow for reading from main memory :
– 2nsec for address transfer from CPU to memory controller (mostly propagation delay),
– 50nsec per DRAM access,
– 2nsec per data transfer from memory controller to cache (mostly propagation delay)
Main Memory Supporting Caches
4-Byte Main Memory Bus
[Figure: one-word-wide memory organization, with CPU, cache and memory connected by a 4-byte-wide bus.]
• For a 16-byte block and a 4-byte-wide DRAM bus:
  Miss penalty = 2 + 4×50 + 4×2 = 210 nsec,
  Bandwidth = 16 bytes / 210 nsec = 0.08 bytes/nsec.
• Although caches are interested in low–latency memory, it is generally easier to improve memory bandwidth with new memory organization than it is to reduce memory latency.
16-Byte (Wider) Main Memory Bus
[Figure: CPU, multiplexor, cache and memory connected by a 16-byte-wide bus.]
• CPU accesses a word at a time, so a need for a multiplexer
• For a 16-byte block and a 16-byte-wide DRAM bus:
  Miss penalty = 2 + 50 + 2 = 54 nsec,
  Bandwidth = 16 bytes / 54 nsec = 0.30 bytes/nsec.
Wider Main Memory Bus & Level-2 Cache
[Figure: CPU, a multiplexor, cache, bus and memory; the multiplexor sits between the CPU and the cache.]
• But the multiplexer is on the critical timing path.
• Level-2 cache helps since the multiplexing is now between level-1 and level-2 caches and not on critical timing path.
Interleaved Memory Organization
• This is four-way interleaved memory with a 4-byte memory bus – it saves wires in the bus.
• Miss penalty = 2 + 50 + 4×2 = 60 nsec,
• Throughput = 16 bytes / 60 nsec = 0.27 bytes/nsec
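The three memory organizations can be compared in one sketch, using the timing assumptions stated earlier (2 ns address transfer, 50 ns per DRAM access, 2 ns per bus transfer, 16-byte blocks):

```python
# Sketch comparing miss penalties of the three organizations for a
# 16-byte block: 2 ns address transfer, 50 ns per DRAM access,
# 2 ns per bus transfer of one 4-byte word.

ADDR, DRAM, XFER = 2, 50, 2
BLOCK, WORD = 16, 4
WORDS = BLOCK // WORD            # 4 transfers per block

one_word_wide = ADDR + WORDS * DRAM + WORDS * XFER   # serial DRAM accesses
wide_16B      = ADDR + DRAM + XFER                   # one wide access, one transfer
interleaved   = ADDR + DRAM + WORDS * XFER           # banks accessed in parallel

print(one_word_wide, wide_16B, interleaved)  # -> 210 54 60
```

Interleaving gets close to the wide-bus penalty while keeping the narrow (cheaper) 4-byte bus.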
[Figure: CPU and cache connected by a bus to four memory banks, bank 0 through bank 3.]
• Level-1 (primary) cache attached to CPU
– small, but very fast
• Level-2 (secondary) cache services misses from level-1 cache
– larger, slower, but still faster than main memory
• Main memory services level-2 cache misses
• Some high-end systems include level-3 cache
• Level-1 cache: focus on minimal hit time
• Level-2 cache:
– focus on low miss rate to avoid main memory access
– hit time has less overall impact
Multilevel Caches
Multilevel Caches (cont.)
• Increased logic density => on-chip cache with processor
– internal cache: level 1
– internal cache: level 2
– external or internal cache: level 3
• Unified cache, i.e. a cache for both instruction and data
– balances the load between instruction and data fetches,
– only one cache needs to be designed and implemented;
• Split cache:
– data cache (d-cache) and instruction cache (i-cache)
– essential for pipelined & other advanced architectures;
• Level 1 caches are normally split caches.
• In addition to direct mapped, two more cache organizations
• Fully associative caches:
– allow a given block to go in any cache entry
– requires all entries to be searched at once
– tag comparator per entry (expensive)
– Note: no index, thus a tag equals a block address
• n-way set associative caches:
– each set contains n entries (blocks)
– block number determines which set in the cache
set number = (block number) mod (number of sets)
– search only n entries in a given set (at once)
– n comparators (less expensive)
Associative Caches
Illustration of Cache Organizations
2 way set associative
4-Way Set Associative Cache with 4-byte Blocks
[Figure: a 32-bit address split into a 22-bit tag (bits 31-10), an 8-bit index (bits 9-2) and a 2-bit block offset (bits 1-0). The index selects one of 256 sets (0-255), each holding four (V, tag, data) entries; four tag comparators operate in parallel and a 4-to-1 multiplexor selects the hit data.]
• A 4-way set associative cache is built out of 4 direct-mapped caches.
• There are 256 sets; the index selects a set.
• Consider 2-way set associative cache with 4 sets, and 8-byte blocks. Assume 16-bit address.
a. Provide the address format.
   8-byte blocks => 3-bit block offset; 4 sets => 2-bit index
   4-byte data (words) => 2-bit byte offset
   3-bit block offset = 2-bit byte offset + 1-bit word offset
   16-bit address = 11-bit tag + 2-bit index + 1-bit word offset + 2-bit byte offset
b. Provide the cache layout. Each set holds two entries:
   v | tag (11 bits) | data 4B | data 4B     v | tag (11 bits) | data 4B | data 4B
c. For the address sequence 8, 0, 52, 20, 56, 16, 24, 116, 20, 8, 16, indicate hit/miss and the content of the cache, initially empty.
Set Associative Cache: Example
Access trace (set = block address mod 4; tag = block address div 4):

  8   => set 1: miss, loads [8-15], tag 0
  0   => set 0: miss, loads [0-7], tag 0
  52  => set 2: miss, loads [48-55], tag 1
  20  => set 2: miss, loads [16-23], tag 0
  56  => set 3: miss, loads [56-63], tag 1
  16  => set 2: hit (tag 0, block [16-23])
  24  => set 3: miss, loads [24-31], tag 0
  116 => set 2: miss, loads [112-119], tag 3; set full, LRU replacement evicts [48-55]
  20  => set 2: hit; 20 = 00000000000 10 1 00 (tag, index, word offset, byte offset)
  8   => set 1: hit; 8 = 00000000000 01 0 00
  16  => set 2: hit; 16 = 00000000000 10 0 00

Final cache content:
  set 0: v=1, tag 0, [4-7] [0-3]          | invalid
  set 1: v=1, tag 0, [12-15] [8-11]       | invalid
  set 2: v=1, tag 3, [116-119] [112-115]  | v=1, tag 0, [20-23] [16-19]
  set 3: v=1, tag 1, [60-63] [56-59]      | v=1, tag 0, [28-31] [24-27]
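The same exercise can be replayed with a small simulator (illustrative code, not from the textbook) of a 2-way set associative cache with 4 sets, 8-byte blocks and LRU replacement:

```python
# Sketch: replaying the sequence 8, 0, 52, 20, 56, 16, 24, 116, 20, 8, 16
# on a 2-way set associative cache with 4 sets, 8-byte blocks, LRU policy.
from collections import OrderedDict

NUM_SETS, WAYS, BLOCK_BYTES = 4, 2, 8
sets = [OrderedDict() for _ in range(NUM_SETS)]   # tag -> None, in LRU order

def access(addr):
    block = addr // BLOCK_BYTES
    s, tag = block % NUM_SETS, block // NUM_SETS
    ways = sets[s]
    if tag in ways:
        ways.move_to_end(tag)            # mark as most recently used
        return "hit"
    if len(ways) == WAYS:
        ways.popitem(last=False)         # evict the least recently used way
    ways[tag] = None
    return "miss"

trace = [access(a) for a in [8, 0, 52, 20, 56, 16, 24, 116, 20, 8, 16]]
print(trace)
# -> ['miss', 'miss', 'miss', 'miss', 'miss', 'hit', 'miss', 'miss', 'hit', 'hit', 'hit']
```

The result matches the worked trace: 7 misses and 4 hits, with the access to 116 evicting block [48-55] from set 2 under LRU.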
• Reduce the number of tag comparisons to reduce cost.

Associativity             Location method                             Tag comparisons
Direct mapped             Index => entry                              1
n-way set associative     Index => set, then search the n entries     n
                          within the set
Fully associative         Search all entries                          Number of entries
Comparing Finding a Block
• Refers to the situation when, on a miss, there is no room for a new block and one of the existing blocks has to be removed from the cache.
• Direct mapped cache: no choice.
• N-way set associative cache:
– choose among n entries in the set;
• Fully associative cache:
– choose among all entries in the cache;
• Least-recently used (LRU) replacement policy:
– choose the one unused for the longest time,
– simple for 2-way, manageable for up to 16-way, probably too inefficient beyond that;
• Random replacement policy also possibility;
Replacement Policy
• Write is more complex and takes longer than read since:
─ the write and the tag comparison cannot proceed simultaneously, as the read and tag comparison can in the case of a cache read,
─ only a portion of the block has to be updated;
• Two approaches when cache hit: write through & write back
• Write through technique:
– updates a main memory and the block in cache,
– but it makes writes take longer and it does not take advantage of cache, although simple to implement.
• Improvement to this technique => a special write buffer:
– it holds data waiting to be written to memory,
– CPU continues immediately, and CPU stalls on write only if write buffer is already full.
Cache Writing on Hit
• Write back technique:
– just update the block in cache,
– keep track of whether or not a block has been updated; each entry has a dirty bit for that purpose.
– update in memory when cache entry has to be replaced
• Cache and memory are inconsistent; problem in
multi-processor (-core) systems. Why?
• When a dirty block is to be replaced:
– write it back to memory,– can also use write buffer,
• Write through is usually found in level 1 data caches backed by level 2 cache that uses write back.
Cache Writing on Hit (cont.)
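The difference between the two hit policies can be caricatured with a tiny sketch (the function is invented here for illustration): consider the CPU writing the same cached block many times before it is evicted.

```python
# Sketch: main-memory write traffic for the two write-hit policies above,
# assuming all writes go to one cached block that is evicted once at the end.

def memory_writes(num_writes, policy):
    if policy == "write-through":
        return num_writes      # every write also updates main memory
    elif policy == "write-back":
        return 1               # dirty block written back once, on eviction
    raise ValueError(policy)

assert memory_writes(100, "write-through") == 100
assert memory_writes(100, "write-back") == 1
```

This is why write back saves memory bandwidth under temporal locality, at the cost of the dirty bit bookkeeping and the cache/memory inconsistency noted above.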
• What should happen on a write miss?
• Write-allocate: load block into cache, and update in cache
(good if more writes to the location follow);
– Write-back usually uses write-allocate;
• No-write-allocate: writes immediately only to main memory;
– Write-through usually uses no-write-allocate
Cache Writing on Miss
[Figure: Intel Core i7 processor package. Each of the four cores (Core 0 ... Core 3) has its own registers, L1 d-cache, L1 i-cache and L2 unified cache; an L3 unified cache shared by all cores connects to main memory.]

Intel Core i7 (4 cores):
• L1 i-cache and d-cache: 32 KB each, 8-way; access: 4 cycles
• L2 unified cache: 256 KB, 8-way; access: 11 cycles
• L3 unified cache: 8 MB, 16-way; access: 30-40 cycles
• Block size: 64 bytes for all caches.
Intel Core i7 Cache Hierarchy