
Chapter Seven Large and Fast: Exploiting Memory Hierarchy

© 2004 Morgan Kaufmann Publishers

Transcript
Page 1: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Chapter Seven

Large and Fast: Exploiting Memory Hierarchy

Page 2: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Page 3: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Outline

• 7.1 Introduction

• 7.2 The Basics of Caches

• 7.3 Measuring and Improving Cache Performance

• 7.4 Virtual Memory

• 7.5 A Common Framework for Memory Hierarchies

• 7.6 Real Stuff: The Pentium P4 and the AMD Opteron Memory Hierarchies

• 7.7 Fallacies and Pitfalls

• 7.8 Concluding Remarks

• 7.9 Historical Perspective and Further Reading

Page 4: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.1 Introduction

The principle of locality states that programs access a relatively small portion of their address space at any instant of time, just as you accessed a very small portion of the library’s collection.

Page 5: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Temporal locality The principle stating that if a data location is referenced then it will tend to be referenced again soon.

• Spatial locality The locality principle stating that if a data location is referenced, data locations with nearby addresses will tend to be referenced soon.

• Memory hierarchy A structure that uses multiple levels of memories; as the distance from the CPU increases, the size of the memories and the access time both increase.

• Block The minimum unit of information that can be either present or not present in the two-level hierarchy.

• Hit rate The fraction of memory accesses found in a cache.

• Miss rate The fraction of memory accesses not found in a level of the memory hierarchy.

Page 6: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Hit time The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.

• Miss penalty The time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, and insert it in the level that experienced the miss.

Page 7: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Memories: Review

• SRAM:

– value is stored on a pair of inverting gates

– very fast but takes up more space than DRAM (4 to 6 transistors)

• DRAM:

– value is stored as a charge on a capacitor (must be refreshed)

– very small but slower than SRAM (factor of 5 to 10)

(Diagram: an SRAM cell and a DRAM cell; the DRAM cell stores its value on a capacitor that is accessed through a pass transistor selected by the word line and read or written over the bit line.)

Page 8: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Memory technology   Typical access time         $ per GB in 2004
SRAM                0.5 – 5 ns                  $4,000 – $10,000
DRAM                50 – 70 ns                  $100 – $200
Magnetic disk       5,000,000 – 20,000,000 ns   $0.50 – $2

Page 9: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.1 The basic structure of a memory hierarchy.

Page 10: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.2 Every pair of levels in the memory hierarchy can be thought of as having an upper and lower level.

Page 11: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Exploiting Memory Hierarchy

• Users want large and fast memories!

SRAM access times are 0.5 – 5 ns at a cost of $4,000 to $10,000 per GB.
DRAM access times are 50 – 70 ns at a cost of $100 to $200 per GB.
Disk access times are 5 to 20 million ns at a cost of $0.50 to $2 per GB.

• Try and give it to them anyway

– build a memory hierarchy

(Diagram: the memory hierarchy as a pyramid. The CPU sits at the top above Level 1, Level 2, ..., Level n; distance from the CPU in access time increases going down the levels, as does the size of the memory at each level.)

Page 12: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Locality

• A principle that makes having a memory hierarchy a good idea

• If an item is referenced,

temporal locality: it will tend to be referenced again soon

spatial locality: nearby items will tend to be referenced soon.

Why does code have locality?

• Our initial focus: two levels (upper, lower)

– block: minimum unit of data

– hit: data requested is in the upper level

– miss: data requested is not in the upper level
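To make the two kinds of locality concrete, here is a minimal C sketch (illustrative, not from the slides): the reuse of sum and i on every iteration exhibits temporal locality, while walking the array in address order exhibits spatial locality, since neighboring elements share cache blocks.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++)    /* sequential writes: spatial locality  */
        a[i] = i;

    int sum = 0;                   /* sum and i are touched every pass:    */
    for (int i = 0; i < N; i++)    /* temporal locality                    */
        sum += a[i];               /* a[i], a[i+1], ... share cache blocks */

    printf("%d\n", sum);
    return 0;
}
```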

Page 13: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.2 The Basics of Caches

Page 14: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Direct-mapped cache A cache structure in which each memory location is mapped to exactly one location in the cache.

• Tag A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word.

• Valid bit A field in the tables of a memory hierarchy that indicates that the associated block in the hierarchy contains valid data.

• Cache miss A request for data from the cache that cannot be filled because the data is not present in the cache.

• Write-through A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two.

• Write buffer A queue that holds data while the data are waiting to be written to memory.

Page 15: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Write-back A scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.

• Split cache A scheme in which a level of the memory hierarchy is composed of two independent caches that operate in parallel with each other with one handling instructions and one handling data.

Page 16: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Cache

• Two issues:

– How do we know if a data item is in the cache?

– If it is, how do we find it?

• Our first example:

– block size is one word of data

– "direct mapped"

For each item of data at the lower level, there is exactly one location in the cache where it might be.

e.g., lots of items at the lower level share locations in the upper level

Page 17: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.4 The cache just before and just after a reference to a word Xn that is not initially in the cache.

Page 18: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Direct Mapped Cache

• Mapping: address is modulo the number of blocks in the cache

(Figure: an eight-block direct-mapped cache with indices 000 – 111. Memory addresses 00001, 01001, 10001, and 11001 map to cache index 001, and addresses 00101, 01101, 10101, and 11101 map to cache index 101: each address maps to the index given by its low-order three bits.)
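As a quick sketch of my own (not from the book): when the number of blocks is a power of two, the modulo mapping is just a bit-field split, with the remaining high-order bits stored as the tag.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 8                       /* must be a power of two         */

int main(void) {
    uint32_t addr  = 22;                   /* word address 10110 in binary   */
    uint32_t index = addr % NUM_BLOCKS;    /* low-order bits select a block  */
    uint32_t tag   = addr / NUM_BLOCKS;    /* high-order bits become the tag */
    printf("address %u -> index %u, tag %u\n", addr, index, tag); /* 6, 2 */
    return 0;
}
```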

Page 19: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Decimal address   Binary address   Hit or miss   Assigned cache block
of reference      of reference     in cache      (where found or placed)
22                10110₂           Miss (7.6b)   (10110₂ mod 8) = 110₂
26                11010₂           Miss (7.6c)   (11010₂ mod 8) = 010₂
22                10110₂           Hit           (10110₂ mod 8) = 110₂
26                11010₂           Hit           (11010₂ mod 8) = 010₂
16                10000₂           Miss (7.6d)   (10000₂ mod 8) = 000₂
3                 00011₂           Miss (7.6e)   (00011₂ mod 8) = 011₂
16                10000₂           Hit           (10000₂ mod 8) = 000₂
18                10010₂           Miss (7.6f)   (10010₂ mod 8) = 010₂

Page 20: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.6 The cache contents are shown after each reference request that misses, with the index and tag fields shown in binary.

a. The initial state of the cache after power-on:

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     N
111     N

b. After handling a miss of address 10110₂:

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y   10₂   Memory(10110₂)
111     N

Page 21: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


c. After handling a miss of address 11010₂:

Index   V   Tag   Data
000     N
001     N
010     Y   11₂   Memory(11010₂)
011     N
100     N
101     N
110     Y   10₂   Memory(10110₂)
111     N

d. After handling a miss of address 10000₂:

Index   V   Tag   Data
000     Y   10₂   Memory(10000₂)
001     N
010     Y   11₂   Memory(11010₂)
011     N
100     N
101     N
110     Y   10₂   Memory(10110₂)
111     N

Page 22: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


e. After handling a miss of address 00011₂:

Index   V   Tag   Data
000     Y   10₂   Memory(10000₂)
001     N
010     Y   11₂   Memory(11010₂)
011     Y   00₂   Memory(00011₂)
100     N
101     N
110     Y   10₂   Memory(10110₂)
111     N

f. After handling a miss of address 10010₂:

Index   V   Tag   Data
000     Y   10₂   Memory(10000₂)
001     N
010     Y   10₂   Memory(10010₂)
011     Y   00₂   Memory(00011₂)
100     N
101     N
110     Y   10₂   Memory(10110₂)
111     N

Page 23: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Direct Mapped Cache

• For MIPS:

(Figure: a direct-mapped cache with 1024 one-word blocks. The 32-bit address (bits 31 – 0) is split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset. The index selects one of the 1024 entries (0 – 1023); the entry's stored tag is compared with the address tag, and the comparison ANDed with the valid bit produces the hit signal while the 32-bit data word is read out.)

What kind of locality are we taking advantage of?

Page 24: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Bits in Cache

• How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?

• Answer
We know that 16 KB is 4K words, which is 2¹² words, and, with a block size of 4 words (2²), 2¹⁰ blocks. Each block has 4 × 32 or 128 bits of data, plus a tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is

2¹⁰ × (128 + (32 − 10 − 2 − 2) + 1) = 2¹⁰ × 147 = 147 Kbits

or 18.4 KB for a 16 KB cache. For this cache, the total number of bits in the cache is about 1.15 times as many as needed just for the storage of the data.
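The same bookkeeping as a small C helper (a sketch of my own, assuming a direct-mapped cache with power-of-two geometry and one valid bit per block):

```c
#include <stdio.h>

/* Total storage bits for a direct-mapped cache:
 * blocks * (data bits + tag bits + valid bit).                      */
static unsigned long cache_bits(unsigned addr_bits,
                                unsigned index_bits,  /* log2(blocks)      */
                                unsigned word_bits,   /* log2(words/block) */
                                unsigned byte_bits) { /* log2(bytes/word)  */
    unsigned long blocks = 1UL << index_bits;
    unsigned data = 32u << word_bits;                 /* data bits/block   */
    unsigned tag  = addr_bits - index_bits - word_bits - byte_bits;
    return blocks * (data + tag + 1);
}

int main(void) {
    /* 16 KB of data, 4-word blocks, 32-bit addresses: 1024 blocks. */
    printf("%lu Kbits\n", cache_bits(32, 10, 2, 2) / 1024);  /* 147 */
    return 0;
}
```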

Page 25: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Mapping an Address to a Multiword Cache Block

• Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?

• Answer

– The address of the block is (formula on page 474):

⌊Byte address / Bytes per block⌋

– Notice that this block address is the block containing all addresses between

⌊Byte address / Bytes per block⌋ × Bytes per block

and

⌊Byte address / Bytes per block⌋ × Bytes per block + (Bytes per block − 1)

– Thus, with 16 bytes per block, byte address 1200 is block address

⌊1200 / 16⌋ = 75

which maps to cache block number (75 modulo 64) = 11. In fact, this block maps all addresses between 1200 and 1215.
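In C, integer division supplies the floor, so the whole answer is two lines (my sketch):

```c
#include <stdio.h>

int main(void) {
    unsigned addr = 1200, bytes_per_block = 16, num_blocks = 64;
    unsigned block_addr  = addr / bytes_per_block;   /* 75 */
    unsigned cache_block = block_addr % num_blocks;  /* 11 */
    printf("byte %u -> block address %u -> cache block %u\n",
           addr, block_addr, cache_block);
    return 0;
}
```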

Page 26: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.8 Miss rate versus block size.

Page 27: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Direct Mapped Cache

• Taking advantage of spatial locality:

(Figure: a direct-mapped cache with multiword blocks. The address (bits 31 – 0) is split into an 18-bit tag, an 8-bit index that selects one of 256 entries, a block offset, and a byte offset. Each entry holds a valid bit, an 18-bit tag, and a 512-bit block; on a hit, a multiplexor driven by the block offset selects one 32-bit word from the block.)

Page 28: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Hits vs. Misses

• Read hits

– this is what we want!

• Read misses

– stall the CPU, fetch block from memory, deliver to cache, restart

• Write hits:

– can replace data in cache and memory (write-through)

– write the data only into the cache (write-back the cache later)

• Write misses:

– read the entire block into the cache, then write the word

Page 29: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


To understand the impact of different organizations of memory, let's define a set of hypothetical memory access times. Assume

• 1 memory bus clock cycle to send the address

• 15 memory bus clock cycles for each DRAM access initiated

• 1 memory bus clock cycle to send a word of data

If we have a cache block of four words and a one-word-wide bank of DRAMs, the miss penalty would be 1 + 4 × 15 + 4 × 1 = 65 memory bus clock cycles. Thus, the number of bytes transferred per bus clock cycle for a single miss would be

(4 × 4) / 65 = 0.25

Figure 7.11 shows three options for designing the memory system:

1. Memory is one word wide, and all accesses are made sequentially.

2. Increase the bandwidth to memory by widening the memory and the buses between the processor and memory.

3. The memory chips are organized in 4 banks.

Page 30: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Hardware Issues

• Make reading multiple words easier by using banks of memory

• It can get a lot more complicated...

(Figure 7.11: three memory organizations. a. One-word-wide memory organization: CPU, cache, and memory connected by a one-word bus. b. Wide memory organization: a wide memory and bus, with a multiplexor between the cache and the CPU. c. Interleaved memory organization: four one-word-wide memory banks, bank 0 through bank 3, sharing a single bus.)

Page 31: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


The miss penalty of each memory system

• A. One-word-wide memory: 1 + 4 × 15 + 4 × 1 = 65 cycles

• B. With a main memory width of two words: 1 + 2 × 15 + 2 × 1 = 33 cycles; with a width of four words: 17 cycles

• C. Interleaved banks: 1 cycle to transmit the address and read request to the banks, 15 cycles for all four banks to access memory, and 4 cycles to send the four words back to the cache:

1 + 1 × 15 + 4 × 1 = 20 cycles
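All four numbers fall out of one formula. Here is a sketch (assumptions mine: a four-word block, the latencies above, one address cycle, one 15-cycle DRAM round per group of parallel accesses, and one bus cycle per bus-width transfer):

```c
#include <stdio.h>

/* Miss penalty for fetching a 4-word block, in bus clock cycles:
 * 1 cycle for the address, 15 cycles per DRAM access round,
 * 1 cycle per bus transfer.                                      */
static unsigned miss_penalty(unsigned block_words,
                             unsigned memory_width_words,
                             unsigned banks) {
    unsigned rounds    = block_words / (memory_width_words * banks);
    unsigned transfers = block_words / memory_width_words;
    return 1 + rounds * 15 + transfers * 1;
}

int main(void) {
    printf("one-word-wide:       %u\n", miss_penalty(4, 1, 1)); /* 65 */
    printf("two-word-wide:       %u\n", miss_penalty(4, 2, 1)); /* 33 */
    printf("four-word-wide:      %u\n", miss_penalty(4, 4, 1)); /* 17 */
    printf("4 interleaved banks: %u\n", miss_penalty(4, 1, 4)); /* 20 */
    return 0;
}
```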

Page 32: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.12 DRAM size increased by multiples of four approximately once every three years until 1996, and thereafter doubled approximately every two years.

Year introduced   Chip size   $ per MB   Total access time to a new row/column   Column access time to existing row
1980              64 Kbit     $1500      250 ns                                   150 ns
1983              256 Kbit    $500       185 ns                                   100 ns
1985              1 Mbit      $200       135 ns                                   40 ns
1989              4 Mbit      $50        110 ns                                   40 ns
1992              16 Mbit     $15        90 ns                                    30 ns
1996              64 Mbit     $10        60 ns                                    12 ns
1998              128 Mbit    $4         60 ns                                    10 ns
2000              256 Mbit    $1         55 ns                                    7 ns
2002              512 Mbit    $0.25      50 ns                                    5 ns
2004              1024 Mbit   $0.10      45 ns                                    3 ns

Page 33: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.3 Measuring and Improving Cache Performance

Page 34: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Fully associative cache A cache structure in which a block can be placed in any location in the cache.

• Set-associative cache A cache that has a fixed number of locations (at least two) where each block can be placed.

• Least recently used (LRU) A replacement scheme in which the block replaced is the one that has been unused for the longest time.

• Multilevel cache A memory hierarchy with multiple levels of caches, rather than just a cache and main memory.

• Global miss rate The fraction of references that miss in all levels of a multilevel cache.

• Local miss rate The fraction of references to one level of a cache that miss; used in multilevel hierarchies.

Page 35: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Cache performance equations:

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Page 36: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Calculating Cache Performance

• Assume an instruction cache miss rate for a program is 2% and a data cache miss rate is 4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed. Use the instruction frequencies for SPECint2000 from Chapter 3, Figure 3.26, on page 228.

• Answer

Instruction miss cycles = I × 2% × 100 = 2.00 × I

With the frequency of loads and stores at 36%,

Data miss cycles = I × 36% × 4% × 100 = 1.44 × I

so the total memory-stall cycles are 2.00 I + 1.44 I = 3.44 I, and the CPI with stalls is 2 + 3.44 = 5.44.

CPU time with stalls / CPU time with perfect cache
= (I × CPI_stall × Clock cycle) / (I × CPI_perfect × Clock cycle)
= CPI_stall / CPI_perfect
= 5.44 / 2

The performance with the perfect cache is better by 5.44 / 2 = 2.72.

Page 37: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Cache Performance with Increased Clock Rate

• Suppose we increase the performance of the computer in the previous example by doubling its clock rate. Since the main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the computer be with the faster clock, assuming the same miss rate as the previous example?

• Answer

Measured in the faster clock cycles, the miss penalty is now 200 cycles, so

Total miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 6.88

Thus, the faster computer with cache misses will have a CPI of 2 + 6.88 = 8.88, compared to a CPI with cache misses of 5.44 for the slower computer.

Using the formula for CPU time from the previous example, we can compute the relative performance as

Performance with fast clock / Performance with slow clock
= Execution time with slow clock / Execution time with fast clock
= (IC × CPI_slow × Clock cycle) / (IC × CPI_fast × Clock cycle / 2)
= 5.44 / (8.88 × 1/2)
= 1.23
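Both examples reduce to a few lines of arithmetic. This C sketch (my own; the 2%/4% miss rates, 36% load/store frequency, and penalties come from the slides) reproduces the 5.44, 2.72, 8.88, and 1.23 figures:

```c
#include <stdio.h>

/* CPI including memory stalls: base CPI plus instruction-miss and
 * data-miss stall cycles per instruction.                          */
static double cpi_with_stalls(double base_cpi, double imiss, double dmiss,
                              double ls_freq, double penalty) {
    return base_cpi + imiss * penalty + ls_freq * dmiss * penalty;
}

int main(void) {
    double slow = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100.0); /* 5.44 */
    double fast = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 200.0); /* 8.88 */

    printf("CPI, slow clock:             %.2f\n", slow);
    printf("speedup of a perfect cache:  %.2f\n", slow / 2.0);   /* 2.72 */
    printf("CPI, doubled clock:          %.2f\n", fast);
    /* Relative performance: time_slow / time_fast, cycle time halved. */
    printf("speedup from doubling clock: %.2f\n", slow / (fast * 0.5));
    return 0;
}
```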

Page 38: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Performance

• Increasing the block size tends to decrease miss rate:

(Plot: miss rate (0% – 40%) versus block size in bytes (16 to 256) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.)

• Use split caches because there is more spatial locality in code:

Program   Block size in words   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       1                     6.1%                    2.1%             5.4%
gcc       4                     2.0%                    1.7%             1.9%
spice     1                     1.2%                    1.3%             1.2%
spice     4                     0.3%                    0.6%             0.4%

Page 39: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Performance

• Simplified model:

execution time = (execution cycles + stall cycles) × cycle time

stall cycles = # of instructions × miss ratio × miss penalty

• Two ways of improving performance:

– decreasing the miss ratio

– decreasing the miss penalty

What happens if we increase block size?

Page 40: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Decreasing miss ratio with associativity

Compared to direct mapped, give a series of references that:

– results in a lower miss ratio using a 2-way set associative cache

– results in a higher miss ratio using a 2-way set associative cache

assuming we use the "least recently used" replacement strategy (a simulation sketch follows the figure below)

(Figure: four placement schemes for an eight-block cache. One-way set associative (direct mapped): eight blocks, each forming its own set. Two-way set associative: four sets of two blocks. Four-way set associative: two sets of four blocks. Eight-way set associative (fully associative): one set of eight blocks. Each entry holds a tag and its data.)
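Here is a small C sketch (my construction, not the book's) that answers the exercise for a 4-block cache: the sequence 0, 4, 0, 4, ... misses every time when direct mapped but fits in one 2-way set, while 0, 2, 4 repeated does better direct mapped, because LRU in the 2-way cache keeps evicting exactly the block that is needed next.

```c
#include <stdio.h>
#include <string.h>

#define BLOCKS 4

/* Misses for a 4-block direct-mapped cache. */
static int direct_mapped(const int *refs, int n) {
    int tags[BLOCKS], misses = 0;
    memset(tags, -1, sizeof tags);
    for (int i = 0; i < n; i++) {
        int set = refs[i] % BLOCKS;
        if (tags[set] != refs[i]) { tags[set] = refs[i]; misses++; }
    }
    return misses;
}

/* Misses for a 4-block, 2-way set-associative cache with LRU.
 * tags[set][0] is the most recently used way.                  */
static int two_way_lru(const int *refs, int n) {
    int tags[2][2], misses = 0;
    memset(tags, -1, sizeof tags);
    for (int i = 0; i < n; i++) {
        int set = refs[i] % 2;
        if (tags[set][0] == refs[i]) continue;         /* hit, MRU way */
        if (tags[set][1] == refs[i]) {                 /* hit, LRU way */
            int t = tags[set][0];                      /* promote it   */
            tags[set][0] = tags[set][1]; tags[set][1] = t;
            continue;
        }
        tags[set][1] = tags[set][0];                   /* evict LRU    */
        tags[set][0] = refs[i];
        misses++;
    }
    return misses;
}

int main(void) {
    int a[] = {0, 4, 0, 4, 0, 4};   /* 2-way wins: both fit in one set    */
    int b[] = {0, 2, 4, 0, 2, 4};   /* direct mapped wins under LRU       */
    printf("0,4,...:   DM %d misses, 2-way %d misses\n",
           direct_mapped(a, 6), two_way_lru(a, 6));    /* 6 vs 2 */
    printf("0,2,4,...: DM %d misses, 2-way %d misses\n",
           direct_mapped(b, 6), two_way_lru(b, 6));    /* 5 vs 6 */
    return 0;
}
```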

Page 41: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


An implementation

(Figure: a four-way set-associative cache with 256 sets. The address supplies a 22-bit tag and an 8-bit index; the index selects one of sets 0 – 255, the four stored tags in that set are compared in parallel against the address tag, and a 4-to-1 multiplexor driven by the comparators selects the 32-bit data word and raises the hit signal.)

Page 42: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Performance

(Plot: miss rate (0% – 15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB through 128 KB.)

Page 43: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Decreasing miss penalty with multilevel caches

• Add a second level cache:

– often primary cache is on the same chip as the processor

– use SRAMs to add another cache above primary memory (DRAM)

– miss penalty goes down if data is in 2nd level cache

• Example (worked through below):

– CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access

– Adding a 2nd level cache with a 5 ns access time decreases the miss rate to 0.5%

• Using multilevel caches:

– try and optimize the hit time on the 1st level cache

– try and optimize the miss rate on the 2nd level cache
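Working the example through (my arithmetic, under the assumptions that the 0.5% figure is the global miss rate to DRAM and that an L2 hit costs only its 5 ns access): at 5 GHz a cycle is 0.2 ns, so DRAM costs 500 cycles and the L2 costs 25.

```c
#include <stdio.h>

int main(void) {
    double clock_ghz = 5.0, base_cpi = 1.0;
    double dram_cycles = 100.0 * clock_ghz;   /* 100 ns -> 500 cycles */
    double l2_cycles   =   5.0 * clock_ghz;   /*   5 ns ->  25 cycles */

    double cpi_l1 = base_cpi + 0.05 * dram_cycles;                  /* 26.0 */
    double cpi_l2 = base_cpi + 0.05 * l2_cycles + 0.005 * dram_cycles;
    printf("CPI, L1 only: %.2f\n", cpi_l1);
    printf("CPI, with L2: %.2f\n", cpi_l2);                         /* 4.75 */
    printf("speedup:      %.1f\n", cpi_l1 / cpi_l2);                /* ~5.5 */
    return 0;
}
```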

Page 44: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Cache Complexities

• Not always easy to understand implications of caches:

(Plots: theoretical and observed behavior of Radix sort vs. Quicksort as the input grows from 4 K to 4096 K items. In the theoretical plot (0 – 1200 per item) Radix sort stays flat and beats Quicksort; in the observed plot (0 – 2000 per item) Radix sort degrades sharply on large inputs.)

Page 45: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Cache Complexities

• Here is why:

(Plot: cache misses per item (0 – 5) for Radix sort vs. Quicksort as the input grows from 4 K to 4096 K items; Radix sort's misses per item climb steeply while Quicksort's stay low.)

• Memory system performance is often the critical factor

– multilevel caches and pipelined processors make it harder to predict outcomes

– compiler optimizations to increase locality sometimes hurt ILP

• Difficult to predict best algorithm: need experimental data

Page 46: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.4 Virtual Memory

Page 47: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Virtual memory A technique that uses main memory as a “cache” for secondary storage.

• Physical address An address in main memory.

• Protection A set of mechanisms for ensuring that multiple processes sharing the processor, memory, or I/O devices cannot interfere, intentionally or unintentionally, with one another by reading or writing each other’s data. These mechanisms also isolate the operating system from a user process.

• Page fault An event that occurs when an accessed page is not present in main memory.

• Virtual address An address that corresponds to a location in virtual space and is translated by address mapping to a physical address when memory is accessed.

Page 48: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Address translation Also called address mapping. The process by which a virtual address is mapped to an address used to access memory.

• Segmentation A variable-size address mapping scheme in which an address consists of two parts: a segment number, which is mapped to a physical address, and a segment offset.

• Page table The table containing the virtual to physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.

• Swap space The space on the disk reserved for the full virtual memory space of a process.

• Reference bit Also called use bit. A field that is set whenever a page is accessed and that is used to implement LRU or other replacement schemes.

Page 49: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Translation-lookaside buffer (TLB) A cache that keeps track of recently used address mappings to avoid an access to the page table.

• Virtually addressed cache A cache that is accessed with a virtual address rather than a physical address.

• Aliasing A situation in which the same object is accessed by two addresses; can occur in virtual memory when there are two virtual addresses for the same physical page.

• Physically addressed cache A cache that is addressed by a physical address.

• Kernel mode Also called supervisor mode. A mode indicating that a running process is an operating system process.

• System call A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.

Page 50: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Context switch A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.

• Exception enable Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.

• Restartable instruction An instruction that can resume execution after an exception is resolved without the exception’s affecting the result of the instruction.

• Handler Name of a software routine invoked to “handle” an exception or interrupt.

• Unmapped A portion of the address space that cannot have page faults.

Page 51: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Virtual Memory

• Main memory can act as a cache for the secondary storage (disk)

• Advantages:

– illusion of having more physical memory

– program relocation

– protection

(Figure: virtual addresses are mapped by address translation either to physical addresses in main memory or to disk addresses.)

Page 52: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Pages: virtual memory blocks

• Page faults: the data is not in memory, retrieve it from disk

– huge miss penalty, thus pages should be fairly large (e.g., 4 KB)

– reducing page faults is important (LRU is worth the price)

– can handle the faults in software instead of hardware

– using write-through is too expensive, so we use write-back

(Figure: address translation. A 32-bit virtual address is split into a virtual page number (bits 31 – 12) and a page offset (bits 11 – 0); translation replaces the virtual page number with a physical page number (bits 29 – 12) to form the physical address, leaving the offset unchanged.)
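A minimal C sketch of that translation (mine, assuming 4 KB pages and a flat one-level page table; a real MIPS walk would also check protection bits):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                    /* 4 KB pages           */
#define NUM_PAGES (1u << 20)            /* 32-bit virtual space */

typedef struct { uint32_t valid : 1, ppn : 18; } pte_t;

static pte_t page_table[NUM_PAGES];     /* flat one-level table */

/* Translate a virtual address: returns 0 and sets *pa on success,
 * -1 on a page fault (valid bit clear).                           */
static int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t off = va & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid)
        return -1;                      /* page fault */
    *pa = ((uint32_t)page_table[vpn].ppn << PAGE_BITS) | off;
    return 0;
}

int main(void) {
    page_table[5] = (pte_t){ .valid = 1, .ppn = 42 };  /* toy mapping */
    uint32_t pa;
    if (translate((5u << PAGE_BITS) | 0x123, &pa) == 0)
        printf("physical address 0x%x\n", pa);         /* 0x2a123 */
    return 0;
}
```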

Page 53: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Page Tables

(Figure: the page table maps each virtual page number either to a physical page in main memory, when the valid bit is 1, or to a location in disk storage, when the valid bit is 0.)

Page 54: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Page Tables

(Figure: hardware address translation. The page table register points to the start of the page table; the 20-bit virtual page number (bits 31 – 12) indexes the table, and each entry holds a valid bit and an 18-bit physical page number. If the valid bit is 0 the page is not present in memory; otherwise the physical page number is concatenated with the 12-bit page offset to form the physical address.)

Page 55: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Making Address Translation Fast

• A cache for address translations: translation lookaside buffer

(Figure: each TLB entry holds a valid bit, dirty and reference bits, a tag, and the physical page address for a recently used translation, pointing directly into physical memory; the full page table behind it maps every virtual page to a physical page or a disk address.)

Typical values: 16 – 512 entries, miss rate: 0.01% – 1%, miss penalty: 10 – 100 cycles

Page 56: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


TLBs and caches

(Flowchart: processing a read or write through the TLB and the cache.)

1. Take the virtual address and access the TLB. On a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address.

2. For a read, try to read the data from the cache. On a cache hit, deliver the data to the CPU; on a cache miss, stall while the block is read.

3. For a write, first check the write access bit. If it is off, raise a write protection exception. Otherwise try to write the data to the cache: on a hit, write the data into the cache, update the dirty bit, and put the data and the address into the write buffer; on a miss, stall while the block is read.
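The same control flow as a runnable C toy (my own simplification: one TLB entry and one cache line, just enough to exercise the hit, miss, and exception paths of the flowchart):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { bool valid, writable; uint32_t vpn, ppn; } tlb_t;
typedef struct { bool valid; uint32_t tag, data; } line_t;

static tlb_t  tlb  = { true, true, 0x5, 0x2a };   /* maps page 5 -> 42 */
static line_t line = { false, 0, 0 };

static void mem_access(uint32_t va, bool is_write, uint32_t *data) {
    if (!tlb.valid || tlb.vpn != (va >> 12)) {    /* TLB hit?          */
        puts("TLB miss exception");
        return;
    }
    uint32_t pa = (tlb.ppn << 12) | (va & 0xfff); /* physical address  */
    if (is_write && !tlb.writable) {              /* write access bit  */
        puts("write protection exception");
        return;
    }
    if (!line.valid || line.tag != pa) {          /* cache hit?        */
        puts("cache miss: stall while block is read");
        line = (line_t){ true, pa, 0 };
    }
    if (is_write) {
        line.data = *data;                        /* write into cache  */
        printf("write buffer <- (0x%x, %u)\n", pa, *data);
    } else {
        *data = line.data;                        /* deliver to CPU    */
    }
}

int main(void) {
    uint32_t d = 7;
    mem_access(0x5123, true,  &d);  /* write: TLB hit, cache miss */
    mem_access(0x5123, false, &d);  /* read:  TLB hit, cache hit  */
    mem_access(0x9000, false, &d);  /* TLB miss exception         */
    return 0;
}
```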

Page 57: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


(Figure: the TLB and cache working together. The 32-bit virtual address is split into a 20-bit virtual page number and a 12-bit page offset; the TLB (valid and dirty bits, tag, physical page number) translates the page number, and the resulting physical address is then split into a physical address tag, a cache index, a block offset, and a byte offset to access the cache, producing the TLB hit and cache hit signals and a 32-bit data word.)

Page 58: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.26 The possible combinations of events in the TLB, virtual memory system, and cache.

TLB    Page table   Cache   Possible? If so, under what circumstance?
hit    hit          miss    Possible, although the page table is never really checked if the TLB hits.
miss   hit          hit     TLB misses, but entry found in page table; after retry, data is found in cache.
miss   hit          miss    TLB misses, but entry found in page table; after retry, data misses in cache.
miss   miss         miss    TLB misses and is followed by a page fault; after retry, data must miss in cache.
hit    miss         miss    Impossible: cannot have a translation in TLB if page is not present in memory.
hit    miss         hit     Impossible: cannot have a translation in TLB if page is not present in memory.
miss   miss         hit     Impossible: data cannot be allowed in cache if the page is not in memory.

Page 59: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.27 MIPS control registers.

Register   CP0 register number   Description
EPC        14                    Where to restart after exception
Cause      13                    Cause of exception
BadVAddr   8                     Address that caused exception
Index      0                     Location in TLB to be read or written
Random     1                     Pseudorandom location in TLB
EntryLo    2                     Physical page address and flags
EntryHi    10                    Virtual page address
Context    4                     Page table address and page number

Page 60: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.5 A Common Framework for Memory Hierarchies

Page 61: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Three Cs model A cache model in which all cache misses are classified into one of three categories: compulsory misses, capacity misses, and conflict misses.

• Compulsory miss Also called cold-start miss. A cache miss caused by the first access to a block that has never been in the cache.

• Capacity miss A cache miss that occurs because the cache, even with full associativity, cannot contain all the blocks needed to satisfy the request.

• Conflict miss Also called collision miss. A cache miss that occurs in a set-associative or direct-mapped cache when multiple blocks compete for the same set; such misses are eliminated in a fully associative cache of the same size.

Page 62: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Question 1: Where can a block be placed?

• We have seen that block placement in the upper level of the hierarchy can use a range of schemes, from direct mapped to set associative to fully associative. As mentioned above, this entire range of schemes can be thought of as variations on a set-associative scheme where the number of sets and the number of blocks per set vary:

Scheme name         Number of sets                                Blocks per set
Direct mapped       Number of blocks in cache                     1
Set associative     (Number of blocks in cache) / Associativity   Associativity (typically 2 – 16)
Fully associative   1                                             Number of blocks in the cache

Page 63: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.29 The key quantitative design parameters that characterize the major elements of memory hierarchy in a computer (typical values).

Feature                      L1 caches    L2 caches        Paged memory                TLB
Total size in blocks         250 – 2000   4000 – 250,000   16,000 – 250,000            16 – 512
Total size in kilobytes      16 – 64      500 – 8000       250,000 – 1,000,000,000     0.25 – 16
Block size in bytes          32 – 64      32 – 128         4000 – 64,000               4 – 32
Miss penalty in clocks       10 – 25      100 – 1000       10,000,000 – 100,000,000    10 – 1000
Miss rates (global for L2)   2% – 5%      0.1% – 2%        0.00001% – 0.0001%          0.01% – 2%

Page 64: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.30 The data cache miss rates for each of eight cache sizes improve as the associativity increases.

Page 65: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Question 2: How is a block found?

• The choice of how we locate a block depends on the block placement scheme, since that dictates the number of possible locations. We can summarize the schemes as follows:

Associativity     Location method                        Comparisons required
Direct mapped     Index                                  1
Set associative   Index the set, search among elements   Degree of associativity
Full              Search all cache entries               Size of the cache
                  Separate lookup table                  0

Page 66: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Question 3: Which block should be replaced on a cache miss?

• We have already mentioned the two primary strategies for replacement in set-associative or fully associative caches:

– Random: Candidate blocks are randomly selected, possibly using some hardware assistance. For example, MIPS supports random replacement for TLB misses.

– Least recently used (LRU): The block replaced is the one that has been unused for the longest time.

Page 67: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Question 4: What happens on a write?

• A key characteristic of any memory hierarchy is how it deals with writes. We have already seen the two basic options:

– Write-through: The information is written to both the block in the cache and to the block in the lower level of the memory hierarchy (main memory for a cache). The caches in Section 7.2 used this scheme.

– Write-back (also called copy-back): The information is written only to the block in the cache. The modified block is written to the lower level of the hierarchy only when it is replaced. Virtual memory systems always use write-back, for the reasons discussed in Section 7.4.

Page 68: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


The three Cs : An intuitive model for understanding the behavior of memory hierarchies

• Compulsory misses : These are cache misses caused by the first access to a block that has never been in the cache. These are also called cold-start misses.

• Capacity misses : These are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur when blocks are replaced and then later retrieved.

• Conflict misses: These are cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that are eliminated in a fully associative cache of the same size. These cache misses are also called collision misses.

Page 69: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.31 The miss rate can be broken into three sources of misses.

Page 70: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.32 Memory hierarchy design challenges.

Design change            Effect on miss rate                                                           Possible negative performance effect
Increase cache size      Decreases capacity misses                                                     May increase access time
Increase associativity   Decreases miss rate due to conflict misses                                    May increase access time
Increase block size      Decreases miss rate for a wide range of block sizes due to spatial locality   Increases miss penalty; very large blocks could increase miss rate

Page 71: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.6 Real Stuff: The Pentium P4 and the AMD Opteron Memory Hierarchies

Page 72: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Nonblocking cache A cache that allows the processor to make references to the cache while the cache is handling an earlier miss.

Page 73: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Figure 7.33 An AMD Opteron die processor photo with the components labeled.

Page 74: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Modern Systems

Page 75: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Page 76: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.7 Fallacies and Pitfalls

Page 77: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


• Pitfall: Forgetting to account for byte addressing or the cache block size in simulating a cache.

• Pitfall: Ignoring memory system behavior when writing programs or when generating code in a compiler.

• Pitfall: Using average memory access time to evaluate the memory hierarchy of an out-of-order processor.

• Pitfall: Extending an address space by adding segments on top of an unsegmented address space.

Page 78: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


7.8 Concluding Remarks

Page 79: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Keywords

• Prefetching A technique in which data blocks needed in the future are brought into the cache early by the use of special instructions that specify the address of the block.

Page 80: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Modern Systems

• Things are getting complicated!

Page 81: Chapter Seven Large and Fast: Exploiting Memory Hierarchy


Some Issues

• Processor speeds continue to increase very fast, much faster than either DRAM or disk access times

• Design challenge: dealing with this growing disparity

– Prefetching? 3rd level caches and more? Memory design?

(Plot: processor and memory performance versus year on a log scale from 1 to 100,000; the CPU curve climbs far faster than the memory curve, steadily widening the processor-memory gap.)
