Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 2 (and Appendix B)
Memory Hierarchy Design
Computer Architecture: A Quantitative Approach, Fifth Edition
In the beginning…
- Main memory (i.e., RAM) was faster than processors
  - Memory access times were less than clock cycle times
- Everything was “great” until ~1980
Memory Performance Gap
Memory Performance Gap
- Programmers want unlimited amounts of memory with low latency
- Fast memory technology is more expensive per bit than slower memory
- Solution: organize the memory system into a hierarchy
  - Entire addressable memory space available in the largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  - Gives the illusion of a large, fast memory being presented to the processor
Memory Hierarchy
Memory Hierarchy Design
- Memory hierarchy design becomes more crucial with recent multi-core processors; aggregate peak bandwidth grows with the number of cores:
  - Intel Core i7 can generate two data references per core per clock
  - Four cores and a 3.2 GHz clock:
    - 25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
  - DRAM bandwidth is only 6% of this (25 GB/s)
- Requires:
  - Multi-port, pipelined caches
  - Two levels of cache per core
  - Shared third-level cache on chip
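The 409.6 GB/s figure follows directly from the slide's numbers; a quick back-of-the-envelope check (my own arithmetic, using only values stated above):

```python
# Peak bandwidth demanded by a 4-core, 3.2 GHz Core i7 (values from the slide).
cores = 4
clock = 3.2e9                           # cycles per second

data_refs = cores * clock * 2           # two 64-bit data refs per core per clock
inst_refs = cores * clock * 1           # one 128-bit instruction ref per core per clock

peak = data_refs * 8 + inst_refs * 16   # bytes per second (8 B and 16 B per ref)
print(peak / 1e9)                       # 409.6 GB/s
print(25 / (peak / 1e9))                # DRAM's 25 GB/s is ~6% of peak
```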
Performance and Power
- High-end microprocessors have >10 MB of on-chip cache
  - Consumes a large amount of the area and power budget
Cache Hits
- When a word is found in the cache, a hit occurs: Yay!
Cache Misses
- When a word is not found in the cache, a miss occurs:
  - Fetch the word from the lower level in the hierarchy, requiring a higher-latency reference
    - Lower level may be another cache or the main memory
  - Also fetch the other words contained within the block
    - Takes advantage of spatial locality
  - Place the block into the cache in a location determined by its address
Causes of Misses
- Miss rate: fraction of cache accesses that result in a miss
- Causes of misses:
  - Compulsory: first reference to a block
  - Capacity: blocks discarded and later retrieved
  - Conflict: program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
Causes of Misses
Quantifying Misses
- Note that speculative and multithreaded processors may execute other instructions during a miss
  - Reduces the performance impact of misses
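The standard way to quantify miss cost, as developed in Appendix B of the text, is average memory access time (AMAT), with miss rate often restated per instruction:

```latex
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\qquad
\frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}
```

For example (my own illustrative numbers): a 1-cycle hit time, 2% miss rate, and 100-cycle miss penalty give an AMAT of 1 + 0.02 × 100 = 3 cycles.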
Cache Design
- 4 Questions:
  - Where can a block be placed in the cache?
  - How is a block found?
  - Which block should be replaced?
  - How is a write handled?
Block Placement
- Divide the cache into sets to reduce conflict misses
- n blocks per set => n-way set associative
  - Direct-mapped cache => one block per set
  - Fully associative => one set
Block Placement
Block Placement
- For a direct-mapped cache
  - High-order address bits determine block location (uniquely)
- For example: 32-bit addresses, 64 blocks, 8-byte blocks
  - Low-order 3 address bits determine byte location in block
  - High-order 32 − 3 = 29 bits determine block location in cache
  - 64 blocks, so location is the value of the high-order 29 bits mod 64
  - Can simply use the low-order 6 bits of those 29 bits!
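The bit arithmetic above can be made concrete with a small helper (my own sketch, not from the slides) for this exact configuration:

```python
# Direct-mapped address split: 32-bit address, 64 blocks of 8 bytes each.
def decompose(addr):
    byte = addr & 0x7           # low 3 bits: byte within the 8-byte block
    block_addr = addr >> 3      # remaining 29 bits: block address
    index = block_addr & 0x3F   # block_addr mod 64 == its low 6 bits
    tag = block_addr >> 6       # remaining 23 bits identify the block
    return tag, index, byte

print(decompose(0x00000048))    # (0, 9, 0): maps to cache block 9
```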
Block Placement
- In general, the block address is divided into tag and index based on associativity
  - An n-way associative cache has a number of sets equal to blocks/associativity, so the index is log2(blocks/associativity) bits
  - For example, a 2-way associative cache with 64 blocks has 32 sets, so 5 index bits
- Tag is used to identify the block during lookup
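The general rule can be checked for the extremes, too; a tiny helper (hypothetical, mine) covering the spectrum from direct mapped to fully associative:

```python
# Index width for an n-way cache: log2(blocks / associativity) bits.
from math import log2

def index_bits(blocks, ways):
    sets = blocks // ways
    return int(log2(sets))

print(index_bits(64, 2))    # 2-way, 64 blocks: 32 sets -> 5 index bits
print(index_bits(64, 1))    # direct mapped: 64 sets -> 6 index bits
print(index_bits(64, 64))   # fully associative: 1 set -> 0 index bits
```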
Causes of Misses
Cache Lookup
- Easy for direct mapped: each block has a unique location
  - Just need to verify the tag
- For n-way, check the tags at that index in all n ways simultaneously
  - Higher associativity means more complex hardware (i.e., slower)
- Also, each entry needs a “valid” bit
  - Avoids initialization issues
Opteron L1 Data Cache (2-way)
Replacement
- Easy for direct mapped: each block has a unique location (again)
- For higher associativities:
  - Least Recently Used (LRU)
  - First In, First Out (FIFO)
  - Random
Writes
- Writing to cache: two strategies
  - Write-through: immediately update lower levels of the hierarchy
  - Write-back: only update lower levels of the hierarchy when an updated block is replaced (uses a dirty bit)
- Both strategies use a write buffer to make writes asynchronous
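The contrast between the two strategies can be sketched for a single block, with a dict standing in for the next lower level (a minimal sketch of my own, not the book's code):

```python
# Write-through vs. write-back for one cache block with a dirty bit.
class Block:
    def __init__(self):
        self.data = 0
        self.dirty = False

memory = {"blk": 0}               # stand-in for the lower level

def write_through(block, value):
    block.data = value
    memory["blk"] = value         # lower level updated on every write
                                  # (a real design buffers this store)

def write_back(block, value):
    block.data = value
    block.dirty = True            # lower level not touched yet

def evict(block):
    if block.dirty:               # write-back pays the cost only here
        memory["blk"] = block.data
        block.dirty = False

b = Block()
write_back(b, 42)
print(memory["blk"])              # 0: lower level is stale until eviction
evict(b)
print(memory["blk"])              # 42
```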
Example
- 32-bit addressing; 128 KB cache, 256 B blocks (512 blocks), 2-way set associative (256 sets), LRU replacement policy
- Low-order 8 bits are the offset, next-lowest 8 bits are the index, high-order 16 bits are the tag
Example
- From empty:
  - Read 0xAABB0100: Tag AABB, Index 01, Offset 00; place in set 1, bank 0. Miss (compulsory)
  - Read 0xABBA0101: Tag ABBA, Index 01, Offset 01; place in set 1, bank 1. Miss (compulsory)
  - Read 0xABBA01A3: Tag ABBA, Index 01, Offset A3; located in set 1, bank 1. Hit!
  - Read 0xCBAF01FF: Tag CBAF, Index 01, Offset FF; place in set 1, bank 0 (LRU). Miss (conflict)
  - Read 0xAABB01CC: Tag AABB, Index 01, Offset CC; place in set 1, bank 1 (LRU). Miss (conflict)
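This access sequence can be replayed with a minimal Python sketch of the example's cache (my own code, not from the slides): 2-way set associative, 256 sets, 256-byte blocks, LRU replacement. It tracks only tags, which is enough to classify hits and misses.

```python
def split(addr):
    """Split a 32-bit address into (tag, index, offset)."""
    offset = addr & 0xFF          # low 8 bits: byte within the 256 B block
    index = (addr >> 8) & 0xFF    # next 8 bits: one of 256 sets
    tag = addr >> 16              # high 16 bits: tag
    return tag, index, offset

class Cache:
    def __init__(self, sets=256, ways=2):
        # Each set is a list of tags ordered by recency; front = MRU.
        self.sets = [[] for _ in range(sets)]
        self.ways = ways

    def read(self, addr):
        tag, index, _ = split(addr)
        s = self.sets[index]
        if tag in s:
            s.remove(tag)
            s.insert(0, tag)      # refresh LRU order
            return "hit"
        if len(s) == self.ways:
            s.pop()               # evict the least recently used tag
        s.insert(0, tag)
        return "miss"

cache = Cache()
for addr in (0xAABB0100, 0xABBA0101, 0xABBA01A3, 0xCBAF01FF, 0xAABB01CC):
    print(hex(addr), cache.read(addr))   # miss, miss, hit, miss, miss
```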
Basic Optimizations
- Six basic cache optimizations:
  - Larger block size
    - Reduces compulsory misses
    - Increases conflict misses, increases miss penalty
  - Larger total cache capacity
    - Reduces capacity misses
    - Increases hit time, increases power consumption
  - Higher associativity
    - Reduces conflict misses
    - Increases hit time, increases power consumption
  - Higher number of cache levels
    - Reduces overall memory access time
  - Giving priority to read misses over writes
    - Reduces miss penalty
  - Avoiding address translation in cache indexing
    - Reduces hit time
Pitfalls
- Ignoring the impact of the operating system on the performance of the memory hierarchy
Memory Technology