Memory Hierarchy
Slide contents from:
Hennessy & Patterson, 5ed. Appendix B and Chapter 2.
David Wentzlaff, ELE 475 – Computer Architecture.
MJT, High Performance Computing, NPTEL.
Introduction
➢ Programmers want unlimited amounts of memory with low latency
➢ Fast memory technology is more expensive per bit than slower memory
➢ Solution: organize the memory system into a hierarchy
➢ Entire addressable memory space available in the largest, slowest memory
➢ Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
➢ Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
➢ Gives the illusion of a large, fast memory being presented to the processor
Register File
Memory Array - SRAM
Memory Arrays – DRAM
SRAM vs. DRAM
Foss, R.C. “Implementing Application-Specific Memory”, ISSCC 1996
Memory Technology Trade-offs
[Figure: memory technology trade-offs. Latches/Registers → Register File → SRAM → DRAM, moving from low capacity, low latency, high bandwidth (more and wider ports) to high capacity, high latency, low bandwidth.]
Memory Hierarchy
Memory Performance Gap
Memory Performance Gap
● Aggregate peak bandwidth required by an Intel Core i7
  – 2 data references per clock
  – 4 cores, 3.2 GHz
  – 25.6 billion 64-bit data references/sec + 12.8 billion 128-bit instruction references/sec
  – = 409.6 GB/s
● DRAM bandwidth = 25.6 GB/s; the gap is bridged by:
  – Multiport, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip
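Written out, the arithmetic behind these figures is:
  data:         4 cores × 3.2 GHz × 2 refs/clock = 25.6 × 10^9 refs/s; × 8 B/ref = 204.8 GB/s
  instructions: 4 cores × 3.2 GHz = 12.8 × 10^9 refs/s; × 16 B/ref = 204.8 GB/s
  total:        204.8 + 204.8 = 409.6 GB/s, versus 25.6 GB/s from DRAM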
Predictable Memory Reference Patterns
Spatial and Temporal Locality
Hatfield and Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Inside a Cache
[Figure: the cache sits between the processor and main memory. A cache controller manages an array of tags and a data array organized as lines (blocks) 0, 1, …, N.]
Classification of Caches
● Block Placement
  – Where can a block be placed in a cache?
  – Direct mapped, set associative, fully associative
● Block Identification
  – How is a block found if it is in the cache?
● Block Replacement
  – Which block should be replaced on a miss?
● Write Strategy
  – What happens on a write?
Block Identification

2^Index = Cache Size / (Block Size × Set Associativity)

[Figure: tag/index/offset address breakdown for a direct mapped and a 2-way set associative cache]
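A quick worked example with the parameters of the cache access example later in these slides: a direct mapped (associativity 1), 32 KB cache with 32 B blocks gives 2^Index = 32768 / (32 × 1) = 1024, so Index = 10 bits; with a 32 b address, Offset = 5 bits and Tag = 32 − 10 − 5 = 17 bits (the 15 low bits, index plus offset, select a byte within the cache).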
Block Replacement
● No choice in a direct mapped cache
● In an associative cache, which block from the set should be evicted when the set becomes full?
● Random
● Least Recently Used (LRU)
  – LRU cache state must be updated on every access (see the sketch below)
  – True implementation only feasible for small sets (2-way)
  – Pseudo-LRU ("almost LRU") otherwise
● First In, First Out (FIFO), aka Round-Robin
  – Used in highly associative caches
● Not Most Recently Used (NMRU)
  – FIFO with an exception for the most recently used block(s)
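A minimal C sketch (ours, not from the slides) of why true LRU is cheap for 2-way sets: a single MRU bit per set fully encodes the LRU order.

#include <stdint.h>

#define NUM_SETS 1024                    /* assumed cache geometry */

/* For 2-way associativity, one bit per set encodes the full LRU order:
   mru[s] is the way of set s that was used most recently. */
static uint8_t mru[NUM_SETS];

/* Update on every access (hit or fill) to 'way' of 'set'. */
static void lru_touch(unsigned set, unsigned way) { mru[set] = (uint8_t)way; }

/* On a miss, evict the way that is NOT the most recently used. */
static unsigned lru_victim(unsigned set) { return mru[set] ^ 1u; }

For higher associativity this state grows quickly (true LRU needs an ordering of all ways), which is why larger sets fall back to pseudo-LRU.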
Write Strategy: How are writes handled?
● Cache Hit
  – Write Through – write both cache and memory; generally higher traffic but simpler to design
  – Write Back – write cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write backs; more complicated
● Cache Miss
  – No Write Allocate – only write to main memory
  – Write Allocate – fetch the block into the cache, then write
● Common Combinations
  – Write Through & No Write Allocate
  – Write Back & Write Allocate (sketched below)
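A minimal C sketch of the Write Back & Write Allocate combination; the line layout and the memory-side helpers writeback_block/fetch_block are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

/* Hypothetical memory-side helpers. */
void writeback_block(struct line *l);            /* dirty block -> memory */
void fetch_block(struct line *l, uint32_t tag);  /* memory -> block, sets valid */

void store_byte(struct line *l, uint32_t tag, unsigned off, uint8_t b) {
    if (!(l->valid && l->tag == tag)) {  /* write miss */
        if (l->valid && l->dirty)
            writeback_block(l);          /* dirty bit avoids needless write backs */
        fetch_block(l, tag);             /* write allocate: fetch block, then write */
    }
    l->data[off] = b;                    /* write back: update the cache only...   */
    l->dirty = true;                     /* ...memory is written on eviction       */
}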
Average Memory Access Time
[Figure: processor ↔ cache ↔ main memory; each cache access is either a HIT or a MISS]

Avg. Memory Access Time = Hit Time + (Miss Rate × Miss Penalty)
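For illustration (assumed numbers, not from the slides): with a 1-cycle hit time, a 2% miss rate, and a 100-cycle miss penalty, Avg. Memory Access Time = 1 + (0.02 × 100) = 3 cycles.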
Categorizing Misses: The Three C's
● Cold Start Misses (Compulsory)
  – First reference to a block
● Capacity
  – Cache is too small to hold all the data needed by the program; these misses occur even under a perfect replacement policy
● Conflict (Collision) Misses
  – Misses that occur because of collisions due to less-than-full associativity
Six Basic Cache Optimizations
● Larger block size to reduce miss rate
● Larger caches to reduce miss rate
● Higher associativity to reduce miss rate
● Multilevel caches to reduce miss penalty
● Giving priority to read misses over writes to reduce miss penalty
● Avoiding address translation during indexing of the cache to reduce hit time
Prioritizing read misses over writes
● A direct mapped, write-through cache maps addresses 512 and 1024 to the same cache block, and the write buffer is not checked on a read miss. Will the value in R2 always be equal to the value in R3? (See the sequence below.)
● Solutions: check the write buffer for the address on a read miss and let the read proceed only if there is no conflict. For write back: copy the dirty block to a buffer, read memory first, and then write the buffered block to memory.
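The instruction sequence behind this question (the example in H&P Appendix B):

SW R3, 512(R0)    ; M[512] ← R3, goes to the write buffer (write through)
LW R1, 1024(R0)   ; read miss; M[1024] fills the shared cache block
LW R2, 512(R0)    ; read miss; if memory is read before the write buffer
                  ; drains, R2 gets the stale M[512] and R2 ≠ R3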
Avoiding Address Translation During Indexing of the Cache to Reduce Hit Time
● Address translation is on the critical path
  – It can determine the clock cycle time of the processor!
● Use the virtual address or the physical address for the index and the tag?
  – E.g., virtually indexed, physically tagged caches
[Figure: two organizations. (a) Physically addressed cache: the CPU sends the virtual address to the MMU, and the resulting physical address accesses the cache. (b) Virtually addressed cache: the CPU accesses the cache with the virtual address directly; the MMU translates to a physical address only on a cache miss, before going to main memory (MM).]
Virtual Memory
Page Translation Table per Process
[Figure: each page table entry maps a Virtual Page Number to either a Physical Page Number (valid bit set) or a Disk Address (valid bit clear); a sketch of one entry follows]
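A minimal C sketch of one such entry; the field names and widths are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;      /* valid bit: is the page in main memory?   */
    uint64_t ppn;        /* physical page number, used when valid    */
    uint64_t disk_addr;  /* disk address of the page, used otherwise */
} pte_t;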
Opteron Example
Virtual Memory – Page Faults
● Situation where the virtual address generated by the processor is not available in main memory
● Detected on an attempt to translate the address
  – The page table entry is invalid
● Must be `handled' by the operating system (sketched below)
  – Identify a slot in main memory to be used
  – Get the page contents from disk
  – Update the page table entry
● Data can then be provided to the processor
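A minimal C sketch of those OS steps, using the pte_t sketched earlier; pick_victim_frame and the disk helpers are hypothetical names for illustration.

/* Hypothetical OS helpers. */
typedef struct { uint64_t number; pte_t *mapped_pte; } frame_t;
frame_t pick_victim_frame(void);                   /* identify a main-memory slot */
void    write_page_to_disk(frame_t f);             /* save an evicted dirty page  */
void    read_page_from_disk(uint64_t disk_addr, frame_t f);

void handle_page_fault(pte_t *pte) {
    frame_t f = pick_victim_frame();
    if (f.mapped_pte) {
        write_page_to_disk(f);              /* evict whatever occupies the slot */
        f.mapped_pte->valid = false;
    }
    read_page_from_disk(pte->disk_addr, f); /* get the page contents from disk  */
    pte->ppn   = f.number;                  /* update the page table entry...   */
    pte->valid = true;                      /* ...the faulting access can retry */
}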
Fast Translation
● Address translation is on the critical path
● Paging – 2 memory accesses!
  – Address translation (page table) + data
● Translation Lookaside Buffer (TLB)
VM – Example
Cache Access Example
[Figure: cache access for a direct mapped, 32 KB cache with 32 B blocks and a 32 b main-memory address. The address is split into Tag, Index (15 b), and Offset; the index selects a cache line holding words W1, W2 … W7, W8, and the stored tag is compared (=?) against the address tag: match → cache hit, no match → cache miss. Since the tag is not needed until the cache line has been read, the selected word can be sent to the processor in parallel with the tag check.]
VM Example
[Figure: virtually indexed, physically tagged access with a 64 b virtual address. The virtual page number, split into TLB tag and TLB index, is looked up in the TLB → TLB hit or page fault. In parallel, the cache index and offset come from the virtual address (virtually indexed) and select an L1 cache block. The physical address delivered by the TLB supplies the tag for the comparison (=?), so the tag comes from the physical address (physically tagged) → L1 hit or miss. On an L1 miss, the physical address (tag, index, offset) goes to the physically addressed L2 cache and selects an L2 cache block.]
Summary of Basic Cache Optimizations
Advanced Cache Optimizations
● Reducing hit time
  – Small L1, way prediction
● Increasing cache bandwidth
  – Pipelined caches, multibanked caches, non-blocking caches
● Reducing the miss penalty
  – Critical word first, merging write buffers
● Reducing the miss rate
  – Compiler optimizations
● Reducing the miss penalty or miss rate via parallelism
  – Hardware and compiler prefetching
Small and Simple L1 Caches
● Low latency and low power
  – Critical timing path: indexing, tag comparison, selecting the correct set
● Direct mapped caches can overlap tag comparison and transmission of data
● Lower associativity reduces power because fewer cache lines are accessed
● CAD tools to estimate hit time and power
  – CACTI
  – Input parameters: feature size, cache size, associativity, number of read/write ports
L1 Size and Associativity
Energy per read vs. size and associativity
Way Prediction
● To improve hit time, predict the way and pre-set the mux
  – Extra bits per cache block
  – A mis-prediction gives a longer hit time
  – Prediction accuracy: 90% for two-way, 80% for four-way; the I-cache has better accuracy than the D-cache
  – Used today: ARM Cortex-A8
● Extend to predict the block as well (way selection)
  – Saves access power
  – Increases the mis-prediction penalty
  – Makes the cache difficult to pipeline
Pipelining Cache
● Pipeline cache access to improve bandwidth
  – Examples: Pentium: 1 cycle; Pentium Pro through Pentium III: 2 cycles; Pentium 4 through Core i7: 4 cycles
● Increases the branch misprediction penalty
● Makes it easier to increase associativity
Nonblocking Caches
● Allow hits before previous misses complete
  – Hit under miss
  – Hit under multiple misses
● L2 must support this
● In general, processors can hide an L1 miss penalty but not an L2 miss penalty
Multibanked Caches
● The cache is divided into independent banks
● Banks can be accessed simultaneously
● Spread block addresses sequentially across the banks (sequential interleaving), as in the mapping below
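For example, with four banks under sequential interleaving, bank number = block address mod 4: blocks 0, 4, 8, … map to bank 0; blocks 1, 5, 9, … to bank 1; blocks 2, 6, 10, … to bank 2; and blocks 3, 7, 11, … to bank 3.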
Critical Word First and Early Restart
● The processor needs only one word at a time
● Critical word first:
  – Request the missed word from memory first
  – Send it to the processor as soon as it arrives
● Early restart:
  – Request the words in normal order
  – Send the missed word to the processor as soon as it arrives
● Effectiveness depends on block size and on the likelihood of another access to the portion of the block that has not yet been fetched
Merging Write Buffer
● When storing to a block that is already pending in the write buffer, update the write buffer entry
● Reduces stalls due to a full write buffer
● Does not apply to I/O addresses
[Figure: write buffer contents without merging vs. with merging]
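For example (illustrative addresses): four sequential 8 B writes to addresses 100, 108, 116, and 124 occupy four entries in a non-merging write buffer, but merge into a single entry covering one 32 B block in a merging write buffer.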
Compiler Optimizations
● Loop interchange
  – Swap nested loops to access memory in sequential order
● Blocking
  – Instead of accessing entire rows or columns, subdivide matrices into blocks
  – Requires more memory accesses but improves the locality of the accesses
Storage Order of Arrays
● Storage order of 2D matrices in memory
  – The language/compiler's decision (address formulas below)
● Row major order
  – Elements stored row-wise
  – C, C++
● Column major order
  – Elements stored column-wise
  – FORTRAN
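Concretely, for an M×N array with element size s: row major stores A[i][j] at base + (i × N + j) × s, while column major stores it at base + (j × M + i) × s.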
2D Matrix Sum

double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
  for (i = 0; i < 1024; i++)
    B[i][j] = A[i][j] + B[i][j];
Load A[0][0] Load B[0][0] Store B[0][0]
Load A[1][0] Load B[1][0] Store B[1][0]
.... .... ....
Load A[0][1] Load B[0][1] Store B[0][1]
Load A[0][2] Load B[0][2] Store B[0][2]
... .... ....
● Load A[0][0] is a cold start miss
● A cache block contains A[0][0] → A[0][3]
● The reference order is different from the storage order!
● Every iteration: cold miss (A), cold miss (B), hit (the store to B)
● Hit ratio: 33%
Loop Interchange

Before:
double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
  for (i = 0; i < 1024; i++)
    B[i][j] = A[i][j] + B[i][j];

After:
double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    B[i][j] = A[i][j] + B[i][j];
Loop Interchange
● A cache block contains A[0][0] → A[0][3]
● Hit ratio: 83.3%

double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    B[i][j] = A[i][j] + B[i][j];
Load A[0][0] Load B[0][0] Store B[0][0]
Load A[0][1] Load B[0][1] Store B[0][1]
.... .... ....
Load A[1][0] Load B[1][0] Store B[1][0]
Load A[1][1] Load B[1][1] Store B[1][1]
... .... ....
Matrix Multiplication

double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];

Reference order:
Y[0,0], Z[0,0], Y[0,1], Z[1,0], Y[0,2], Z[2,0], .... X[0,0]
Y[0,0], Z[0,1], Y[0,1], Z[1,1], Y[0,2], Z[2,1], .... X[0,1]
....
Y[1,0], Z[0,0], Y[1,1], Z[1,0], Y[1,2], Z[2,0], .... X[1,0]
Blocking● Make full use of the elements of Z when they
are brought into the cache
[Figure: 2×2 blocks of Y and Z, each with elements (0,0), (0,1), (1,0), (1,1)]

X = Y × Z:
X[0,0] = Y[0,0] × Z[0,0] + Y[0,1] × Z[1,0]
X[1,0] = Y[1,0] × Z[0,0] + Y[1,1] × Z[1,0]
Blocking

Before:
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];

After (B is the blocking factor):
double X[N][N], Y[N][N], Z[N][N];
for (J = 0; J < N; J += B)
  for (K = 0; K < N; K += B)
    for (i = 0; i < N; i++)
      for (j = J; j < min(J+B, N); j++) {
        for (k = K, r = 0; k < min(K+B, N); k++)
          r += Y[i][k] * Z[k][j];
        X[i][j] += r;
      }
Hardware Prefetching
● Fetch two blocks on a miss: the requested block and the next sequential block
Pentium 4 Pre-fetching
Compiler Prefetching
● Insert prefetch instructions before the data is needed (see the sketch below)
● Non-faulting: a prefetch doesn't cause exceptions
● Register prefetch
  – Loads data into a register
● Cache prefetch
  – Loads data into the cache
● Combine with loop unrolling and software pipelining
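A minimal C sketch of cache prefetching applied to the earlier matrix-sum loop, using GCC/Clang's __builtin_prefetch; the prefetch distance of 64 elements is an assumption that would be tuned per machine.

double A[1024][1024], B[1024][1024];
int i, j;
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++) {
    if (j + 64 < 1024) {                       /* stay inside the current row */
      __builtin_prefetch(&A[i][j + 64], 0, 1); /* prefetch ahead for reading  */
      __builtin_prefetch(&B[i][j + 64], 1, 1); /* prefetch ahead for writing  */
    }
    B[i][j] = A[i][j] + B[i][j];
  }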
Summary
Memory Technology
● Main memory serves as input and output to the I/O interfaces and the processor
● DRAM for main memory, SRAM for caches
● Metrics: latency, bandwidth
● Access Time
  – Time between a read request and when the desired word arrives
● Cycle Time
  – Minimum time between unrelated requests to memory
Memory Technology
● SRAM
  – Requires low power to retain the bit; 6 transistors/bit
● DRAM
  – Must be re-written after being read
  – Must be periodically refreshed (every ~8 ms)
  – Address lines are multiplexed:
    ● Upper half of the address: row access strobe (RAS)
    ● Lower half of the address: column access strobe (CAS)
Memory Technology – Optimizations
● Multiple accesses to the same row
● Synchronous DRAM
  – Clocked operation, burst mode
● Wider interfaces
● Double data rate
● Multiple banks on each DRAM device
Clock Rates, Bandwidth, and Names
[Table: DDR generations with their clock rates, bandwidths, and names; supply voltage dropped across generations: DDR 2.5 V, DDR2 1.8 V, DDR3 1.5 V, DDR4 1 – 1.2 V]
● GDDR5 – graphics memory based on DDR3
  – 2x – 5x bandwidth per DRAM vs. DDR3
  – Wider interface, higher clock rate
  – Attached via soldering instead of a socketed DIMM
Flash Memory
● A type of EEPROM
● Must be erased (in blocks) before being overwritten
● Non-volatile
● Limited number of write cycles
● Cheaper than SDRAM, more expensive than disk
● Slower than SDRAM, faster than disk
Memory Dependability
● Memory is susceptible to cosmic rays
  – Soft errors: ECC can detect and correct them
● Hard errors (permanent)
  – Spare rows are used as replacements in case a row fails
● Chip kill: a RAID-like error recovery technique