Memory and cache
[Figure: CPU, memory, and I/O]
College of Charleston, Dr. Anderson, Computer Science
CSCI 250: Comp. Org. & Assembly
The Memory Hierarchy
• Registers: ~2 ns
• Primary cache: ~4-5 ns
• Secondary cache: ~30 ns
• Main memory: ~220 ns+
• Magnetic disk: >1 ms (~6 ms)
Cache & Locality
• Cache sits between the CPU and main memory
– Invisible to the CPU
• Only useful if recently used items are used again
• Fortunately, this happens a lot. We call this property locality of reference.
[Figure: CPU, cache, main memory]
Locality of reference
• Temporal locality
– Recently accessed data/instructions are likely to be accessed again.
– Most program time is spent in loops
– Arrays are often scanned multiple times
• Spatial locality
– If I access memory address n, I am likely to then access another address close to n (usually n+1, n+2, or n+4)
– Linear execution of code
– Linear access of arrays
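Both kinds of locality show up in even the smallest array loop. The sketch below is only an illustration; the function name `sum_array` is hypothetical and does not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n ints. The name sum_array is illustrative only. */
long sum_array(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;                 /* total and i are touched on every pass: temporal locality */
    for (size_t i = 0; i < n; i++) {
        total += a[i];              /* a[0], a[1], a[2], ... sit side by side: spatial locality */
    }
    return total;
}
```

The loop body itself is also fetched repeatedly, which is the "most program time is spent in loops" case for instructions.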
How a cache exploits locality
• Temporal
– When an item is accessed from memory it is brought into the cache
– If it is accessed again soon, it comes from the cache and not main memory
• Spatial
– When we access a memory word, we also fetch the next few words of memory into the cache
– The number of words fetched is the cache line size, or the cache block size for the machine
Fully-associative cache
Cache & memory architecture
• 8-bit memory address: x00 to xFF (256 words)
• Cache size = 64 words, divided into four 16-word cache lines
• Memory is divided into 16-word blocks
[Figure: memory blocks 0 to F (Block 0 = x00 to x0F, Block 1 = x10 to x1F, Block 2 = x20 to x2F, ..., Block F = xF0 to xFF) beside a four-line cache with word columns 0 to F; the lines hold tags 0, 2, 7, and F]
Fully-associative cache
Note that the first four bits of the address are the block number…
Parsing the address
Address: 0101 0110
• Tag = 0101 (upper four bits, the block number)
• Word = 0110 (lower four bits, the word within the block)
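The tag/word split and the lookup it implies can be sketched in C. This is a minimal model under stated assumptions, not the hardware itself: the names `CacheLine`, `addr_tag`, `addr_word`, and `fa_lookup` are hypothetical, and hardware compares all tags in parallel rather than in a loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES 4            /* 64-word cache / 16-word lines */

typedef struct {
    bool    valid;
    uint8_t tag;               /* 4-bit block number held by this line */
} CacheLine;

/* Upper four bits of the address: the block number, used as the tag. */
uint8_t addr_tag(uint8_t addr)  { return (uint8_t)(addr >> 4); }

/* Lower four bits: the word within the 16-word block. */
uint8_t addr_word(uint8_t addr) { return (uint8_t)(addr & 0x0F); }

/* Returns the index of the line holding addr's block, or -1 on a miss.
   Any block may sit in any line, so every line has to be checked. */
int fa_lookup(const CacheLine *cache, uint8_t addr)
{
    assert(cache != NULL);
    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == addr_tag(addr)) {
            return i;
        }
    }
    return -1;
}
```

With the cache contents from the figure (tags 0, 2, 7, F), address x56 has tag 5 and misses, while x7A has tag 7 and hits in the third line.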
Direct-Mapped Cache
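[Figure: memory blocks (Block 0 = x00 to x0F, Block 1 = x10 to x1F, Block 2 = x20 to x2F, Block 3, Block 4 = x40 to x4F, …) mapped onto cache lines 0 to 3 with word columns 0 to F; the lines hold 2-bit tags 01, 10, 11, 10]
Address: 01 01 0110
• Tag = 01, Line = 01, Word = 0110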
Direct-Mapped Addresses
Block  Binary  Cache Line
0      0000    0
1      0001    1
2      0010    2
3      0011    3
4      0100    0
5      0101    1
6      0110    2
7      0111    3
8      1000    0
9      1001    1
A      1010    2
B      1011    3
C      1100    0
D      1101    1
E      1110    2
F      1111    3
Address: 01 01 0110 (Tag, Line, Word)
Do we need to store the entire 4-bit block number with each cache line?
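The table amounts to "cache line = block mod 4", so the low two bits of the block number are implied by where the block sits, and only the upper two bits need to be kept as the tag. A small C sketch of that split follows; `DmFields`, `dm_split`, and `dm_line_for_block` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t tag;    /* upper 2 bits of the block number */
    uint8_t line;   /* lower 2 bits of the block number = block mod 4 */
    uint8_t word;   /* word within the 16-word block */
} DmFields;

/* Splits an 8-bit address into tag (2 bits), line (2 bits), word (4 bits). */
DmFields dm_split(uint8_t addr)
{
    DmFields f;
    f.word = addr & 0x0F;
    f.line = (addr >> 4) & 0x03;
    f.tag  = (addr >> 6) & 0x03;
    return f;
}

/* Cache line for a block number 0..15, as in the table above. */
unsigned dm_line_for_block(unsigned block)
{
    assert(block < 16);
    return block % 4;
}
```

For the slide's address 0101 0110 this gives tag 01, line 01, word 0110.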
Set Associative Cache
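[Figure: memory blocks (Block 0 = x00 to x0F, Block 1 = x10 to x1F, Block 2 = x20 to x2F, Block 3, Block 4 = x40 to x4F, …) beside a cache whose four lines are grouped into Set 0 and Set 1; the lines hold 3-bit tags 001, 101, 110, 101]
Address: 010 1 0110
• Tag = 010, Set = 1, Word = 0110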
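A set-associative lookup combines the two earlier schemes: the set bit picks a set directly, and only the lines inside that set are searched by tag. The sketch below assumes two sets of two lines each and assumes the figure's four tags are listed set by set (001 and 101 in Set 0, 110 and 101 in Set 1); the names `Way`, `sa_set`, `sa_tag`, and `sa_lookup` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS 2
#define WAYS     2             /* lines per set (2-way) */

typedef struct {
    bool    valid;
    uint8_t tag;               /* upper 3 bits of the block number */
} Way;

/* Bit 4 of the address (lowest bit of the block number) selects the set. */
uint8_t sa_set(uint8_t addr) { return (uint8_t)((addr >> 4) & 0x01); }

/* The remaining upper 3 bits form the tag. */
uint8_t sa_tag(uint8_t addr) { return (uint8_t)(addr >> 5); }

/* Returns the way that holds addr's block within its set, or -1 on a miss. */
int sa_lookup(Way cache[NUM_SETS][WAYS], uint8_t addr)
{
    assert(cache != NULL);
    const Way *set = cache[sa_set(addr)];
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == sa_tag(addr)) {
            return w;
        }
    }
    return -1;
}
```

Only `WAYS` tags are compared per access, instead of every tag in the cache as in the fully-associative case.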
Cache mapping
• Direct mapped: each memory block can occupy one and only one cache block
• Example:
– Cache block size: 16 words
– Memory = 64K (4K blocks)
– Cache = 2K (128 blocks)
Direct Mapped Cache
• Memory block n occupies cache block (n mod 128)
• Consider address $2EF4
– Memory view: 001011101111 0100 (block: $2EF = 751, word: 4)
– Cache view: 00101 1101111 0100 (tag: 5, block: 111, word: 4)
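The same decomposition can be checked with integer division and remainder. This is a sketch for the example's parameters only (16-word blocks, 128 cache blocks); `DecodedAddr` and `decode` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 16u    /* cache block size from the example */
#define CACHE_BLOCKS    128u   /* 2K-word cache / 16 words per block */

typedef struct {
    unsigned mem_block;    /* block number in main memory */
    unsigned tag;          /* mem_block / 128 */
    unsigned cache_block;  /* mem_block mod 128 */
    unsigned word;         /* word within the block */
} DecodedAddr;

/* Splits a 16-bit word address into the fields used by the direct-mapped cache. */
DecodedAddr decode(uint16_t addr)
{
    DecodedAddr d;
    d.mem_block   = addr / WORDS_PER_BLOCK;
    d.word        = addr % WORDS_PER_BLOCK;
    d.cache_block = d.mem_block % CACHE_BLOCKS;
    d.tag         = d.mem_block / CACHE_BLOCKS;
    /* Sanity check: the fields must reassemble into the original address. */
    assert((d.tag * CACHE_BLOCKS + d.cache_block) * WORDS_PER_BLOCK + d.word == addr);
    return d;
}
```

Calling `decode(0x2EF4)` yields memory block 751, tag 5, cache block 111, and word 4, matching the slide.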
Fully Associative Cache
• More efficient cache utilization
– No wasted cache space
• Slower cache search
– Must check the tag of every entry in the cache to see if the block we want is there.
Set-associative mapping
• Blocks are grouped into sets
• Each memory block can occupy any block in its set
• This example is 2-way set-associative
• Which of the two blocks do we replace?
Cache write policies
• As long as we are only doing READ operations, the cache is an exact copy of a small part of the main memory
• When we write, should we write to cache or memory?
• Write through cache: write to both cache and main memory. Cache and memory are always consistent
• Write back cache: write only to cache and set a “dirty bit”. When the block gets replaced from the cache, write it out to memory.
• When might the write-back policy be dangerous?
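The difference between the two policies is easiest to see in a toy model with a single cache line. The sketch below is an assumption-laden illustration (one cached block, 256 words of memory, a counter of word writes to memory); `Sim`, `sim_init`, `write_through`, `write_back`, and `evict` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 16
#define MEM_WORDS   256

typedef struct {
    uint8_t  memory[MEM_WORDS];
    uint8_t  line[BLOCK_WORDS];   /* the single cached block */
    uint8_t  block;               /* which memory block the line holds */
    bool     dirty;               /* set when the line differs from memory */
    unsigned mem_word_writes;     /* words written to main memory so far */
} Sim;

/* Starts with zeroed memory and a clean line that mirrors the given block. */
void sim_init(Sim *s, uint8_t block)
{
    memset(s, 0, sizeof *s);
    s->block = block;
}

/* Write-through: update the cached copy and main memory together. */
void write_through(Sim *s, uint8_t addr, uint8_t value)
{
    assert(addr / BLOCK_WORDS == s->block);
    s->line[addr % BLOCK_WORDS] = value;
    s->memory[addr] = value;
    s->mem_word_writes++;
}

/* Write-back: update only the cached copy and mark the line dirty. */
void write_back(Sim *s, uint8_t addr, uint8_t value)
{
    assert(addr / BLOCK_WORDS == s->block);
    s->line[addr % BLOCK_WORDS] = value;
    s->dirty = true;
}

/* Replacement: a dirty line is copied out to memory before it is dropped. */
void evict(Sim *s)
{
    if (s->dirty) {
        memcpy(&s->memory[s->block * BLOCK_WORDS], s->line, BLOCK_WORDS);
        s->mem_word_writes += BLOCK_WORDS;
        s->dirty = false;
    }
}
```

Between a `write_back` call and the matching `evict`, main memory still holds the old value, so anything that reads memory without going through the cache sees stale data.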
Replacement algorithms
• Random
• Oldest first
• Least accesses
• Least recently used (LRU): replace the block that has gone the longest time without being referenced.
– This is the most commonly-used replacement algorithm
– Easy to implement with a small counter…
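Implementing LRU replacement
Suppose we have a 4-way set associative cache…
[Figure: Set 0 holds Block 0 to Block 3, with 2-bit counters 11, 10, 01, and 00]
• Hit:
– Increment every counter that is lower than the counter of the block that was hit
– Reset the hit block's counter to 00
• Miss:
– Replace the block whose counter is 11
– Set its counter to 00
– Increment all other counters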
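The counter rules above can be sketched in C as follows. This is one possible software model, assuming the four counters of a set always hold the values 0 to 3 exactly once (as in the figure); `LruSet`, `lru_hit`, and `lru_miss` are hypothetical names.

```c
#include <assert.h>

#define WAYS 4

/* One 2-bit counter per block in the set: 0 = most recently used, 3 = least. */
typedef struct {
    unsigned counter[WAYS];
} LruSet;

/* Hit on block `way`: age every block that was more recent, then reset this one. */
void lru_hit(LruSet *set, int way)
{
    assert(way >= 0 && way < WAYS);
    unsigned old = set->counter[way];
    for (int i = 0; i < WAYS; i++) {
        if (set->counter[i] < old) {
            set->counter[i]++;
        }
    }
    set->counter[way] = 0;
}

/* Miss: the block whose counter is 3 (binary 11) is the victim. It is
   replaced, its counter is reset, and all other counters are incremented.
   Returns the index of the replaced block. */
int lru_miss(LruSet *set)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (set->counter[i] == (unsigned)(WAYS - 1)) {
            victim = i;
        }
    }
    for (int i = 0; i < WAYS; i++) {
        if (i != victim) {
            set->counter[i]++;
        }
    }
    set->counter[victim] = 0;
    return victim;
}
```

Starting from the figure's counters (3, 2, 1, 0), a hit on Block 1 leaves (3, 0, 2, 1), and a following miss replaces Block 0 and leaves (0, 1, 3, 2). No counter ever exceeds 3, so two bits per block are enough.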
Interactions with DMA
• If we have a write-back cache, DMA must check the cache before acting.
– Many systems simply flush the cache before DMA write operations
• What if we have a memory word in the cache, and the DMA controller changes that word?
– Stale data
• We keep a valid bit for each cache line. When DMA changes memory, it must also set the valid bit to 0 for that cache line.
– “Cache coherence”
Typical Modern Cache Architecture
• L0 cache
– On chip
– Split 16 KB data/16 KB instructions
• L1 cache
– On chip
– 64 KB unified
• L2 cache
– Off chip
– 128 KB to 16+ MB
Memory Interleaving
• Memory is organized into modules or banks
• Within a bank, capacitors must be recharged after each read operation
• Successive reads to the same bank are slow
Non-interleaved (byte addresses held by each module):
Module 0: 0000 0002 0004 0006 … 00F8 00FA 00FC 00FE
Module 1: 0100 0102 0104 0106 … 01F8 01FA 01FC 01FE
Module 2: 0200 0202 0204 0206 … 02F8 02FA 02FC 02FE
Interleaved (byte addresses held by each module):
Module 0: 0000 0006 000C 0012 … 02E8 02EE 02F4 02FA
Module 1: 0002 0008 000E 0014 … 02EA 02F0 02F6 02FC
Module 2: 0004 000A 0010 0016 … 02EC 02F2 02F8 02FE
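The two layouts differ only in which address bits pick the module. The C sketch below reproduces the figure's numbers under the assumptions that there are three modules, words are 2 bytes wide, and each module holds x100 bytes; the function names are hypothetical.

```c
#include <assert.h>

#define NUM_MODULES  3
#define WORD_BYTES   2        /* addresses in the figure step by 2 */
#define MODULE_BYTES 0x100    /* each module holds x100 bytes */

/* Non-interleaved: consecutive words stay in the same module. */
unsigned module_non_interleaved(unsigned addr)
{
    assert(addr < NUM_MODULES * MODULE_BYTES);
    return addr / MODULE_BYTES;
}

/* Interleaved: consecutive words rotate through the modules. */
unsigned module_interleaved(unsigned addr)
{
    assert(addr < NUM_MODULES * MODULE_BYTES);
    return (addr / WORD_BYTES) % NUM_MODULES;
}
```

With interleaving, a sequential scan (0000, 0002, 0004, 0006, ...) visits modules 0, 1, 2, 0, ..., so each module has time to recharge before it is read again.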
Measuring Cache Performance
• No cache: often about 10 cycles per memory access
• Simple cache:
– t_ave = hC + (1 - h)M
– C is often 1 clock cycle
– Assume M is 17 cycles (to load an entire cache line)
– Assume h is about 90%
– t_ave = 0.9(1) + 0.1(17) = 2.6 cycles/access
– What happens when h is 95%?
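As a worked check of the closing question, the same formula can be evaluated with h = 0.95 while keeping the slide's assumed values C = 1 and M = 17:

```latex
\begin{aligned}
t_{ave} &= hC + (1-h)M \\
        &= 0.95(1) + 0.05(17) \\
        &= 0.95 + 0.85 \\
        &= 1.8 \text{ cycles/access}
\end{aligned}
```

Raising the hit rate by five percentage points cuts the average access time from 2.6 to 1.8 cycles, because the miss term (1 - h)M is halved.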
Multi-level cache performance
t_ave = h1 C1 + (1 - h1) h2 C2 + (1 - h1)(1 - h2) M
– h1 = hit rate in primary cache
– h2 = hit rate in secondary cache
– C1 = time to access primary cache
– C2 = time to access secondary cache
– M = miss penalty (time to load an entire cache line from main memory)
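The two-level formula translates directly into a small C function. The function name `t_ave_two_level` is hypothetical, and the sample values in the note below (h2 = 0.8, C2 = 5) are assumptions chosen only to exercise the formula; they do not come from the slides.

```c
#include <assert.h>

/* Average access time for a two-level cache:
   t_ave = h1*C1 + (1-h1)*h2*C2 + (1-h1)*(1-h2)*M
   h1, h2 are hit rates in [0, 1]; c1, c2, m are times in cycles. */
double t_ave_two_level(double h1, double c1, double h2, double c2, double m)
{
    assert(h1 >= 0.0 && h1 <= 1.0);
    assert(h2 >= 0.0 && h2 <= 1.0);
    return h1 * c1
         + (1.0 - h1) * h2 * c2
         + (1.0 - h1) * (1.0 - h2) * m;
}
```

For example, with h1 = 0.9, C1 = 1, M = 17 and the assumed h2 = 0.8, C2 = 5, the terms are 0.9 + 0.4 + 0.34, or about 1.64 cycles per access. Setting h2 = 0 reduces the expression to the single-level formula.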
Virtual Memory
Virtual Memory: Introduction
• Motorola 68000 has 24-bit memory addressing and is byte addressable
– Can address 2^24 bytes of memory (16 MB)
• Intel Pentiums have 32-bit memory addressing and are byte addressable
– Can address 2^32 bytes of memory (4 GB)
• What good is all that address space if you only have 256 MB of main memory (RAM)?
– How do you run a program that is LARGER than 256 MB?
– Examples: Unreal Tournament (2.4 gigabytes), Neverwinter Nights (2 gigabytes)
• In modern computers it is possible to use more memory than the amount physically available in the system
• Memory not currently being used is temporarily stored on magnetic disk
• Essentially, the main memory acts as a cache for the virtual memory on disk.
Working Set
• Not all of a program needs to be in memory while you are executing it:
– Error handling routines are not called very often.
– Character creation generally only happens at the start of the game… why keep that code easily available?
• Working set is the memory that is consumed at any moment by a program while it is running.
– Includes stack, allocated memory, active instructions, etc.
• Examples:
– Unreal Tournament: 100 MB (requires 128 MB)
– Internet Explorer: 20 MB
Thrashing Diagram
• Why does paging work? Locality model
– Process migrates from one locality to another.
– Localities may overlap.
• Why does thrashing occur? Combined size of the active localities > total memory size
• What should we do?
– Suspend processes!
Thrashing
• If a process does not have enough frames to hold its current working set, the page-fault rate is very high
• Thrashing
– A process is thrashing when it spends more time paging than executing
– With local replacement algorithms, a process may thrash even though memory is available
– With global replacement algorithms, the entire system may thrash (less thrashing in general, but is it fair?)
Program Structure
• How should we arrange memory references to large arrays?
– Is the array stored in row-major or column-major order?
• Example:
– Array A[1024, 1024] of type integer
– Page size = 1K (each row is stored in one page)
– System has one frame
– Program 1: 1024 page faults
    for i := 1 to 1024 do
      for j := 1 to 1024 do
        A[i,j] := 0;
– Program 2: 1024 x 1024 page faults
    for j := 1 to 1024 do
      for i := 1 to 1024 do
        A[i,j] := 0;
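The two fault counts can be reproduced with a small simulation. The sketch below assumes row-major storage, a page that holds exactly one 1024-integer row, and a single frame, as in the example; indices are 0-based here, and `count_page_faults` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

#define N          1024   /* the array is N x N integers */
#define PAGE_WORDS 1024   /* one page holds one row */

/* Counts page faults for a system with a single frame.
   row_by_row = true  walks A[i][j] with j in the inner loop (Program 1).
   row_by_row = false walks A[i][j] with i in the inner loop (Program 2). */
long count_page_faults(bool row_by_row)
{
    long faults = 0;
    long resident = -1;                 /* page currently in the frame, -1 = none */
    for (long outer = 0; outer < N; outer++) {
        for (long inner = 0; inner < N; inner++) {
            long i = row_by_row ? outer : inner;
            long j = row_by_row ? inner : outer;
            long page = (i * N + j) / PAGE_WORDS;   /* row-major layout */
            assert(page >= 0 && page < N);
            if (page != resident) {     /* the needed page is not in the frame */
                faults++;
                resident = page;
            }
        }
    }
    return faults;
}
```

The row-by-row walk faults once per row, while the column-by-column walk changes page on every single access.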
CS Reality #3
Memory Matters
• Memory is not unbounded
– It must be allocated and managed
– Many applications are memory dominated
• Memory referencing bugs are especially pernicious
– Effects are distant in both time and space
• Memory performance is not uniform
– Cache and virtual memory effects can greatly affect program performance
– Adapting a program to the characteristics of the memory system can lead to major speed improvements
IA32 and GCC
• IA32 processors, like most other processors, have special memory elements (registers) for holding floating-point values as they are being computed and used.
• IA32 uses special 80-bit extended precision floating-point registers
• IA32/gcc stores doubles as 64 bits in memory. Numbers are converted (rounded from 80 bits to 64 bits) as they are stored in memory
• Whenever a function call is made, register values may be stored in memory (caller save)
    #include <stdio.h>

    double recip(int denom) {
        return 1.0 / (double) denom;
    }

    void do_nothing(void) {}

    void test(int denom) {
        double r1, r2;
        int t1, t2;
        r1 = recip(denom);
        r2 = recip(denom);
        t1 = (r1 == r2);   /* compared before the intervening call */
        do_nothing();
        t2 = (r1 == r2);   /* compared after the intervening call */
        printf("t1: r1 %c= r2\n", t1 ? '=' : '!');
        printf("t2: r1 %c= r2\n", t2 ? '=' : '!');
    }

Output reported on the slide:
    t1: r1 != r2
    t2: r1 == r2
Memory Referencing Errors
• C and C++ do not provide any memory protection
– Out of bounds array references
– Invalid pointer values
– Abuses of malloc/free
• Can lead to nasty bugs
– Whether or not a bug has any effect depends on system and compiler
– Action at a distance: the corrupted object may be logically unrelated to the one being accessed, and the effect of the bug may be first observed long after it is generated
• How can I deal with this?
– Program in Java, Lisp, or ML
– Understand what possible interactions may occur
– Use or develop tools to detect referencing errors
Memory Performance Example
• Implementations of Matrix Multiplication
– Multiple ways to nest loops

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

    /* jik */
    for (j = 0; j < n; j++) {
        for (i = 0; i < n; i++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
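For comparison, here is a sketch of one of the other loop orders named on the performance slide, kij. It is written as a self-contained function with a fixed, tiny `N` purely so it can be compiled and checked; the function name `matmul_kij` and the value of `N` are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

#define N 2   /* tiny size chosen only for the demonstration */

/* kij ordering: the innermost loop walks b[k][j] and c[i][j] along a row,
   so consecutive iterations touch adjacent memory (row-major storage). */
void matmul_kij(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != NULL && b != NULL && c != NULL);

    /* c accumulates partial sums, so it must start at zero. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;

    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++) {
            double r = a[i][k];          /* reused across the whole inner loop */
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
    }
}
```

All six orders compute the same product; they differ only in the order in which a, b, and c are traversed, which is what the cache sees.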
Matmult Performance
[Figure: multiplications per time unit (y-axis, 0 to 160) versus matrix size n (x-axis) for the loop orders ijk, ikj, jik, jki, kij, and kji; the plot marks the matrix sizes that are too big for the L1 cache and too big for the L2 cache]