CS 3214 Computer Systems
Godmar Back
Lecture 11
Announcements
• Stay tuned for Exercise 5
• Project 2 due Sep 30
• Auto-fail rule 2:
  – Need at least Firecracker to blow up to pass class.
CS 3214 Fall 2010
CODE OPTIMIZATION, Part 4
Some of the following slides are taken with permission from the Complete PowerPoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)
Randal E. Bryant and David R. O'Hallaron
http://csapp.cs.cmu.edu/public/lectures.html
MEMORY HIERARCHIES
Locality
• Principle of Locality:
– Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
– Temporal locality: Recently referenced items are likely to be referenced in the near future.
– Spatial locality: Items with nearby addresses tend to be referenced close together in time.
Locality Example
• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
The CPU-Memory Gap
• The increasing gap between DRAM, disk, and CPU speeds.
[Chart: access time in ns (log scale, 1 to 100,000,000) vs. year (1980-2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]
Memory Hierarchies
• Motivated by some fundamental and enduring properties of hardware and software:
  – Fast storage technologies cost more per byte and have less capacity.
  – The gap between CPU and main memory speed is widening.
  – Well-written programs tend to exhibit good locality.
• These fundamental properties complement each other beautifully.
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom:
  L0: registers — CPU registers hold words retrieved from the L1 cache.
  L1: on-chip L1 cache (SRAM) — holds cache lines retrieved from the L2 cache.
  L2: off-chip L2 cache (SRAM) — holds cache lines retrieved from main memory.
  L3: main memory (DRAM) — holds disk blocks retrieved from local disks.
  L4: local secondary storage (local disks) — holds files retrieved from disks on remote network servers.
  L5: remote secondary storage (distributed file systems, Web servers)
Caches
• Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
• Fundamental idea of a memory hierarchy:
  – For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
• Why do memory hierarchies work?
  – Programs tend to access the data at level k more often than they access the data at level k+1.
  – Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
  – Net effect: A large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
Caching in a Memory Hierarchy
[Figure: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks 0-15; the smaller, faster, more expensive device at level k caches a subset of those blocks (here 8, 9, 14, 3). Data is copied between levels in block-sized transfer units.]
General Caching Concepts
• Program needs object d, which is stored in some block b.
• Cache hit
  – Program finds b in the cache at level k. E.g., block 14.
• Cache miss
  – b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
  – If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
    • Placement policy: where can the new block go? E.g., b mod 4
    • Replacement policy: which block should be evicted? E.g., LRU
[Figure: level k initially holds blocks 14, 9, 3, and 4*; a request for block 14 hits; a request for block 12 misses, so block 12 is fetched from level k+1 (blocks 0-15) and replaces block 4*.]
Types of cache misses
• Cold (compulsory) miss
  – Cold misses occur because the cache is empty.
• Conflict miss
  – Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  – E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
  – Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  – E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
• Capacity miss
  – Occurs when the set of active cache blocks (working set) is larger than the cache.
Cache Performance Metrics
• Miss Rate
  – Fraction of memory references not found in cache (misses/references)
  – Typical numbers:
    • 3-10% for L1
    • can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
  – Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  – Typical numbers:
    • 1 clock cycle for L1
    • 3-8 clock cycles for L2
• Miss Penalty
  – Additional time required because of a miss
    • Typically 25-100 cycles for main memory
• Q.: What is the average access time?
Direct-mapped vs. Set Associative Caches
• The more lines there are available to hold a block of data, the less likely the chance for conflict misses
• Direct-mapped caches
  – Exactly 1 location
• N-way associative
  – Each set has N lines in which to hold a block
• Fully associative
  – Block can be held in any line
Examples of Caching in the Hierarchy

Cache Type            What Cached           Where Cached          Latency (cycles)  Managed By
Registers             4-byte word           CPU registers         0                 Compiler
TLB                   Address translations  On-Chip TLB           0                 Hardware
L1 cache              32-byte block         On-Chip L1            1                 Hardware
L2 cache              32-byte block         Off-Chip L2           10                Hardware
Virtual Memory        4-KB page             Main memory           100               Hardware+OS
Buffer cache          Parts of files        Main memory           100               OS
Network buffer cache  Parts of files        Local disk            10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk            10,000,000        Web browser
Web cache             Web pages             Remote server disks   1,000,000,000     Web proxy server
Locality Example
• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
• Question: Which of these functions has good locality?

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples: cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
/* Miss rate = 1/4 = 25% */

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
/* Miss rate = 100% */
Locality Example (2)
• Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
Locality Example (3)
• Question: Which of these two exhibits better spatial locality?

/* struct of arrays */
struct soa {
    float *x;
    float *y;
    float *r;
};

compute_r(struct soa s)
{
    for (i = 0; …) {
        s.r[i] = s.y[i] * s.y[i] +
                 s.x[i] * s.x[i];
    }
}

/* array of structs */
struct soa {
    float x;
    float y;
    float r;
};

compute_r(struct soa *s)
{
    for (i = 0; …) {
        s[i].r = s[i].x * s[i].x +
                 s[i].y * s[i].y;
    }
}
Locality Example (4)
• Question: Which of these two exhibits better spatial locality?

/* struct of arrays */
struct soa {
    float *x;
    float *y;
    float *r;
};

sum_r(struct soa s)
{
    sum = 0;
    for (i = 0; …) {
        sum += s.r[i];
    }
}

/* array of structs */
struct soa {
    float x;
    float y;
    float r;
};

sum_r(struct soa *s)
{
    sum = 0;
    for (i = 0; …) {
        sum += s[i].r;
    }
}
Locality Example (5)
• Question: Which of these two exhibits better spatial locality?

/* struct of arrays */
struct soaa {
    float *x;
    float *y;
    float *r;
};

struct soa
get_xyr(struct soaa s, int i)
{
    return (struct soa){ .x = s.x[i], .y = s.y[i], .r = s.r[i] };
}

/* array of structs */
struct soa {
    float x;
    float y;
    float r;
};

struct soa
get_xyr(struct soa *s, int i)
{
    return s[i];
}
The Memory Mountain
• Read throughput (read bandwidth)
  – Number of bytes read from memory per second (MB/s)
• Memory mountain
  – Measured read throughput as a function of spatial and temporal locality.
  – Compact way to characterize memory system performance.
Memory Mountain Test Function

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                      /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);   /* call test(elems, stride) */
    return (size / stride) / (cycles / Mhz);  /* convert cycles to MB/s */
}
Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)  /* ... up to 8 MB */
#define MAXSTRIDE 16        /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];         /* The array we'll be traversing */

int main()
{
    int size;        /* Working set size (in bytes) */
    int stride;      /* Stride (in array elements) */
    double Mhz;      /* Clock frequency */

    init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
    Mhz = mhz(0);               /* Estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}
The Memory Mountain
[Figure: read throughput (MB/s, 0-1200) as a function of stride (s1-s15, words) and working set size (2k-8m bytes), measured on a Pentium III Xeon, 550 MHz, with a 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, and 512 KB off-chip unified L2 cache. Ridges of temporal locality mark the L1, L2, and main memory regions; slopes of spatial locality run along the stride axis.]
Ridges of Temporal Locality
• Slice through the memory mountain with stride=1
  – illuminates read throughputs of different caches and memory
[Chart: read throughput (MB/s, 0-1200) vs. working set size (8m down to 1k bytes), showing distinct L1 cache, L2 cache, and main memory regions.]
A Slope of Spatial Locality
• Slice through the memory mountain with size=256KB
  – shows cache block size.
[Chart: read throughput (MB/s, 0-800) vs. stride (s1-s16, words); throughput falls as the stride grows until there is one access per cache line.]
Matrix Multiplication Example
• Major Cache Effects to Consider
  – Total cache size
    • Exploit temporal locality and keep the working set small (e.g., by using blocking)
  – Block size
    • Exploit spatial locality
• Description:
  – Multiply N x N matrices
  – O(N^3) total operations
  – Accesses
    • N reads per source element
    • N values summed per destination
      – but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;            /* variable sum held in register */
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
Miss Rate Analysis for Matrix Multiply
• Assume:
  – Line size = 32B (big enough for 4 64-bit words)
  – Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  – Cache is not even big enough to hold multiple rows
• Analysis Method:
  – Look at access pattern of inner loop
[Figure: inner-loop access patterns over A (along row i, indexed by k), B (down column j, indexed by k), and C (element i,j).]
Layout of C Arrays in Memory (review)
• C arrays allocated in row-major order
  – each row in contiguous memory locations
• Stepping through columns in one row:
  – for (i = 0; i < N; i++)
        sum += a[0][i];
  – accesses successive elements
  – if block size (B) > 4 bytes, exploit spatial locality
    • compulsory miss rate = 4 bytes / B
• Stepping through rows in one column:
  – for (i = 0; i < n; i++)
        sum += a[i][0];
  – accesses distant elements
  – no spatial locality!
    • compulsory miss rate = 1 (i.e. 100%)
Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed

Misses per Inner Loop Iteration:
  A     B    C
  0.25  1.0  0.0
Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed

Misses per Inner Loop Iteration:
  A     B    C
  0.25  1.0  0.0
Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise

Misses per Inner Loop Iteration:
  A    B     C
  0.0  0.25  0.25
Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise

Misses per Inner Loop Iteration:
  A    B     C
  0.0  0.25  0.25
Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise

Misses per Inner Loop Iteration:
  A    B    C
  1.0  0.0  1.0
Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise

Misses per Inner Loop Iteration:
  A    B    C
  1.0  0.0  1.0
Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
Pentium Matrix Multiply Performance
• Miss rates are helpful but not perfect predictors.
• Combination of miss rate & loads/stores
[Chart: cycles/iteration (0-60) vs. array size n (25-400) for the six loop orders: kji, jki, kij, ikj, jik, ijk.]
Improving Temporal Locality by Blocking
• Example: Blocked matrix multiplication
  – "block" (in this context) does not mean "cache block".
  – Instead, it means a sub-block within the matrix.
  – Example: N = 8; sub-block size = 4

  [A11 A12]   [B11 B12]   [C11 C12]
  [A21 A22] x [B21 B22] = [C21 C22]

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

  C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

Update order:
  C11 = 0; C12 = 0; C21 = 0; C22 = 0
  C11 += A11*B11;  C21 += A21*B11;  C11 += A12*B21;  C21 += A22*B21
  C12 += A11*B12;  C22 += A21*B12;  C12 += A12*B22;  C22 += A22*B22
Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}
Blocked Matrix Multiply Analysis
– Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
– Loop over i steps through n row slivers of A & C, using the same B

Innermost Loop Pair:
for (i=0; i<n; i++) {
  for (j=jj; j < min(jj+bsize,n); j++) {
    sum = 0.0;
    for (k=kk; k < min(kk+bsize,n); k++) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] += sum;
  }
}

[Figure: a row sliver of A accessed bsize times, a block of B reused n times in succession, and successive elements of a sliver of C being updated.]
Intuition For Blocking
• Matrix multiply takes n^3 multiplications
• Every value of A is used in n multiplications (there are n^2 of them)
• Every value of B is used in n multiplications (ditto)
• Flop-to-memory ratio: n^3 / (n^2 + n^2 + n^2)
• Note access sequence for B:
  – b[k][j], b[k+1][j], ..., b[k+bsize-1][j],
    b[k][j+1], b[k+1][j+1], ..., b[k+bsize-1][j+1],
    ...
    b[k][j+bsize-1], b[k+1][j+bsize-1], ..., b[k+bsize-1][j+bsize-1]
[Figure: access pattern for arrays A, B, C in C = A * B, unblocked (ijk)]
[Figure: access pattern for arrays A, B, C in C = A * B, blocked (bijk)]
Pentium Blocked Matrix Multiply Performance
• Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)
  – relatively insensitive to array size.
[Chart: cycles/iteration (0-60) vs. array size n for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]
Concluding Observations
• Programmer can optimize for cache performance
  – How data structures are organized
  – How data are accessed
    • Nested loop structure
    • Blocking is a general technique
• All systems favor "cache friendly code"
  – Getting absolute optimum performance is very platform specific
    • Cache sizes, line sizes, associativity, etc.
  – Can get most of the advantage with generic code
    • Keep working set reasonably small (temporal locality)
    • Use small strides (spatial locality)
Sparse Problems
• Dense matrix multiply is suitable for blocking because of its high flop-to-memory ratio
• Some problems aren't. Those problems are much tougher to optimize.
• Optimizing Sparse Matrix Vector Multiply
  – See Belgin, Back, Ribbens ICS '09