Cache Performance Metrics
Page 1: Cache Performance Metrics

– 1 – 15-213, F’02

Cache Performance Metrics

Miss Rate
  Fraction of memory references not found in cache (misses/references)
  Typical numbers:
    3-10% for L1
    Can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
  Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  Typical numbers:
    1 clock cycle for L1
    3-8 clock cycles for L2

Miss Penalty
  Additional time required because of a miss
  Typically 25-100 cycles for main memory
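These three metrics combine in the standard average-memory-access-time formula (a textbook identity, not stated on the slide): average access time = hit time + miss rate × miss penalty. For example, with a 1-cycle hit time, a 3% miss rate, and a 50-cycle miss penalty, the average is 1 + 0.03 × 50 = 2.5 cycles per reference.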

Page 2: Cache Performance Metrics

– 2 – 15-213, F’02

Writing Cache Friendly Code

Repeated references to variables are good (temporal locality)

Stride-1 reference patterns are good (spatial locality)

Examples: cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate for sumarrayrows = 1/4 = 25%: with stride-1 access and 4-word blocks, only the first access to each block misses. Miss rate for sumarraycols = 100%: each step jumps a whole row ahead, so every access lands in a block that is not cached.

Page 3: Cache Performance Metrics

– 3 – 15-213, F’02

The Memory Mountain

Read throughput (read bandwidth)
  Number of bytes read from memory per second (MB/s)

Memory mountain
  Measured read throughput as a function of spatial and temporal locality.
  Compact way to characterize memory system performance.

Page 4: Cache Performance Metrics

– 4 – 15-213, F’02

Memory Mountain Test Function

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
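Unit check (a reasoning step, not spelled out on the slide): the loop touches elems/stride four-byte elements, i.e., size/stride bytes; cycles/Mhz is the elapsed time in microseconds; and bytes per microsecond equal MB/s, so the return value is read throughput in MB/s.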

Page 5: Cache Performance Metrics

– 5 – 15-213, F’02

Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
#include <stdio.h>   /* includes added here; the slide omits them */
#include <stdlib.h>

#define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)   /* ... up to 8 MB */
#define MAXSTRIDE 16         /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS]; /* The array we'll be traversing */

int main()
{
    int size;   /* Working set size (in bytes) */
    int stride; /* Stride (in array elements) */
    double Mhz; /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    Mhz = mhz(0);              /* Estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

Page 6: Cache Performance Metrics

– 6 – 15-213, F’02

The Memory Mountain

[Figure: 3D surface plot of read throughput (MB/s, 0-1200) versus stride (words, s1-s15) and working set size (bytes, 2k-8m). Machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache. Annotations mark the ridges of temporal locality (the L1, L2, and main-memory plateaus) and the slopes of spatial locality along the stride axis.]

Page 7: Cache Performance Metrics

– 7 – 15-213, F’02

Ridges of Temporal Locality

Slice through the memory mountain with stride=1 illuminates read throughputs of different caches and memory

[Figure: bar chart of read throughput (MB/s, 0-1200) versus working set size (1k to 8m) at stride 1, with the L1 cache region, L2 cache region, and main memory region labeled.]

Page 8: Cache Performance Metrics

– 8 – 15-213, F’02

A Slope of Spatial Locality

Slice through the memory mountain with size=256KB shows the cache block size.

[Figure: bar chart of read throughput (MB/s, 0-800) versus stride (words, s1-s16) at a 256 KB working set; throughput falls with increasing stride and levels off once there is one access per cache line.]

Page 9: Cache Performance Metrics

– 9 – 15-213, F’02

Matrix Multiplication Example

Major Cache Effects to Consider
  Total cache size
    Exploit temporal locality and keep the working set small (e.g., by using blocking)
  Block size
    Exploit spatial locality

Description:
  Multiply N x N matrices
  O(N^3) total operations
  Accesses
    N reads per source element
    N values summed per destination
      but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Variable sum held in register

Page 10: Cache Performance Metrics

– 10 – 15-213, F’02

Miss Rate Analysis for Matrix Multiply

Assume:
  Line size = 32B (big enough for 4 64-bit words)
  Matrix dimension (N) is very large
    Approximate 1/N as 0.0
  Cache is not even big enough to hold multiple rows

Analysis Method:
  Look at access pattern of inner loop

[Figure: sketch of the three matrices: A traversed along row i (index k), B traversed down column j (index k), C at fixed element (i,j).]

Page 11: Cache Performance Metrics

– 11 – 15-213, F’02

Layout of C Arrays in Memory (review)

C arrays allocated in row-major order
  each row in contiguous memory locations

Stepping through columns in one row:
  for (i = 0; i < N; i++)
      sum += a[0][i];
  accesses successive elements
  if block size (B) > 4 bytes, exploit spatial locality
    compulsory miss rate = 4 bytes / B

Stepping through rows in one column:
  for (i = 0; i < n; i++)
      sum += a[i][0];
  accesses distant elements
  no spatial locality!
    compulsory miss rate = 1 (i.e. 100%)
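The arithmetic behind these two patterns (a reasoning step, not on the slide): for int a[M][N], element a[i][j] sits at byte offset 4*(i*N + j) from the start of the array. Incrementing j advances 4 bytes, staying within the current cache block; incrementing i advances 4*N bytes, landing in a different block on every access once N is large.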

Page 12: Cache Performance Metrics

– 12 – 15-213, F’02

Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row (i,*) accessed row-wise; B column (*,j) accessed column-wise; C element (i,j) fixed.

Misses per Inner Loop Iteration:
    A     B     C
  0.25   1.0   0.0

Page 13: Cache Performance Metrics

– 13 – 15-213, F’02

Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row (i,*) accessed row-wise; B column (*,j) accessed column-wise; C element (i,j) fixed.

Misses per Inner Loop Iteration:
    A     B     C
  0.25   1.0   0.0

Page 14: Cache Performance Metrics

– 14 – 15-213, F’02

Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A element (i,k) fixed; B row (k,*) accessed row-wise; C row (i,*) accessed row-wise.

Misses per Inner Loop Iteration:
    A     B     C
  0.0   0.25  0.25

Page 15: Cache Performance Metrics

– 15 – 15-213, F’02

Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A element (i,k) fixed; B row (k,*) accessed row-wise; C row (i,*) accessed row-wise.

Misses per Inner Loop Iteration:
    A     B     C
  0.0   0.25  0.25

Page 16: Cache Performance Metrics

– 16 – 15-213, F’02

Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column (*,k) accessed column-wise; B element (k,j) fixed; C column (*,j) accessed column-wise.

Misses per Inner Loop Iteration:
    A     B     C
  1.0   0.0   1.0

Page 17: Cache Performance Metrics

– 17 – 15-213, F’02

Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column (*,k) accessed column-wise; B element (k,j) fixed; C column (*,j) accessed column-wise.

Misses per Inner Loop Iteration:
    A     B     C
  1.0   0.0   1.0

Page 18: Cache Performance Metrics

– 18 – 15-213, F’02

Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
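Reading the summary (an inference from the counts above, not stated on the slide): kij and ikj come out best at 0.5 misses per iteration, ahead of ijk/jik at 1.25 and jki/kji at 2.0, even though kij executes one more store per iteration than ijk; the miss counts, not the instruction counts, dominate running time.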

Page 19: Cache Performance Metrics

– 19 – 15-213, F’02

Improving Temporal Locality by Blocking

Example: Blocked matrix multiplication
  "block" (in this context) does not mean "cache block". Instead, it means a sub-block within the matrix.
  Example: N = 8; sub-block size = 4

    A11 A12     B11 B12     C11 C12
    A21 A22  X  B21 B22  =  C21 C22

  C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
  C21 = A21B11 + A22B21    C22 = A21B12 + A22B22

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

Page 20: Cache Performance Metrics

– 20 – 15-213, F’02

Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
            c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
            for (j=jj; j < min(jj+bsize,n); j++) {
                sum = 0.0;
                for (k=kk; k < min(kk+bsize,n); k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] += sum;
            }
        }
    }
}
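The code above calls min, which is not part of standard C; the slide presumably assumes a small helper along these lines (a minimal sketch, my assumption rather than something shown):

#define min(a, b) ((a) < (b) ? (a) : (b))  /* evaluates each argument twice; harmless here, where the arguments have no side effects */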

Page 21: Cache Performance Metrics

– 21 – 15-213, F’02

Blocked Matrix Multiply Analysis

Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C

Loop over i steps through n row slivers of A & C, using same B

[Figure: A's row sliver is accessed bsize times; B's block is reused n times in succession; successive elements of C's sliver are updated. Indices i, kk, and jj mark the current sliver and block.]

Innermost Loop Pair:

for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}
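Sizing caveat (implied by this analysis, though the slide does not state it): the working set of the loop pair is roughly one bsize x bsize block of B plus 1 x bsize slivers of A and C, so bsize must be small enough that about bsize^2 matrix elements fit in the cache; otherwise B's block is evicted between the n reuses and the benefit disappears.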

Page 22: Cache Performance Metrics

– 22 – 15-213, F’02

Optimizing Compilers

Provide efficient mapping of program to machine
  register allocation
  code selection and ordering
  eliminating minor inefficiencies

Don't (usually) improve asymptotic efficiency
  up to programmer to select best overall algorithm
  big-O savings are (often) more important than constant factors
    but constant factors also matter

Have difficulty overcoming "optimization blockers"
  potential memory aliasing
  potential procedure side-effects

Page 23: Cache Performance Metrics

– 23 – 15-213, F’02

Limitations of Optimizing Compilers

Operate Under Fundamental Constraint
  Must not cause any change in program behavior under any possible condition
  Often prevents optimizations that would only affect behavior under pathological conditions.

Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  e.g., data ranges may be more limited than variable types suggest

Most analysis is performed only within procedures
  whole-program analysis is too expensive in most cases

Most analysis is based only on static information
  compiler has difficulty anticipating run-time inputs

When in doubt, the compiler must be conservative
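A concrete instance of the "data ranges" point above (my example, not the slide's): a compiler cannot replace signed division by a power of two with a bare shift, because the two round differently for negative operands.

int div8(int x)
{
    /* For x = -1: x / 8 == 0 (rounds toward zero), but x >> 3 == -1
       (arithmetic shift rounds toward minus infinity). Unless the
       compiler can prove x >= 0, it must emit extra bias-correction
       code rather than a single shift. */
    return x / 8;
}

If the programmer knows x is never negative, declaring it unsigned lets the compiler use the cheap shift.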

Page 24: Cache Performance Metrics

– 24 – 15-213, F’02

Machine-Independent Optimizations

Optimizations you should do regardless of processor / compiler

Code Motion
  Reduce frequency with which computation performed
    If it will always produce same result
    Especially moving code out of loop

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

Page 25: Cache Performance Metrics

– 25 – 15-213, F’02

Compiler-Generated Code Motion

Most compilers do a good job with array code + simple loop structures

Code Generated by GCC for:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

is equivalent to:

for (i = 0; i < n; i++) {
    int ni = n*i;
    int *p = a+ni;
    for (j = 0; j < n; j++)
        *p++ = b[j];
}

    imull %ebx,%eax          # i*n
    movl 8(%ebp),%edi        # a
    leal (%edi,%eax,4),%edx  # p = a+i*n (scaled by 4)
# Inner Loop
.L40:
    movl 12(%ebp),%edi       # b
    movl (%edi,%ecx,4),%eax  # b+j (scaled by 4)
    movl %eax,(%edx)         # *p = b[j]
    addl $4,%edx             # p++ (scaled by 4)
    incl %ecx                # j++
    jl .L40                  # loop if j<n

Page 26: Cache Performance Metrics

– 26 – 15-213, F’02

Reduction in Strength

Replace costly operation with simpler one
  Shift, add instead of multiply or divide
    16*x  -->  x << 4
  Utility machine dependent
    Depends on cost of multiply or divide instruction
    On Pentium II or III, integer multiply only requires 4 CPU cycles
  Recognize sequence of products

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

Page 27: Cache Performance Metrics

– 27 – 15-213, F’02

Make Use of Registers

Reading and writing registers much faster than reading/writing memory

Limitation
  Compiler not always able to determine whether variable can be held in register
  Possibility of Aliasing
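A classic illustration of the aliasing limitation (a standard textbook example, not shown on this slide):

void twiddle(int *xp, int *yp)
{
    *xp += *yp;  /* if xp == yp, this assignment also changes *yp ... */
    *xp += *yp;  /* ... so the compiler cannot keep *yp in a register
                    and rewrite the pair as *xp += 2 * (*yp) */
}

Because xp and yp may point to the same int, *yp must be re-read from memory after the first statement.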

Page 28: Cache Performance Metrics

– 28 – 15-213, F’02

Machine-Independent Opts. (Cont.)

Share Common Subexpressions
  Reuse portions of expressions
  Compilers often not very sophisticated in exploiting arithmetic properties

/* Sum neighbors of i,j */
up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;

3 multiplications: i*n, (i-1)*n, (i+1)*n

    leal -1(%edx),%ecx  # i-1
    imull %ebx,%ecx     # (i-1)*n
    leal 1(%edx),%eax   # i+1
    imull %ebx,%eax     # (i+1)*n
    imull %ebx,%edx     # i*n

int inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

1 multiplication: i*n

Page 29: Cache Performance Metrics

– 29 – 15-213, F’02

Vector ADT

Procedures
  vec_ptr new_vec(int len)
    Create vector of specified length
  int get_vec_element(vec_ptr v, int index, int *dest)
    Retrieve vector element, store at *dest
    Return 0 if out of bounds, 1 if successful
  int *get_vec_start(vec_ptr v)
    Return pointer to start of vector data

[Figure: a vector object holds a length field and a data pointer to elements 0, 1, 2, ..., length-1.]
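A minimal sketch of the structure behind this picture (my assumption; the slide gives only the diagram):

typedef struct {
    int len;    /* number of elements */
    int *data;  /* points to elements 0 .. len-1 */
} vec_rec, *vec_ptr;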

Page 30: Cache Performance Metrics

– 30 – 15-213, F’02

Optimization Example

Procedure
  Compute sum of all elements of integer vector
  Store result at destination location
  Vector data structure and operations defined via abstract data type

Pentium II/III Performance: Clock Cycles / Element
  42.06 (Compiled -g)
  31.25 (Compiled -O2)

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Page 31: Cache Performance Metrics

– 31 – 15-213, F’02

Understanding Loop Inefficiency

Procedure vec_length called every iteration
Even though result always the same

void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
  loop:
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))  /* vec_length called on every iteration */
        goto loop;
  done:;
}

Page 32: Cache Performance Metrics

– 32 – 15-213, F’02

Move vec_length Call Out of Loop

Optimization
  Move call to vec_length out of inner loop
    Value does not change from one iteration to next
    Code motion
  CPE: 20.66 (Compiled -O2)
    vec_length requires only constant time, but significant overhead

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Page 33: Cache Performance Metrics

– 33 – 15-213, F’02

Code Motion Example #2

Procedure to Convert String to Lower Case

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Page 34: Cache Performance Metrics

– 34 – 15-213, F’02

Lower Case Conversion Performance

Time quadruples when string length doubles
Quadratic performance

[Figure: plot of CPU seconds (0.0001 to 1000, log scale) versus string length (256 to 262144) for lower1, showing quadratic growth.]

Page 35: Cache Performance Metrics

– 35 – 15-213, F’02

Convert Loop To Goto Form

strlen executed every iteration
strlen linear in length of string
  Must scan string until finds '\0'
Overall performance is quadratic

void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
  loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
  done:;
}

Page 36: Cache Performance Metrics

– 36 – 15-213, F’02

Improving Performance

Move call to strlen outside of loop
  Since result does not change from one iteration to another
  Form of code motion

void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Page 37: Cache Performance Metrics

– 37 – 15-213, F’02

Lower Case Conversion Performance

Time doubles when string length doubles
Linear performance

[Figure: plot of CPU seconds (0.000001 to 1000, log scale) versus string length (256 to 262144) for lower1 and lower2; lower2 grows linearly and stays far below lower1's quadratic curve.]

Page 38: Cache Performance Metrics

– 38 – 15-213, F’02

Optimization Blocker: Procedure Calls

Why couldn't the compiler move vec_length or strlen out of the inner loop?
  Procedure may have side effects
    Alters global state each time called
  Function may not return same value for given arguments
    Depends on other parts of global state
    Procedure lower could interact with strlen

Why doesn't compiler look at code for vec_length or strlen?
  Linker may overload with different version
    Unless declared static
  Interprocedural optimization is not used extensively due to cost

Warning:
  Compiler treats procedure call as a black box
  Weak optimizations in and around them

Page 39: Cache Performance Metrics

– 39 – 15-213, F’02

Reduction in Strength

Optimization
  Avoid procedure call to retrieve each vector element
    Get pointer to start of array before loop
    Within loop just do pointer reference
    Not as clean in terms of data abstraction
  CPE: 6.00 (Compiled -O2)
    Procedure calls are expensive!
    Bounds checking is expensive

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

Page 40: Cache Performance Metrics

– 40 – 15-213, F’02

Eliminate Unneeded Memory Refs

Optimization
  Don't need to store in destination until end
  Local variable sum held in register
  Avoids 1 memory read, 1 memory write per iteration
  CPE: 2.00 (Compiled -O2)

Memory references are expensive!

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

Page 41: Cache Performance Metrics

– 41 – 15-213, F’02

Detecting Unneeded Memory Refs.

Performance
  Combine3
    5 instructions in 6 clock cycles
    addl must read and write memory
  Combine4
    4 instructions in 2 clock cycles

Combine3:
.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4:
.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24

Page 42: Cache Performance Metrics

– 42 – 15-213, F’02

Optimization Blocker: Memory Aliasing

Aliasing
  Two different memory references specify single location

Example
  v: [3, 2, 17]
  combine3(v, get_vec_start(v)+2) --> ?
  combine4(v, get_vec_start(v)+2) --> ?
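Working the example through (my arithmetic; the slide leaves it as a question): with *dest aliased to v[2], combine3 stores after every element, so v evolves [3,2,0] -> [3,2,3] -> [3,2,5] -> [3,2,10]; combine4 accumulates 3+2+17 = 22 in a register and stores once, leaving [3,2,22]. The two versions give different answers, which is exactly why the compiler may not perform this transformation on its own.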

Observations
  Easy to have happen in C
    Since allowed to do address arithmetic
    Direct access to storage structures
  Get in habit of introducing local variables
    Accumulating within loops
    Your way of telling compiler not to check for aliasing

Page 43: Cache Performance Metrics

– 43 – 15-213, F’02

Machine-Independent Opt. Summary

Code Motion
  Compilers are good at this for simple loop/array structures
  Don't do well in presence of procedure calls and memory aliasing

Reduction in Strength
  Shift, add instead of multiply or divide
    compilers are (generally) good at this
    Exact trade-offs machine-dependent
  Keep data in registers rather than memory
    compilers are not good at this, since concerned with aliasing

Share Common Subexpressions
  compilers have limited algebraic reasoning capabilities

Page 44: Cache Performance Metrics

– 44 – 15-213, F’02

Important Tools

Measurement
  Accurately compute time taken by code
    Most modern machines have built-in cycle counters
    Using them to get reliable measurements is tricky
  Profile procedure calling frequencies
    Unix tool gprof

Observation
  Generating assembly code
    Lets you see what optimizations compiler can make
    Understand capabilities/limitations of particular compiler

Page 45: Cache Performance Metrics

– 45 – 15-213, F’02

Code Profiling Example

Task
  Count word frequencies in text document
  Produce sorted list of words from most frequent to least

Steps
  Convert strings to lowercase
  Apply hash function
  Read words and insert into hash table
    Mostly list operations
    Maintain counter for each unique word
  Sort results

Data Set
  Works of Shakespeare
  946,596 total words, 26,596 unique
  Initial implementation: 9.2 seconds

Shakespeare's most frequent words:

  29,801  the
  27,529  and
  21,029  I
  20,957  to
  18,514  of
  15,370  a
  14,010  you
  12,936  my
  11,722  in
  11,519  that

Page 46: Cache Performance Metrics

– 46 – 15-213, F’02

Code Profiling

Augment Executable Program with Timing Functions
  Computes (approximate) amount of time spent in each function
  Time computation method
    Periodically (~ every 10ms) interrupt program
    Determine what function is currently executing
    Increment its timer by interval (e.g., 10ms)
  Also maintains counter for each function indicating number of times called

Using
  gcc -O2 -pg prog.c -o prog
  ./prog
    Executes in normal fashion, but also generates file gmon.out
  gprof prog
    Generates profile information based on gmon.out

Page 47: Cache Performance Metrics

– 47 – 15-213, F’02

Profiling Results

Call Statistics
  Number of calls and cumulative time for each function

Performance Limiter
  Using inefficient sorting algorithm
  Single call uses 87% of CPU time

  %    cumulative   self                self      total
 time    seconds   seconds     calls   ms/call   ms/call   name
 86.60      8.21      8.21         1   8210.00   8210.00   sort_words
  5.80      8.76      0.55    946596      0.00      0.00   lower1
  4.75      9.21      0.45    946596      0.00      0.00   find_ele_rec
  1.27      9.33      0.12    946596      0.00      0.00   h_add

Page 48: Cache Performance Metrics

– 48 – 15-213, F’02

Code Optimizations

First step: Use more efficient sorting function
  Library function qsort

[Figure: stacked bar chart of CPU seconds (0-10) for versions Initial, Quicksort, Iter First, Iter Last, Big Table, Better Hash, Linear Lower, broken down into Rest, Hash, Lower, List, and Sort components.]

Page 49: Cache Performance Metrics

– 49 – 15-213, F’02

Further Optimizations

Iter first: Use iterative function to insert elements into linked list
  Causes code to slow down
Iter last: Iterative function, places new entry at end of list
  Tend to place most common words at front of list
Big table: Increase number of hash buckets
Better hash: Use more sophisticated hash function
Linear lower: Move strlen out of loop

[Figure: the same stacked bar chart rescaled to 0-2 CPU seconds, showing the incremental effect of each optimization after Quicksort.]

Page 50: Cache Performance Metrics

– 50 – 15-213, F’02

Profiling Observations

Benefits
  Helps identify performance bottlenecks
  Especially useful when have complex system with many components

Limitations
  Only shows performance for data tested
    E.g., linear lower did not show big gain, since words are short
    Quadratic inefficiency could remain lurking in code
  Timing mechanism fairly crude
    Only works for programs that run for > 3 seconds

