
Midterm review


Announcement

• Midterm
  – Time: Oct. 27th, 7-9PM
  – Location: GB 404 & GB 405
  – Policy: closed book
  – Coverage: lec. 1 – lec. 7 (dynamic memory management)

Topics covered in midterm

• Latency vs. throughput
• CPU architecture
• Profiling
• Compiler and optimization
• Memory performance
  – Memory hierarchy
  – Optimize for cache
  – Virtual memory
• Dynamic memory management


CPU architecture: key techniques

Year  Processor    Tech.               CPI
1971  4004         no pipeline         n
1985  386          pipeline            close to 1
                   branch prediction   closer to 1
1993  Pentium      superscalar         < 1
1995  PentiumPro   out-of-order exec.  << 1
1999  Pentium III  deep pipeline       (shorter cycle)
2000  Pentium IV   SMT                 <<< 1

Profiling

• Why do we need profiling?
• Amdahl's law: speedup = OldTime / NewTime
• Example problem: if an optimization makes loops go 4 times faster, and applying the optimization to my program makes it go twice as fast, what fraction of my program is loops?

Solution:
  looptime = x * oldtime
  newtime = looptime/4 + othertime = x*oldtime/4 + (1-x)*oldtime
  speedup = oldtime/newtime = 1/(x/4 + 1-x) = 1/(1 - 0.75x) = 2
  ⇒ 1 - 0.75x = 1/2, so x = 2/3: loops account for two thirds of the original run time.
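A quick check of the algebra in C (the numbers simply restate the example above):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - x) + x/k), where x is the fraction
 * of time in the optimized part and k is its speedup. */
int main(void) {
    double x = 2.0 / 3.0;   /* fraction of time in loops (solved above) */
    double k = 4.0;         /* loops run 4x faster */
    double speedup = 1.0 / ((1.0 - x) + x / k);
    printf("overall speedup = %.2f\n", speedup);   /* prints 2.00 */
    return 0;
}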

Profiling tools

• We discussed quite a few of them:
  – /usr/bin/time
  – get_seconds()
  – get_tsc()
  – gprof, gcov, valgrind
• Important things:
  – What info does each tool provide?
  – What are the limitations?
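A minimal timing sketch (get_seconds() and get_tsc() are the course's helpers; this approximates them with clock_gettime() and the __rdtsc() intrinsic, so it assumes GCC/Clang on x86):

#include <stdio.h>
#include <time.h>
#include <x86intrin.h>                     /* __rdtsc() */

/* Wall-clock seconds, analogous to get_seconds(). */
static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    double t0 = seconds();
    unsigned long long c0 = __rdtsc();     /* cycle counter, like get_tsc() */
    volatile double sink = 0;
    for (int i = 0; i < 10000000; i++)     /* the work being measured */
        sink += i;
    printf("%.4f s, %llu cycles\n", seconds() - t0, __rdtsc() - c0);
    return 0;
}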

Compiler and optimization

• Machine independent optimizations (illustrated in the sketch below):
  – Constant propagation
  – Constant folding
  – Common subexpression elimination
  – Dead code elimination
  – Loop-invariant code motion
  – Function inlining
• Machine dependent (apply differently to different CPUs):
  – Loop unrolling
• What are the blockers for compiler optimization?
• What are the trade-offs for each optimization?
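An illustrative sketch (not from the slides) of what several machine-independent optimizations do:

/* What the compiler sees: */
int before(int *a, int n) {
    int x = 3 * 4;               /* constant folding: becomes x = 12 */
    int unused = n * n;          /* dead code elimination: removed */
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * (x + n);     /* loop-invariant code motion:
                                    x + n is hoisted out of the loop */
    return s;
}

/* Roughly what it produces: */
int after(int *a, int n) {
    int t = 12 + n;              /* folded constant + hoisted invariant */
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * t;
    return s;
}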

Q9 from midterm 2013

Consider the following functions:

int max(int x, int y) { return x < y ? y : x; }
void incr(int *xp, int v) { *xp += v; }
int add(int i, int j) { return i + j; }

The following code fragment calls these functions:

int max_sum(int m, int n) {          /* m and n are large integers */
    int i;
    int sum = 0;

    for (i = 0; i < max(m, n); incr(&i, 1)) {
        sum = add(data[i], sum);     /* data is an integer array */
    }
    return sum;
}

A) Identify all of the optimization opportunities in this code and explain each one. Also discuss whether each can be performed by the compiler. (6 marks)
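One possible answer, sketched as code (illustrative; how much of it a compiler can do on its own depends on whether max, incr, and add are visible for inlining):

/* max(m, n) is loop-invariant, so hoist it (code motion); incr() and
 * add() are tiny, so inline them to remove call overhead. The repeated
 * max() call in the loop test and the update of i through the pointer
 * &i are the main blockers in the original form. */
int max_sum(int m, int n) {
    int bound = m < n ? n : m;       /* hoisted, inlined max(m, n) */
    int sum = 0;
    for (int i = 0; i < bound; i++)  /* inlined incr(&i, 1) */
        sum += data[i];              /* inlined add(data[i], sum) */
    return sum;
}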

Loop unrolling for machine dependent optimization

void vsum4(vec_ptr v, int *dest) {
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

void vsum5(vec_ptr v, int *dest) {
    int length = vec_length(v);
    int limit = length - 2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    for (i = 0; i < limit; i += 3) {   /* unrolled by 3 */
        sum += data[i];
        sum += data[i+1];
        sum += data[i+2];
    }
    for (; i < length; i++) {          /* finish the leftover elements */
        sum += data[i];
    }
    *dest = sum;
}

Why can loop unrolling help?


[Dataflow diagram: execution without loop unrolling. Each iteration performs a load plus 4 integer ops (addl, incl, cmpl, jl); the addl of each iteration depends on the previous iteration's running sum (%ecx), so iterations 1-4 serialize across cycles 1-7.]
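Unrolling pays the incl/cmpl/jl overhead once per group of elements instead of once per element. Going one step beyond the vsum5 on the slide (vsum6 is a hypothetical name): with two accumulators, the additions form two independent dependency chains that can execute in parallel.

void vsum6(vec_ptr v, int *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int sum0 = 0, sum1 = 0;            /* two independent chains */
    int i;
    for (i = 0; i < limit; i += 2) {
        sum0 += data[i];
        sum1 += data[i+1];
    }
    for (; i < length; i++)            /* leftover element, if any */
        sum0 += data[i];
    *dest = sum0 + sum1;
}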

Q11@midterm 2013

Memory performance: cache

• Motivation:
  – L1 cache reference: 0.5 ns
  – Main memory reference: 100 ns (200× slower!)

Why Caches Work

• Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
• Temporal locality: recently referenced items are likely to be referenced again in the near future
• Spatial locality: items with nearby addresses tend to be referenced close together in time (both forms are illustrated in the sketch below)

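The standard illustration (a sketch, not from the slide):

/* Summing an array exhibits both kinds of locality. */
int sum_array(int *a, int n) {
    int sum = 0;               /* sum and i: temporal locality,
                                  reused on every iteration */
    for (int i = 0; i < n; i++)
        sum += a[i];           /* a[0], a[1], ...: spatial locality,
                                  sequential stride-1 accesses */
    return sum;
}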

Cache hierarchy: example 1

Direct mapped: one block per set. Assume a cache block size of 8 bytes; the cache has S = 2^s sets.

Address of int: | t tag bits | set index 0…01 | block offset 100 |

[Figure: each set holds one line with a valid bit, a tag, and bytes 0-7 of the block; the set-index bits select ("find") the set.]

E-way Set Associative Cache (E = 2)

E = 2: two lines per set. Assume a cache block size of 8 bytes.

Address of short int: | t tag bits | set index 0…01 | block offset 100 |

[Figure: each set holds two lines, each with a valid bit, a tag, and bytes 0-7; the set-index bits select the set.]

E-way Set Associative Cache (E = 2)

E = 2: two lines per set. Assume a cache block size of 8 bytes.

Address of short int: | t tag bits | set index 0…01 | block offset 100 |

Within the selected set, compare the tag against both lines: valid bit set and tag match = hit. The block offset (100 = byte 4) then locates the short int (2 bytes) within the 8-byte block.

No match:
• One line in the set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
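A sketch of the address decomposition (b = 3 matches the 8-byte blocks above; s = 2 sets bits is an assumed example value):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const int b = 3, s = 2;                            /* 8-byte blocks, 4 sets */
    uintptr_t addr   = 0x74;                           /* example address */
    uintptr_t offset = addr & ((1u << b) - 1);         /* low b bits */
    uintptr_t set    = (addr >> b) & ((1u << s) - 1);  /* next s bits */
    uintptr_t tag    = addr >> (b + s);                /* remaining high bits */
    printf("tag=%#lx set=%lu offset=%lu\n",            /* 0x74 -> tag 0x3, set 2, offset 4 */
           (unsigned long)tag, (unsigned long)set, (unsigned long)offset);
    return 0;
}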

Cache miss analysis on matrix mult.

[Figure: c[i][j] += (row i of a) · (column j of b).]

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b (row-major, flat arrays) */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

First iteration: how many misses?
• Assume a cache block holds 8 doubles ("8 wide").
• The row of a is read with stride 1: n/8 misses.
• The column of b is read with stride n: n misses.
• Total: n/8 + n = 9n/8 misses.

Second iteration: same pattern, another n/8 + n = 9n/8 misses.

Total misses (entire mmm): 9n/8 × n² = (9/8)n³

Tiled Matrix Multiplication

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b, T x T tiles */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += T)
        for (j = 0; j < n; j += T)
            for (k = 0; k < n; k += T)
                /* T x T mini matrix multiplications */
                for (i1 = i; i1 < i+T; i1++)
                    for (j1 = j; j1 < j+T; j1++)
                        for (k1 = k; k1 < k+T; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

[Figure: a T x T tile of c (at i1, j1) accumulates the product of a T x T block of a and a T x T block of b. Tile size T x T.]

Virtual memory

[Figure: address translation with a TLB. (1) The CPU sends a virtual address (VA) to the MMU on the CPU chip; (2) the MMU looks the VA up in the TLB; (3) on a TLB miss, the PTE is fetched from the page table in memory; (4) the MMU sends the physical address (PA) to cache/memory; (5) the data is returned to the CPU.]
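A sketch of the VA → PA split (the 4 KB page size is an assumption, not from the slides):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const int page_bits = 12;                       /* 4 KB pages */
    uintptr_t va  = 0x12345;                        /* example VA */
    uintptr_t vpn = va >> page_bits;                /* virtual page number: 0x12 */
    uintptr_t vpo = va & ((1u << page_bits) - 1);   /* page offset: 0x345 */
    uintptr_t ppn = 0x40;                           /* pretend the PTE maps VPN -> 0x40 */
    uintptr_t pa  = (ppn << page_bits) | vpo;       /* PA keeps the same offset */
    printf("vpn=%#lx vpo=%#lx pa=%#lx\n",
           (unsigned long)vpn, (unsigned long)vpo, (unsigned long)pa);
    return 0;
}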

Complete data reference analysis

• Q7@midterm 2013

Dynamic mem. management

• Alignment
  – What is alignment? Why align? (see the sketch below)
• Q6@midterm 2013
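A struct-padding sketch (offsets assume a typical 32/64-bit ABI where int is 4-byte aligned):

#include <stdio.h>
#include <stddef.h>

struct s {
    char c;     /* offset 0, then 3 bytes of padding */
    int  i;     /* offset 4: ints must be 4-byte aligned */
    char d;     /* offset 8, then 3 bytes of tail padding so
                   arrays of struct s keep every i aligned */
};

int main(void) {
    printf("offsetof(i) = %zu, sizeof(struct s) = %zu\n",
           offsetof(struct s, i), sizeof(struct s));   /* typically 4 and 12 */
    return 0;
}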

malloc/free

• How do we know how much memory to free, given just a pointer?
• How do we keep track of the free blocks?
• How do we pick a block to use for allocation when many might fit?
• How do we reinsert a freed block?

Keeping Track of Free Blocks

• Method 1: Implicit list using lengths -- links all blocks (see the header sketch below)
• Method 2: Explicit list among the free blocks, using pointers within the free blocks
• Method 3: Segregated free list -- different free lists for different size classes

[Figures: the implicit list threads every block, allocated or free, by its length; the explicit list links only the free blocks A, B, C through predecessor and successor pointers stored in their payloads.]
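A header sketch in the implicit-list style (the macro names follow the CS:APP allocator convention but are illustrative here): each block is preceded by a size word with the allocated bit packed into its low bit, which is how free(p) knows how much memory the block covers.

#include <stdint.h>

#define WSIZE 4                                 /* header size in bytes */
#define PACK(size, alloc)  ((size) | (alloc))   /* size + allocated bit */
#define GET(p)             (*(uint32_t *)(p))
#define GET_SIZE(p)        (GET(p) & ~0x7u)     /* sizes are 8-byte multiples */
#define GET_ALLOC(p)       (GET(p) & 0x1u)      /* low bit: allocated? */
#define HDRP(bp)           ((char *)(bp) - WSIZE)               /* payload -> header */
#define NEXT_BLKP(bp)      ((char *)(bp) + GET_SIZE(HDRP(bp)))  /* walk the list */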

Final remarks

• Time/location/policy:
  – Time: Oct. 27th, 7-9PM
  – Location: GB 404 & GB 405
  – Policy: closed book
• Make sure you understand the practice midterm
• Best of luck!

