Page 1: Cache Memory

1

Cache Memory

Page 2: Cache Memory

2

Outline

• Memory mountain

• Matrix multiplication

• Suggested Reading: 6.6, 6.7

Page 3: Cache Memory

3

6.6 Putting it Together: The Impact of Caches on Program Performance

6.6.1 The Memory Mountain

Page 4: Cache Memory

4

The Memory Mountain P512

• Read throughput (read bandwidth)
  – The rate at which a program reads data from the memory system

• Memory mountain
  – A two-dimensional function of read bandwidth versus temporal and spatial locality
  – Characterizes the capabilities of the memory system for each computer

Page 5: Cache Memory

5

Memory mountain main routine
Figure 6.41 P513

/* mountain.c - Generate the memory mountain. */

#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */

#define MAXBYTES (1 << 23) /* ... up to 8 MB */

#define MAXSTRIDE 16 /* Strides range from 1 to 16 */

#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS]; /* The array we'll be traversing */

Page 6: Cache Memory

6

Memory mountain main routine

int main()

{

int size; /* Working set size (in bytes) */

int stride; /* Stride (in array elements) */

double Mhz; /* Clock frequency */

init_data(data, MAXELEMS); /* Initialize each element in data to 1 */

Mhz = mhz(0); /* Estimate the clock frequency */

Page 7: Cache Memory

7

Memory mountain main routine

for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {

for (stride = 1; stride <= MAXSTRIDE; stride++)

printf("%.1f\t", run(size, stride, Mhz));

printf("\n");

}

exit(0);

}
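
main() calls three helpers that these slides do not list: init_data(), mhz(), and fcyc2(). The timing routines mhz() and fcyc2() belong to the book's measurement code and are not reproduced here; init_data() is simple enough that a minimal sketch, matching the comment in main(), might look like this (a hypothetical version, not the book's source):

/* Hypothetical init_data sketch: set every element of the
   traversal array to 1, as the comment in main() describes. */
void init_data(int *data, int n)
{
    int i;

    for (i = 0; i < n; i++)
        data[i] = 1;
}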

Page 8: Cache Memory

8

Memory mountain test function
Figure 6.40 P512

/* The test function */

void test (int elems, int stride) {

int i, result = 0;

volatile int sink;

for (i = 0; i < elems; i += stride)

result += data[i];

sink = result; /* So compiler doesn't optimize away the loop */

}

Page 9: Cache Memory

9

Memory mountain test function

/* Run test (elems, stride) and return read throughput (MB/s) */

double run (int size, int stride, double Mhz)

{

double cycles;

int elems = size / sizeof(int);

test (elems, stride); /* warm up the cache */

cycles = fcyc2(test, elems, stride, 0); /* call test (elems,stride) */

return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */

}
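
The conversion in the last line of run() works out as follows: test() touches about elems / stride = size / (4 * stride) ints, i.e. size / stride bytes, and cycles / Mhz is the elapsed time in microseconds, so

  (size / stride) bytes ÷ (cycles / Mhz) µs = bytes per microsecond = MB/s (10^6 bytes per second)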

Page 10: Cache Memory

10

The Memory Mountain

• Data
  – Size: MAXBYTES (8 MB) = MAXELEMS (2 M) words (see the arithmetic below)
  – Partially accessed
    • Working set: from 8 MB down to 1 KB
    • Stride: from 1 to 16
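
Both figures follow from the #defines in mountain.c:

  MAXBYTES = 1 << 23 = 2^23 bytes = 8 MB
  MAXELEMS = MAXBYTES / sizeof(int) = 2^23 / 4 = 2^21 ≈ 2 M words (4-byte ints)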

Page 11: Cache Memory

11

The Memory Mountain
Figure 6.42 P514

[3-D surface plot: read throughput (MB/s, 0-1200) versus stride (words, 1-16) and working set size (bytes, 2 KB to 8 MB). Measured on a Pentium III Xeon, 550 MHz, with 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, and 512 KB off-chip unified L2 cache. Ridges of temporal locality mark the L1, L2, and main-memory regions; slopes of spatial locality run along the stride axis.]

Page 12: Cache Memory

12

Ridges of temporal locality

• Slice through the memory mountain with stride = 1
  – Illuminates the read throughputs of the different caches and of main memory

Page 13: Cache Memory

13

Ridges of temporal locality
Figure 6.43 P515

[Bar chart: read throughput (MB/s, 0-1200) versus working set size (8 MB down to 1 KB) at stride 1, with the main memory, L2 cache, and L1 cache regions marked.]

Page 14: Cache Memory

14

A slope of spatial locality

• Slice through the memory mountain with size = 256 KB
  – Shows the cache block size (a rough estimate follows below)
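
A rough estimate of the shape of this slice (assuming the 32-byte blocks and 4-byte ints used throughout this example): a stride of s words skips ahead 4·s bytes between reads, so

  misses per access ≈ min(4·s, 32) / 32

which rises from 0.125 at s = 1 to 1.0 at s = 8; beyond that every read misses (one access per cache line) and throughput levels off.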

Page 15: Cache Memory

15

A slope of spatial locality
Figure 6.44 P516

[Bar chart: read throughput (MB/s, 0-800) versus stride (words, s1-s16) for a 256 KB working set; throughput falls with increasing stride until there is one access per cache line.]

Page 16: Cache Memory

16

6.6 Putting it Together: The Impact of Caches on Program Performance

6.6.2 Rearranging Loops to Increase Spatial Locality

Page 17: Cache Memory

17

Matrix Multiplication P517

  [ c11 c12 ]   [ a11 a12 ]   [ b11 b12 ]
  [ c21 c22 ] = [ a21 a22 ] × [ b21 b22 ]

  c11 = a11·b11 + a12·b21    c12 = a11·b12 + a12·b22
  c21 = a21·b11 + a22·b21    c22 = a21·b12 + a22·b22

Page 18: Cache Memory

18

Matrix Multiplication Implementation
Figure 6.45 (a) P518

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    c[i][j] = 0.0;
    for (k=0; k<n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
}

• O(n^3) adds and multiplies
• Each of the n^2 elements of A and B is read n times

Page 19: Cache Memory

19

Matrix Multiplication P517

• Assumptions:– Each array is an nn array of double, with size 8

– There is a single cache with a 32-byte block size ( B=32 )

– The array size n is so large that a single matrix row does not fit in the L1 cache

– The compiler stores local variables in registers, and thus references to local variables inside loops do not require any load and store instructions.
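
With 8-byte doubles and 32-byte blocks, each block holds 32 / 8 = 4 elements, so:

  row-wise (stride-1) scan:     1 miss per 4 accesses = 0.25 misses/iteration
  column-wise scan (large n):   1 miss per access     = 1.0 misses/iteration
  value kept in a register:     0.0 misses/iteration

These are the per-array numbers quoted on the following slides.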

Page 20: Cache Memory

20

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Variable sum held in register

Matrix Multiplication
Figure 6.45 (a) P518

Page 21: Cache Memory

21

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A row (i,*) accessed row-wise; B column (*,j) accessed column-wise; C element (i,j) fixed

• Misses per inner loop iteration:
    A      B      C
    0.25   1.0    0.0

Matrix multiplication (ijk)

Figure 6.46 P519

1) (AB)

Page 22: Cache Memory

22

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A row (i,*) accessed row-wise; B column (*,j) accessed column-wise; C element (i,j) fixed

• Misses per inner loop iteration:
    A      B      C
    0.25   1.0    0.0

Matrix multiplication (jik)
Figure 6.45 (b) P518

Figure 6.46 P519

1) (AB)

Page 23: Cache Memory

23

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A element (i,k) fixed; B row (k,*) accessed row-wise; C row (i,*) accessed row-wise

• Misses per inner loop iteration:
    A      B      C
    0.0    0.25   0.25

Matrix multiplication (kij)
Figure 6.45 (e) P518

Figure 6.46 P519

3) (BC)

Page 24: Cache Memory

24

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A element (i,k) fixed; B row (k,*) accessed row-wise; C row (i,*) accessed row-wise

• Misses per inner loop iteration:
    A      B      C
    0.0    0.25   0.25

Matrix multiplication (ikj)
Figure 6.45 (f) P518

Figure 6.46 P519

3) (BC)

Page 25: Cache Memory

25

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A column (*,k) accessed column-wise; B element (k,j) fixed; C column (*,j) accessed column-wise

• Misses per inner loop iteration:
    A      B      C
    1.0    0.0    1.0

Matrix multiplication (jki)
Figure 6.45 (c) P518

Figure 6.46 P519

2) (AC)

Page 26: Cache Memory

26

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A column (*,k) accessed column-wise; B element (k,j) fixed; C column (*,j) accessed column-wise

• Misses per inner loop iteration:
    A      B      C
    1.0    0.0    1.0

Matrix multiplication (kji)
Figure 6.45 (d) P518

Figure 6.46 P519

2) (AC)

Page 27: Cache Memory

27

Pentium matrix multiply performance
Figure 6.47 P519

[Line graph: cycles per inner-loop iteration (0-60) versus array size n (25-400) for the six versions kji, jki, kij, ikj, jik, ijk. The curves group into the three classes: 2) (AC) slowest, 1) (AB) in the middle, 3) (BC) fastest.]

Page 28: Cache Memory

28

Pentium matrix multiply performance

• Notice that miss rates are helpful but not perfect predictors.

– Code scheduling matters, too.

Page 29: Cache Memory

29

for (i=0; i<n; i++) {

for (j=0; j<n; j++) {

sum = 0.0;

for (k=0; k<n; k++)

sum += a[i][k] * b[k][j];

c[i][j] = sum;

}

}

ijk (& jik):
  • 2 loads, 0 stores
  • misses/iter = 1.25

for (k=0; k<n; k++) {

for (i=0; i<n; i++) {

r = a[i][k];

for (j=0; j<n; j++)

c[i][j] += r * b[k][j];

}

}

for (j=0; j<n; j++) {

for (k=0; k<n; k++) {

r = b[k][j];

for (i=0; i<n; i++)

c[i][j] += a[i][k] * r;

}

}

kij (& ikj):
  • 2 loads, 1 store
  • misses/iter = 0.5

jki (& kji):
  • 2 loads, 1 store
  • misses/iter = 2.0

Summary of matrix multiplication (totals derived below)
  – ijk & jik: class 1) (AB)
  – kij & ikj: class 3) (BC)
  – jki & kji: class 2) (AC)
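
The per-iteration totals follow directly from the per-array miss counts on the earlier slides:

  ijk & jik: 0.25 (A) + 1.0 (B)  + 0.0 (C)  = 1.25
  kij & ikj: 0.0 (A)  + 0.25 (B) + 0.25 (C) = 0.5
  jki & kji: 1.0 (A)  + 0.0 (B)  + 1.0 (C)  = 2.0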

Page 30: Cache Memory

30

6.6 Putting it Together: The Impact of Caches on Program Performance

6.6.3 Using Blocking to Increase Temporal Locality

Page 31: Cache Memory

31

Improving temporal locality by blocking P520

• Example: Blocked matrix multiplication
  – “block” (in this context) does not mean “cache block”
  – Instead, it means a sub-block within the matrix
  – Example: N = 8; sub-block size = 4

Page 32: Cache Memory

32

Improving temporal locality by blocking

  [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
  [ A21 A22 ] × [ B21 B22 ] = [ C21 C22 ]

C11 = A11·B11 + A12·B21    C12 = A11·B12 + A12·B22
C21 = A21·B11 + A22·B21    C22 = A21·B12 + A22·B22

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars (see the sketch below).
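
To make the key idea concrete, here is a minimal block-level sketch (an illustrative version, not the book's Figure 6.48 code), assuming n is a multiple of bsize and that c[][] has already been zeroed; each bsize × bsize sub-block plays the role of a scalar in C_IJ += A_IK · B_KJ:

/* Sketch only (C99, variable-length array parameters):
   blocked C = A*B, written block-by-block. Assumes n is a
   multiple of bsize and c[][] starts out all zero. */
void block_matmul(int n, int bsize,
                  double a[n][n], double b[n][n], double c[n][n])
{
    for (int I = 0; I < n; I += bsize)              /* block row of C */
        for (int J = 0; J < n; J += bsize)          /* block column of C */
            for (int K = 0; K < n; K += bsize)      /* C_IJ += A_IK * B_KJ */
                for (int i = I; i < I + bsize; i++)
                    for (int j = J; j < J + bsize; j++) {
                        double sum = c[i][j];
                        for (int k = K; k < K + bsize; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}

The bijk version on the next slide blocks only the j and k dimensions (the jj and kk loops) and uses min() so that n need not be a multiple of bsize.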

Page 33: Cache Memory

33

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}

Blocked matrix multiply (bijk)
Figure 6.48 P521

Page 34: Cache Memory

34

Blocked matrix multiply analysis

• The innermost loop pair multiplies a 1 × bsize sliver of A by a bsize × bsize block of B and accumulates into a 1 × bsize sliver of C (sliver: a thin strip)

• The loop over i steps through n row slivers of A and C, using the same block of B each time (a rough miss count follows below)
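
Under the stated assumptions (32-byte blocks, 8-byte doubles, and bsize chosen so that a block of B plus a sliver of A and C fit in the cache), a rough per-pass miss count for the innermost loop pair would be:

  block of B:  bsize · bsize / 4 misses, then reused for all n row slivers
  sliver of A: bsize / 4 misses per sliver
  sliver of C: bsize / 4 misses per sliver

This is the source of blocking's payoff: the expensive block of B is loaded once per (jj, kk) pair instead of once per row.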

Page 35: Cache Memory

35

[Diagram: arrays A, B, and C with the blocked access pattern. The bsize × bsize block of B is reused n times in succession; each row sliver of A is accessed bsize times; successive elements of the C sliver are updated.]

Innermost loop pair:

for (i=0; i<n; i++) {
  for (j=jj; j < min(jj+bsize,n); j++) {
    sum = 0.0;
    for (k=kk; k < min(kk+bsize,n); k++) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] += sum;
  }
}

Blocked matrix multiply analysis
Figure 6.49 P522

Page 36: Cache Memory

36

Pentium blocked matrix multiply performance
Figure 6.50 P523

[Line graph: cycles per inner-loop iteration (0-60) versus array size n for the six unblocked versions (kji, jki, kij, ikj, jik, ijk) and the blocked versions bijk (bsize = 25) and bikj (bsize = 25); class labels 1), 2), 3) as before.]

Page 37: Cache Memory

37

6.7 Putting it Together: Exploiting Locality in Your Programs

Page 38: Cache Memory

38

Techniques P523

• Focus your attention on the inner loops, where most of the computation and memory accesses occur

• Try to maximize the spatial locality in your programs by reading data objects sequentially, in the order they are stored in memory

• Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory

• Remember that both miss rates and the total number of memory accesses determine performance

