Cache Memory
Outline
• Memory mountain
• Matrix multiplication
• Suggested Reading: 6.6, 6.7
6.6 Putting it Together: The Impact of Caches on Program Performance
6.6.1 The Memory Mountain
The Memory Mountain P512
• Read throughput (read bandwidth)
– The rate at which a program reads data from the memory system
• Memory mountain
– A two-dimensional function of read bandwidth versus temporal and spatial locality
– Characterizes the capabilities of the memory system for each computer
Memory mountain main routine (Figure 6.41, P513)
/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23) /* ... up to 8 MB */
#define MAXSTRIDE 16 /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)
int data[MAXELEMS]; /* The array we'll be traversing */
int main()
{
int size; /* Working set size (in bytes) */
int stride; /* Stride (in array elements) */
double Mhz; /* Clock frequency */
init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
Mhz = mhz(0); /* Estimate the clock frequency */
for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
for (stride = 1; stride <= MAXSTRIDE; stride++)
printf("%.1f\t", run(size, stride, Mhz));
printf("\n");
}
exit(0);
}
Memory mountain test function (Figure 6.40, P512)
/* The test function */
void test (int elems, int stride) {
int i, result = 0;
volatile int sink;
for (i = 0; i < elems; i += stride)
result += data[i];
sink = result; /* So compiler doesn't optimize away the loop */
}
/* Run test (elems, stride) and return read throughput (MB/s) */
double run (int size, int stride, double Mhz)
{
double cycles;
int elems = size / sizeof(int);
test (elems, stride); /* warm up the cache */
cycles = fcyc2(test, elems, stride, 0); /* call test (elems,stride) */
return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
The Memory Mountain
• Data
– Size: MAXBYTES (8 MB), i.e., MAXELEMS (2 M) words
– Partially accessed
• Working set: from 8 MB down to 1 KB
• Stride: from 1 to 16
The Memory Mountain (Figure 6.42, P514)

[3-D surface plot: read throughput (MB/s, 0 to 1200) as a function of stride (s1 to s15, in words) and working set size (8 MB down to 2 KB). Machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache. The surface shows ridges of temporal locality (L1, L2, and main memory plateaus) and slopes of spatial locality.]
Ridges of temporal locality
• A slice through the memory mountain with stride = 1
– illuminates the read throughputs of the different caches and of main memory
(A ridge is the long, narrow crest of a mountain.)
Ridges of temporal locality (Figure 6.43, P515)

[Plot: read throughput (MB/s, 0 to 1200) versus working set size from 8 MB down to 1 KB at stride 1, divided into an L1 cache region, an L2 cache region, and a main memory region.]
A slope of spatial locality
• A slice through the memory mountain with size = 256 KB
– shows the effect of the cache block size
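The shape of this slice can be predicted from the block size. With 4-byte ints and B = 32-byte blocks, consecutive reads at stride s words are 4s bytes apart, so the expected fraction of reads that miss is min(1, 4s/32). A minimal sketch of that model (my formula for this configuration, not code from the book):

```c
/* Predicted fraction of reads that miss, for stride s in 4-byte words
   and a 32-byte cache block: one miss per block touched. */
double miss_fraction(int stride_words)
{
    double frac = stride_words * 4.0 / 32.0;
    return frac > 1.0 ? 1.0 : frac;
}
```

At stride 1 only one access in eight misses (0.125); from stride 8 onward every access touches a new block, which is the "one access per cache line" regime in Figure 6.44.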
A slope of spatial locality (Figure 6.44, P516)

[Plot: read throughput (MB/s, 0 to 800) versus stride s1 to s16 (in words) at size = 256 KB. Throughput declines as stride increases, until there is one access per cache line.]
6.6 Putting it Together: The Impact of Caches on Program Performance
6.6.2 Rearranging Loops to Increase Spatial Locality
Matrix Multiplication P517

[Illustration: in C = AB, each element cij is the inner product of row i of A and column j of B.]
Matrix Multiplication Implementation (Figure 6.45 (a), P518)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

O(n³) adds and multiplies. Each of the n² elements of A and B is read n times.
Matrix Multiplication P517
• Assumptions:
– Each array is an n × n array of double, with sizeof(double) = 8
– There is a single cache with a 32-byte block size (B = 32)
– The array size n is so large that a single matrix row does not fit in the L1 cache
– The compiler stores local variables in registers, and thus references to local variables inside loops do not require any load or store instructions
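Under these assumptions the miss counts in the following slides can be computed directly: a row-wise (stride-1) scan of doubles misses once per cache block, i.e. sizeof(double)/B = 8/32 = 0.25 misses per access, while a column-wise scan with rows too large to cache misses on every access. A minimal sketch of that arithmetic:

```c
/* Misses per access for a stride-1 (row-wise) scan of an array of
   elem_size-byte elements: one miss per block of block_size bytes. */
double rowwise_miss_rate(int block_size, int elem_size)
{
    return (double)elem_size / block_size;
}

/* Column-wise scan with rows too large for the cache: every access misses. */
double colwise_miss_rate(void)
{
    return 1.0;
}
```

For the ijk version, A is scanned row-wise (0.25), B column-wise (1.0), and C stays in a register (0.0), giving the 1.25 misses per iteration quoted later.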
Matrix Multiplication (Figure 6.45 (a), P518)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Variable sum is held in a register.
Matrix multiplication (ijk) (Figure 6.45 (a) P518; Figure 6.46 P519)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row-wise (i,*); B column-wise (*,j); C fixed (i,j)
• Misses per inner loop iteration: A 0.25, B 1.0, C 0.0
Class 1) (AB)
Matrix multiplication (jik) (Figure 6.45 (b) P518; Figure 6.46 P519)

/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row-wise (i,*); B column-wise (*,j); C fixed (i,j)
• Misses per inner loop iteration: A 0.25, B 1.0, C 0.0
Class 1) (AB)
Matrix multiplication (kij) (Figure 6.45 (e) P518; Figure 6.46 P519)

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A fixed (i,k); B row-wise (k,*); C row-wise (i,*)
• Misses per inner loop iteration: A 0.0, B 0.25, C 0.25
Class 3) (BC)
Matrix multiplication (ikj) (Figure 6.45 (f) P518; Figure 6.46 P519)

/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A fixed (i,k); B row-wise (k,*); C row-wise (i,*)
• Misses per inner loop iteration: A 0.0, B 0.25, C 0.25
Class 3) (BC)
Matrix multiplication (jki) (Figure 6.45 (c) P518; Figure 6.46 P519)

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column-wise (*,k); B fixed (k,j); C column-wise (*,j)
• Misses per inner loop iteration: A 1.0, B 0.0, C 1.0
Class 2) (AC)
Matrix multiplication (kji) (Figure 6.45 (d) P518; Figure 6.46 P519)

/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column-wise (*,k); B fixed (k,j); C column-wise (*,j)
• Misses per inner loop iteration: A 1.0, B 0.0, C 1.0
Class 2) (AC)
Pentium matrix multiply performance (Figure 6.47, P519)

[Plot: cycles per inner-loop iteration (0 to 60) versus array size n (25 to 400) for kji, jki, kij, ikj, jik, ijk. The curves fall into three classes: 2) (AC) is slowest, 1) (AB) is in the middle, and 3) (BC) is fastest.]
Pentium matrix multiply performance
• Notice that miss rates are helpful but not perfect predictors.
– Code scheduling matters, too.
Summary of matrix multiplication

ijk (& jik), Class 1) (AB): 2 loads, 0 stores; misses/iter = 1.25

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj), Class 3) (BC): 2 loads, 1 store; misses/iter = 0.5

for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji), Class 2) (AC): 2 loads, 1 store; misses/iter = 2.0

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
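All six orderings compute the same product; only their memory behavior differs. A small self-contained check (matrix size and input values chosen here purely for illustration) that the three representative orderings agree:

```c
#include <string.h>

#define N 16   /* small illustrative size; the measurements above ran up to n = 400 */

static double a[N][N], b[N][N];

static void mm_ijk(double c[N][N]) {
    int i, j, k; double sum;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            sum = 0.0;
            for (k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

static void mm_kij(double c[N][N]) {
    int i, j, k; double r;
    memset(c, 0, sizeof(double) * N * N);   /* must pre-zero C */
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++) {
            r = a[i][k];
            for (j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

static void mm_jki(double c[N][N]) {
    int i, j, k; double r;
    memset(c, 0, sizeof(double) * N * N);
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++) {
            r = b[k][j];
            for (i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}

/* Fill the inputs with small integers (exact in double) and compare results. */
int orderings_agree(void)
{
    static double c1[N][N], c2[N][N], c3[N][N];
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + 2 * j;
            b[i][j] = i - j;
        }
    mm_ijk(c1); mm_kij(c2); mm_jki(c3);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j] || c1[i][j] != c3[i][j])
                return 0;
    return 1;
}
```

Note that the kij and jki variants accumulate into C, so C must be zeroed first; that is the extra store per iteration in the summary above.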
6.6 Putting it Together: The Impact of Caches on Program Performance
6.6.3 Using Blocking to Increase Temporal Locality
Improving temporal locality by blocking P520
• Example: Blocked matrix multiplication
– "block" (in this context) does not mean "cache block"
– Instead, it means a sub-block within the matrix
– Example: N = 8; sub-block size = 4
Improving temporal locality by blocking

[ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
[ A21 A22 ] × [ B21 B22 ] = [ C21 C22 ]

C11 = A11 B11 + A12 B21    C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21    C22 = A21 B12 + A22 B22

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
Blocked matrix multiply (bijk) (Figure 6.48, P521)

for (jj = 0; jj < n; jj += bsize) {
    for (i = 0; i < n; i++)
        for (j = jj; j < min(jj+bsize, n); j++)
            c[i][j] = 0.0;
    for (kk = 0; kk < n; kk += bsize) {
        for (i = 0; i < n; i++) {
            for (j = jj; j < min(jj+bsize, n); j++) {
                sum = 0.0;
                for (k = kk; k < min(kk+bsize, n); k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] += sum;
            }
        }
    }
}
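Note that min is not a standard C function; Figure 6.48 assumes it is defined, e.g. as a macro. A quick way to convince yourself the blocked version is correct, including when n is not a multiple of bsize, is to compare it against the straightforward ijk multiply on a small matrix (the size and helper names here are illustrative):

```c
#define N 10                       /* deliberately not a multiple of bsize */
#define min(x, y) ((x) < (y) ? (x) : (y))

static double a[N][N], b[N][N];

static void naive(double c[N][N]) {
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

/* Figure 6.48's bijk loop structure, specialized to n = N */
static void bijk(double c[N][N], int bsize) {
    int i, j, k, kk, jj; double sum;
    for (jj = 0; jj < N; jj += bsize) {
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj + bsize, N); j++)
                c[i][j] = 0.0;
        for (kk = 0; kk < N; kk += bsize) {
            for (i = 0; i < N; i++) {
                for (j = jj; j < min(jj + bsize, N); j++) {
                    sum = 0.0;
                    for (k = kk; k < min(kk + bsize, N); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
            }
        }
    }
}

/* Compare the two on small-integer inputs (exact in double). */
int blocked_matches_naive(int bsize)
{
    static double c1[N][N], c2[N][N];
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
    naive(c1);
    bijk(c2, bsize);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j])
                return 0;
    return 1;
}
```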
Blocked matrix multiply analysis
• The innermost loop pair multiplies a 1 × bsize sliver of A by a bsize × bsize block of B and accumulates the result into a 1 × bsize sliver of C
– The loop over i steps through n row slivers of A and C, using the same block of B
(A sliver is a long, thin strip.)
Blocked matrix multiply analysis (Figure 6.49, P522)

Innermost loop pair:

for (i = 0; i < n; i++) {
    for (j = jj; j < min(jj+bsize, n); j++) {
        sum = 0.0;
        for (k = kk; k < min(kk+bsize, n); k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}

[Figure: the bsize × bsize block of B is reused n times in succession; each row sliver of A is accessed bsize times; successive elements of a sliver of C are updated.]
Pentium blocked matrix multiply performance (Figure 6.50, P523)

[Plot: cycles per inner-loop iteration (0 to 60) versus array size n for kji, jki, kij, ikj, jik, ijk, plus the blocked versions bijk (bsize = 25) and bikj (bsize = 25). The blocked versions run at a roughly constant rate across array sizes.]
6.7 Putting it Together: Exploring Locality in Your Programs
Techniques P523
• Focus your attention on the inner loops
• Try to maximize the spatial locality in your programs by reading data objects sequentially, in the order they are stored in memory
• Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory
• Use miss rates and the number of memory accesses to evaluate the locality of your code
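As a concrete instance of the first two points, summing a 2-D array row by row follows the row-major storage order, while a column-by-column traversal strides through memory. Both compute the same result; only the locality differs (the array size and fill values here are illustrative):

```c
#define ROWS 256
#define COLS 256

static int grid[ROWS][COLS];

void fill_grid(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] = i + j;
}

/* Good spatial locality: the inner loop walks memory sequentially. */
long sum_rowwise(void)
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += grid[i][j];
    return s;
}

/* Poor spatial locality: the inner loop strides COLS * sizeof(int) bytes. */
long sum_colwise(void)
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += grid[i][j];
    return s;
}
```

With a row too large for the cache, the row-wise version misses once per block while the column-wise version can miss on every access, exactly the contrast seen in the ijk-family analysis above.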