Date post: 20-Dec-2015
Embedded Computer Architecture (5KK73, TU/e)
Henk Corporaal, Bart Mesman

Data Memory Management
Part d: Data Layout for Caches


@H.C. Embedded Computer Architecture 2

Data layout for caches

• Caches are hardware controlled
• Therefore: no explicit reuse copy code needed in your code!
• What can we still do to improve performance?
• Topics:
  – Cache principles
  – The 3 C's: Compulsory, Capacity and Conflict misses
  – Data layout examples reducing misses

Cache operation (direct mapped cache)

[Figure: blocks (lines) of the memory / lower level are copied into the cache / higher level; each cache entry holds a tag and the data of one block or line.]

Why does a cache work?

• Principle of locality
  – Temporal locality: an accessed item has a high probability of being accessed again in the near future
  – Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next
• Check for yourself why there is temporal and spatial locality for instruction accesses and for data accesses
  – Regular programs have high instruction and data locality

Direct mapped cache

[Figure: a 32-bit address (bit positions 31..0) is split into a 20-bit tag (bits 31..12), a 10-bit index (bits 11..2) and a 2-bit byte offset (bits 1..0). The index selects one of 1024 entries (0..1023), each holding a valid bit, a 20-bit tag and 32 bits of data. Outputs: Hit and Data.]

Direct mapped cache: larger blocks

• Taking advantage of spatial locality

[Figure: a 32-bit address (bit positions 31..0) is split into a 16-bit tag (bits 31..16), a 12-bit index (bits 15..4), a 2-bit block offset (bits 3..2) and a 2-bit byte offset (bits 1..0). The cache has 4K entries, each with a valid bit (V), a 16-bit tag and 128 bits of data (four 32-bit words); a multiplexor driven by the block offset selects one 32-bit word. Outputs: Hit and Data.]

Performance: effect of block size

• Increasing the block (or line) size tends to decrease miss rate

[Figure: miss rate (0% to 40%) versus block size in bytes (4, 16, 64, 256), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB.]

Cache principles

[Figure: the CPU issues a p-bit address (virtual or physical), split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits). The cache has 2^k lines; each cache line or block stores a tag of p-k-m bits and 2^m bytes of data. The stored tag is compared with the address tag ("Hit?"); main memory sits behind the cache.]

4 Cache Architecture Fundamentals

1. Block placement
   – Where in the cache will a new block be placed?
2. Block identification
   – How is a block found in the cache?
3. Block replacement policy
   – Which block is evicted from the cache?
4. Updating policy
   – When is a block written from cache to memory?
   – Write-Through vs. Write-Back caches

Block placement policies

[Figure: mapping of memory blocks 0..15 onto a cache with blocks 0..7.
 Fully associative (one-to-many): a memory block can go anywhere in the cache.
 Direct mapped (one-to-one): a memory block can go in one cache location only ("Here only!").]

4-way associative cache

• 4 ways
• 256 sets

[Figure: a 32-bit address (bit positions 31..0) is split into a 22-bit tag (bits 31..10), an 8-bit index (bits 9..2) and a 2-bit byte offset (bits 1..0). The index selects one of 256 sets (0..255); each set has four entries with a valid bit (V), a tag and 32 bits of data. The four tag comparisons drive a 4-to-1 multiplexor. Outputs: Hit and Data.]

Performance: effect of associativity

[Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way), one curve per cache size: 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB.]

Cache Basics

• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size (in bytes)
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently

Address fields, from bit 31 down to bit 0: tag, index, block offset. The tag and index together form the block address.
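The formulas above can be sketched in C as follows. This is a minimal illustration, assuming 32-bit byte addresses and power-of-two block size and set count; the type `cache_addr_t` and the functions `split_divmod` and `split_shift` are hypothetical names, not part of the course material.

```c
#include <stdint.h>

/* Hypothetical helper type, introduced for illustration only. */
typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} cache_addr_t;

/* Split a byte address with DIV and MOD, as written on the slide. */
cache_addr_t split_divmod(uint32_t byte_addr, uint32_t block_size, uint32_t nsets)
{
    cache_addr_t r;
    uint32_t block_addr = byte_addr / block_size; /* Byte_address DIV Block_size */
    r.offset = byte_addr % block_size;
    r.index  = block_addr % nsets;                /* Block_address MOD Nsets */
    r.tag    = block_addr / nsets;
    return r;
}

/* Same split with shifts and masks. Only valid when block size and number
 * of sets are powers of two, and offset_bits + index_bits < 32. */
cache_addr_t split_shift(uint32_t byte_addr, unsigned offset_bits, unsigned index_bits)
{
    cache_addr_t r;
    r.offset = byte_addr & ((1u << offset_bits) - 1u);
    r.index  = (byte_addr >> offset_bits) & ((1u << index_bits) - 1u);
    r.tag    = byte_addr >> (offset_bits + index_bits);
    return r;
}
```

For a cache with 16-byte blocks and 4096 sets, `split_divmod(addr, 16, 4096)` and `split_shift(addr, 4, 12)` return the same fields, which is why hardware can implement DIV and MOD by simply selecting address bits.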

Example 1

• Assume
  – Cache of 4K blocks, with 4-word block size
  – 32-bit addresses
• Direct mapped (associativity = 1):
  – 16 bytes per block = 2^4, so 4 (2+2) bits for byte and word offsets
  – 32-bit address: 32 - 4 = 28 bits for index and tag
  – #sets = #blocks / associativity = 4K; log2(4K) = 12, so 12 bits for index
  – Total number of tag bits: (28 - 12) * 4K = 64 Kbits
• 2-way associative:
  – #sets = #blocks / associativity = 2K sets
  – 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
  – Tag bits: (28 - 11) * 2 * 2K = 68 Kbits
• 4-way associative:
  – #sets = #blocks / associativity = 1K sets
  – 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
  – Tag bits: (28 - 10) * 4 * 1K = 72 Kbits
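The arithmetic of this example can be checked with a short C sketch. It assumes power-of-two sizes; `floor_log2` and `total_tag_bits` are illustrative names introduced here, not functions from the course.

```c
/* Integer log2 for powers of two (illustrative helper). */
unsigned floor_log2(unsigned x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Total tag storage in bits: tag width times number of blocks. */
unsigned long total_tag_bits(unsigned addr_bits, unsigned nblocks,
                             unsigned block_bytes, unsigned assoc)
{
    unsigned nsets       = nblocks / assoc;          /* #sets = #blocks / associativity */
    unsigned offset_bits = floor_log2(block_bytes);  /* byte + word offset bits */
    unsigned index_bits  = floor_log2(nsets);
    unsigned tag_bits    = addr_bits - offset_bits - index_bits;
    return (unsigned long)tag_bits * nblocks;
}
```

With 32-bit addresses, 4096 blocks and 16-byte blocks, `total_tag_bits(32, 4096, 16, assoc)` gives 64 Kbits, 68 Kbits and 72 Kbits for `assoc` = 1, 2 and 4, matching the figures above.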

Example 2

3 caches consisting of 4 one-word blocks:
• Cache 1: fully associative
• Cache 2: two-way set associative
• Cache 3: direct mapped

Suppose the following sequence of block addresses: 0, 8, 0, 6, 8

Example 2: Direct Mapped

Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Address of      Hit or   Location 0   Location 1   Location 2   Location 3
memory block    miss
0               miss     Mem[0]*
8               miss     Mem[8]*
0               miss     Mem[0]*
6               miss     Mem[0]                    Mem[6]*
8               miss     Mem[8]*                   Mem[6]

* = new entry = miss (coloured in the original slide)

Example 2: 2-way Set Associative (4/2 = 2 sets)

Block address   Cache set
0               0 mod 2 = 0
6               6 mod 2 = 0
8               8 mod 2 = 0
(so all in set/location 0)

Address of      Hit or   Set 0      Set 0      Set 1      Set 1
memory block    miss     entry 0    entry 1    entry 0    entry 1
0               Miss     Mem[0]
8               Miss     Mem[0]     Mem[8]
0               Hit      Mem[0]     Mem[8]
6               Miss     Mem[0]     Mem[6]
8               Miss     Mem[8]     Mem[6]

On a miss in a full set, the least recently used block is replaced.

Example 2: Fully associative (4-way assoc., 4/4 = 1 set)

Address of      Hit or   Block 0    Block 1    Block 2    Block 3
memory block    miss
0               Miss     Mem[0]
8               Miss     Mem[0]     Mem[8]
0               Hit      Mem[0]     Mem[8]
6               Miss     Mem[0]     Mem[8]     Mem[6]
8               Hit      Mem[0]     Mem[8]     Mem[6]

Cache Fundamentals: The “Three C's”

• Compulsory Misses
  – 1st access to a block: never in the cache
• Capacity Misses
  – Cache cannot contain all the blocks
  – Blocks are discarded and retrieved later
  – Avoided by increasing cache size
• Conflict Misses
  – Too many blocks mapped to same set
  – Avoided by increasing associativity
• Some add a 4th C: Coherence Misses
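One common way to make the three categories operational is sketched below. It assumes LRU replacement and one-word blocks. A miss on a block never referenced before counts as compulsory. A repeat miss that would also miss in a fully associative cache of the same size counts as capacity. Any other repeat miss counts as conflict. All names (`cache_t`, `cache_access`, `classify_misses`) and the size limits are hypothetical choices for this sketch.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 64    /* sketch limit on cache size in blocks */
#define MAX_ADDR   1024  /* sketch limit: block addresses must be below this */

typedef struct {
    unsigned nblocks, assoc;
    unsigned tag[MAX_BLOCKS];
    unsigned long stamp[MAX_BLOCKS];  /* 0 = empty, else time of last use */
    unsigned long clock;
} cache_t;

typedef struct { int compulsory, capacity, conflict; } miss_counts_t;

static void cache_init(cache_t *c, unsigned nblocks, unsigned assoc)
{
    memset(c, 0, sizeof *c);
    c->nblocks = nblocks;
    c->assoc = assoc;
}

/* Access one block address; returns 1 on hit, 0 on miss (LRU replacement). */
static int cache_access(cache_t *c, unsigned block)
{
    unsigned nsets = c->nblocks / c->assoc;
    unsigned base = (block % nsets) * c->assoc;   /* first entry of the set */
    unsigned victim = base;
    int hit = 0;
    for (unsigned w = base; w < base + c->assoc; w++) {
        if (c->stamp[w] && c->tag[w] == block) { victim = w; hit = 1; break; }
        if (c->stamp[w] < c->stamp[victim]) victim = w;  /* empty or older entry */
    }
    c->tag[victim] = block;
    c->stamp[victim] = ++c->clock;
    return hit;
}

miss_counts_t classify_misses(const unsigned *trace, size_t n,
                              unsigned nblocks, unsigned assoc)
{
    cache_t real, full;
    unsigned char seen[MAX_ADDR] = {0};
    miss_counts_t m = {0, 0, 0};
    cache_init(&real, nblocks, assoc);
    cache_init(&full, nblocks, nblocks);  /* same size, fully associative */
    for (size_t t = 0; t < n; t++) {
        unsigned b = trace[t];
        int hit_real = cache_access(&real, b);
        int hit_full = cache_access(&full, b);
        if (!hit_real) {
            if (!seen[b])       m.compulsory++;
            else if (!hit_full) m.capacity++;
            else                m.conflict++;
        }
        seen[b] = 1;
    }
    return m;
}
```

Applied to the sequence 0, 8, 0, 6, 8 of Example 2 with 4 blocks, the direct mapped cache gets 3 compulsory and 2 conflict misses, the two-way cache 3 compulsory and 1 conflict miss, and the fully associative cache only the 3 compulsory misses.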

Compulsory miss example

for(i=0; i<10; i++) A[i] = f(B[i]);

[Figure: cache contents at i=2 (holding A[0..2] and B[0..2], remaining lines empty) and at i=3.]

• B[3], A[3] required
• B[3] never loaded before: loaded into cache
• A[3] never loaded before: allocates new line

Capacity miss example

Cache size: 8 blocks of 1 word, fully associative

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Cache contents after each iteration:

i=0:  B[3]  B[0]  A[0]
i=1:  B[3]  B[0]  A[0]  B[4]  B[1]  A[1]
i=2:  A[2]  B[0]  A[0]  B[4]  B[1]  A[1]  B[5]  B[2]
i=3:  A[2]  B[6]  B[3]  A[3]  B[1]  A[1]  B[5]  B[2]
i=4:  A[2]  B[6]  B[3]  A[3]  B[7]  B[4]  A[4]  B[2]
i=5:  B[5]  A[5]  B[3]  A[3]  B[7]  B[4]  A[4]  B[8]
i=6:  B[5]  A[5]  B[9]  B[6]  A[6]  B[4]  A[4]  B[8]
i=7:  B[5]  A[5]  B[9]  B[6]  A[6]  B[10] B[7]  A[7]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Conflict miss example

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: main memory layout and a direct mapped cache with cache addresses 0..7, shown at i=0.]

Memory address   Contents              Cache address
0..3             A[0]..A[3]            0..3
4..7             B[0][0]..B[3][0]      4..7
8..11            B[0][1]..B[3][1]      0..3
12..15           B[0][2]..B[3][2]      4..7
16..             B[0][3] ...           0..
...              ... B[3][9]

• j even: A[0] is at cache address 0 and B[0][j] at cache address 4
• j odd: A[0] and B[0][j] both map to cache address 0
• A[i] is read 10 times
• A[0] is flushed in favor of B[0][j] -> miss, so A[0] is loaded multiple times

“Three C's” vs Cache size [Gee93]

[Figure: relative/absolute misses (vertical axis, 0.00 to 0.15) versus cache size in KB (1, 2, 4, 8, 16, 32, 64), with curves for total misses, compulsory misses, capacity misses and conflict misses.]


Data layout may reduce cache misses

Example 1: Capacity & Compulsory miss reduction

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

[Figure: same cache contents per iteration (i=0 to i=7) as in the capacity miss example: 8 blocks of 1 word, fully associative.]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Fit data in cache with in-place mapping

for(i=0; i<12; i++) A[i] = B[i+3]+B[i];

• Traditional analysis: max = 27 words (12 for A[] plus 15 for B[])
• Detailed analysis: max = 15 words

[Figure: number of live words of A[] and B[] plotted against i; the merged array AB[new] fits in the cache memory (16 words), shown next to main memory (16 words).]
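The two bounds can be reproduced by counting live words per iteration. The sketch below assumes the loop shape `A[i] = B[i+shift] + B[i]` with 0 <= shift <= n, that B[k] is dead after its last read, and that A[i] may reuse the location of B[i]. The function names are illustrative only.

```c
/* Traditional analysis: every element of A and B gets its own word. */
int traditional_words(int n, int shift)
{
    return n + (n + shift);   /* A[0..n-1] plus B[0..n+shift-1] */
}

/* Detailed analysis: peak number of simultaneously live words for
 *   for (i = 0; i < n; i++) A[i] = B[i + shift] + B[i];
 * B[k] is last read in iteration k when k < n (as B[i]), otherwise in
 * iteration k - shift (as B[i+shift]). */
int peak_live_words(int n, int shift)
{
    int nb = n + shift;
    int peak = nb;            /* before the loop only B is live */
    for (int i = 0; i < n; i++) {
        int live_b = 0;
        for (int k = 0; k < nb; k++) {
            int last_read = (k < n) ? k : k - shift;
            if (last_read >= i)
                live_b++;     /* still needed in iteration i or later */
        }
        int live_a = i;       /* A[0..i-1] written; A[i] reuses B[i]'s word */
        if (live_a + live_b > peak)
            peak = live_a + live_b;
    }
    return peak;
}
```

For n = 12 and shift = 3 this gives 27 words for the traditional count and a peak of 15 live words, so the merged array fits in a 16-word cache.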

Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++) AB[i] = AB[i+3]+AB[i];

Cache contents after each iteration:

i=0:  AB[3]  AB[0]
i=1:  AB[3]  AB[0]  AB[4]  AB[1]
i=2:  AB[3]  AB[0]  AB[4]  AB[1]  AB[5]  AB[2]
i=3:  AB[3]  AB[0]  AB[4]  AB[1]  AB[5]  AB[2]   AB[6]
i=4:  AB[3]  AB[0]  AB[4]  AB[1]  AB[5]  AB[2]   AB[6]  AB[7]
i=5:  AB[3]  AB[8]  AB[4]  AB[1]  AB[5]  AB[2]   AB[6]  AB[7]
i=6:  AB[3]  AB[8]  AB[4]  AB[9]  AB[5]  AB[2]   AB[6]  AB[7]
i=7:  AB[3]  AB[8]  AB[4]  AB[9]  AB[5]  AB[10]  AB[6]  AB[7]

• 11 compulsory misses
• 5 cache hits (+8 write hits)
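The effect of in-place mapping on this loop can be checked with a small simulation. It assumes an 8-block fully associative cache with one-word blocks, LRU replacement, write-allocate, and the access order read B[i+3], read B[i], write A[i]. The base addresses, `stats_t`, `touch` and `run_loop` are hypothetical choices for this sketch.

```c
#include <string.h>

#define CACHE_BLOCKS 8   /* 8 one-word blocks, fully associative */
#define MEM_WORDS    64  /* size of the toy address space in this sketch */

typedef struct { int first_touch, refetch, hits; } stats_t;

typedef struct {
    int addr[CACHE_BLOCKS];
    long stamp[CACHE_BLOCKS];       /* 0 = empty, else time of last use */
    long clock;
    unsigned char seen[MEM_WORDS];  /* word referenced at least once */
    stats_t st;
} sim_t;

/* Reference one word; updates hit / miss statistics (LRU replacement). */
static void touch(sim_t *s, int a)
{
    int victim = 0;
    for (int w = 0; w < CACHE_BLOCKS; w++) {
        if (s->stamp[w] && s->addr[w] == a) {   /* hit */
            s->stamp[w] = ++s->clock;
            s->st.hits++;
            return;
        }
        if (s->stamp[w] < s->stamp[victim])
            victim = w;                         /* empty or least recently used */
    }
    if (s->seen[a]) s->st.refetch++;            /* capacity miss */
    else            s->st.first_touch++;        /* compulsory / first write miss */
    s->seen[a] = 1;
    s->addr[victim] = a;
    s->stamp[victim] = ++s->clock;
}

/* in_place = 0: A and B are separate arrays; in_place = 1: A[i] overwrites B[i]. */
stats_t run_loop(int n, int in_place)
{
    sim_t s;
    memset(&s, 0, sizeof s);
    int base_b = 0;
    int base_a = in_place ? base_b : 32;   /* hypothetical placement of A */
    for (int i = 0; i < n; i++) {
        touch(&s, base_b + i + 3);   /* read B[i+3] */
        touch(&s, base_b + i);       /* read B[i] */
        touch(&s, base_a + i);       /* write A[i] (write-allocate) */
    }
    return s.st;
}
```

Under these assumptions, `run_loop(8, 0)` reports 19 first-touch misses (11 reads plus 8 writes) and 5 refetches, while `run_loop(8, 1)` reports 11 first-touch misses and 13 hits (5 reads plus 8 writes), in line with the counts on the slides.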

Example 2: Conflict miss reduction

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: same main memory layout and direct mapped cache (cache addresses 0..7, shown at i=0) as in the conflict miss example.]

• j even: A[0] is at cache address 0 and B[0][j] at cache address 4
• j odd: A[0] and B[0][j] both map to cache address 0
• A[i] is read 10 times
• A[0] is flushed in favor of B[0][j] -> miss, so A[0] is loaded multiple times

Avoid conflict miss with main memory data layout

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure (© imec 2001): main memory layout with gaps and the resulting cache contents at i=0.]

Main memory               Cache address
A[0]..A[3]                0..3
B[0][0]..B[3][0]          4..7
(leave gap)
B[0][1]..B[3][1]          4..7
(leave gap)
B[0][2] ...               4..
...
B[3][9]                   7

• For any j, A[0] stays at cache address 0 and B[0][j] maps to a different cache address: no conflict
• A[i] is read multiple times
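The benefit of the gaps can be estimated with a small direct mapped cache model. The sketch assumes 8 one-word cache lines, write-allocate, A[i] at word address i, column j of B starting at address 4 + col_stride * j, and the access order read A[i], read B[i][j], write A[i]. The function `loop_misses` and the parameter `col_stride` are illustrative names.

```c
#define LINES 8   /* direct mapped cache with 8 one-word lines */

/* Count misses of
 *   for (j = 0; j < 10; j++) for (i = 0; i < 4; i++) A[i] = A[i] + B[i][j];
 * col_stride = 4: columns of B packed back to back (original layout).
 * col_stride = 8: a 4-word gap after each column (layout with gaps). */
int loop_misses(int col_stride)
{
    int line[LINES];
    int misses = 0;
    for (int l = 0; l < LINES; l++)
        line[l] = -1;                           /* empty line */

    for (int j = 0; j < 10; j++) {
        for (int i = 0; i < 4; i++) {
            int addr[3];
            addr[0] = i;                        /* read A[i] */
            addr[1] = 4 + col_stride * j + i;   /* read B[i][j] */
            addr[2] = i;                        /* write A[i] (write-allocate) */
            for (int k = 0; k < 3; k++) {
                int l = addr[k] % LINES;        /* cache address */
                if (line[l] != addr[k]) {
                    misses++;
                    line[l] = addr[k];
                }
            }
        }
    }
    return misses;
}
```

Under these assumptions `loop_misses(4)` yields 64 misses and `loop_misses(8)` yields 44, which are exactly the first accesses to the 4 words of A and the 40 words of B. The price of the layout with gaps is the unused memory between the columns.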

Data Layout Organization for Direct Mapped Caches

[Figure: miss rate (%) (0 to 16) versus cache size (512 bytes, 1 KB, 2 KB) for three configurations: Initial - Direct Mapped; Data Layout Org - Direct Mapped; Initial - Fully Assoc.]

Conclusions on Data Management

• In multi-media applications, exploring data transfer and storage issues should be done at source code level
• DMM method:
  – Reducing number of external memory accesses
  – Reducing external memory size
  – Trade-offs between internal memory complexity and speed
  – Platform independent high-level transformations
  – Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
  – Substantial energy reduction
• Although caches are hardware controlled, data layout can largely influence the miss rate

