Embedded Computer Architecture
5KK73 TU/e Henk Corporaal
Bart Mesman
Data Memory Management
Part d: Data Layout for Caches
@H.C. Embedded Computer Architecture 2
Data layout for caches
• Caches are hardware controlled
• Therefore: no explicit reuse-copy code is needed in your code!
• What can we still do to improve performance?
• Topics:
  – Cache principles
  – The 3 C's: Compulsory, Capacity and Conflict misses
  – Data layout examples reducing misses
Cache operation (direct mapped cache)
[Figure: a direct-mapped cache (higher level) holding blocks/lines, each with a tag and data, mapped onto main memory (lower level).]
Why does a cache work?
• Principle of Locality
  – Temporal locality: an accessed item has a high probability of being accessed again in the near future
  – Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next
• Check yourself why there is temporal and spatial locality for instruction accesses and for data accesses
  – Regular programs have high instruction and data locality
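The effect of spatial locality can be seen in how a 2-D array is traversed. A minimal sketch (the array size, element type and function names are illustrative, not from the slides): both functions compute the same sum, but the row-major version visits consecutive addresses and uses every fetched cache line fully, while the column-major version jumps a whole row's worth of bytes between accesses.

```c
#include <assert.h>

#define ROWS 64
#define COLS 64

/* Row-major traversal: consecutive accesses touch adjacent addresses,
 * so each fetched cache line is fully used (good spatial locality). */
long sum_row_major(int m[ROWS][COLS]) {
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same row-major array: successive
 * accesses are COLS*sizeof(int) bytes apart, so each access may touch
 * a different cache line. Same result, worse locality. */
long sum_col_major(int m[ROWS][COLS]) {
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Both functions return the same value; only the memory access order, and hence the miss rate, differs.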
Direct mapped cache
[Figure: a direct-mapped cache with 1024 entries, each holding a valid bit, a 20-bit tag and 32-bit data. The 32-bit address splits into tag (bits 31..12), index (bits 11..2) and byte offset (bits 1..0); the index selects an entry, and a hit is signalled when the stored tag matches the address tag and the valid bit is set.]
Direct mapped cache: larger blocks
• Taking advantage of spatial locality:
[Figure: a direct-mapped cache with 4K entries and 16-byte (128-bit, 4-word) blocks. The address splits into a 16-bit tag (bits 31..16), a 12-bit index (bits 15..4), a 2-bit block offset (bits 3..2) and a byte offset (bits 1..0); the block offset drives a 4-to-1 multiplexer selecting one of the four 32-bit words of the block.]
Performance: effect of block size
• Increasing the block (or line) size tends to decrease the miss rate
[Figure: miss rate (0–40%) versus block size (4 to 256 bytes), with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]
Cache principles
[Figure: the CPU issues a virtual or physical address of p bits, split into a tag (p-k-m bits), a k-bit index and an m-bit byte address. The cache holds 2^k lines (blocks) of 2^m bytes; the stored tag of the indexed line is compared against the address tag to produce the hit signal, and misses go to main memory.]
4 Cache Architecture Fundamentals
1. Block placement – Where in the cache will a new block be placed?
2. Block identification – How is a block found in the cache?
3. Block replacement policy – Which block is evicted from the cache?
4. Updating policy – When is a block written from cache to memory?
   – Write-Through vs. Write-Back caches
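The difference between the two updating policies can be sketched with a toy one-block cache (the struct, counter and function names below are hypothetical, not the lecture's notation):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-block cache, just enough to contrast the two policies. */
typedef struct {
    int tag;      /* which memory word the block holds (-1 = empty) */
    int data;
    bool dirty;   /* used only by write-back */
} Block;

int mem_writes;   /* counts writes that reach main memory */

/* Write-through: every store is forwarded to memory immediately. */
void write_through(Block *b, int addr, int value, int *memory) {
    b->tag = addr;
    b->data = value;
    memory[addr] = value;
    mem_writes++;
}

/* Write-back: the store only marks the block dirty; memory is
 * updated once, when the block is evicted. */
void write_back(Block *b, int addr, int value) {
    b->tag = addr;
    b->data = value;
    b->dirty = true;
}

void evict(Block *b, int *memory) {
    if (b->tag >= 0 && b->dirty) {
        memory[b->tag] = b->data;   /* single deferred memory write */
        mem_writes++;
        b->dirty = false;
    }
}
```

Write-through keeps memory up to date at the cost of one memory write per store; write-back coalesces repeated stores to the same block into a single write at eviction time.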
Block placement policies
[Figure: a 16-block memory (blocks 0–15) mapped onto an 8-block cache.
– Fully associative (one-to-many): a memory block may be placed anywhere in the cache.
– Direct mapped (one-to-one): each memory block maps to exactly one cache location ("here only!").]
4-way set-associative cache
• 4 ways
• 256 sets
[Figure: the 32-bit address splits into a 22-bit tag (bits 31..10), an 8-bit index (bits 9..2) and a byte offset. The index selects one of 256 sets; the tag is compared in parallel against all four ways (each with a valid bit, tag and data), and a 4-to-1 multiplexer delivers the data of the hitting way.]
Performance: effect of associativity
[Figure: miss rate (0–15%) versus associativity (one-way, two-way, four-way, eight-way), with one curve per cache size from 1 KB to 128 KB.]
Cache Basics
• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size_in_bytes
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently

Address layout:  | tag | index | block offset |  (tag and index together form the block address)
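With power-of-two parameters the DIV and MOD above reduce to shifts and masks. A small sketch, using illustrative parameters (16-byte blocks, 1024 sets; the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Address decomposition for a cache whose block size and number of
 * sets are powers of two: DIV and MOD become shifts and masks. */
enum { BLOCK_SIZE = 16, NSETS = 1024 };   /* example parameters */

uint32_t block_address(uint32_t byte_address) {
    return byte_address / BLOCK_SIZE;        /* compiles to >> 4 */
}

uint32_t cache_index(uint32_t byte_address) {
    return block_address(byte_address) % NSETS;   /* low 10 bits */
}

uint32_t cache_tag(uint32_t byte_address) {
    return block_address(byte_address) / NSETS;   /* remaining high bits */
}
```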
Example 1
• Assume
  – Cache of 4K blocks, with 4-word block size
  – 32-bit addresses
• Direct mapped (associativity = 1):
  – 16 bytes per block = 2^4 : 4 (2+2) bits for byte and word offsets
  – 32-bit address: 32-4 = 28 bits for index and tag
  – #sets = #blocks / associativity; log2 of 4K = 12 : 12 bits for index
  – Total number of tag bits: (28-12) * 4K = 64 Kbits
• 2-way associative:
  – #sets = #blocks / associativity : 2K sets
  – 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
  – Tag bits: (28-11) * 2 * 2K = 68 Kbits
• 4-way associative:
  – #sets = #blocks / associativity : 1K sets
  – 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
  – Tag bits: (28-10) * 4 * 1K = 72 Kbits
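The three tag-storage figures can be checked mechanically. A sketch (the function names are mine; the parameters are those of Example 1):

```c
#include <assert.h>

/* Integer log2 for exact powers of two. */
int log2i(int x) { int n = 0; while (x > 1) { x /= 2; n++; } return n; }

/* Total tag storage in bits: one tag per block, where
 * tag_bits = addr_bits - offset_bits - index_bits. */
int total_tag_bits(int nblocks, int assoc, int block_size, int addr_bits) {
    int nsets = nblocks / assoc;
    int offset_bits = log2i(block_size);          /* byte + word offset */
    int index_bits = log2i(nsets);
    int tag_bits = addr_bits - offset_bits - index_bits;
    return tag_bits * nblocks;                    /* one tag per block */
}
```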
Example 2

3 caches, each consisting of 4 one-word blocks:
• Cache 1: fully associative
• Cache 2: two-way set associative
• Cache 3: direct mapped

Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Example 2: Direct Mapped

Block address | Cache block
0             | 0 mod 4 = 0
6             | 6 mod 4 = 2
8             | 8 mod 4 = 0

Access | Hit or miss | Location 0 | Location 1 | Location 2 | Location 3
0      | miss        | Mem[0]     |            |            |
8      | miss        | Mem[8]     |            |            |
0      | miss        | Mem[0]     |            |            |
6      | miss        | Mem[0]     |            | Mem[6]     |
8      | miss        | Mem[8]     |            | Mem[6]     |

(Coloured = new entry = miss)
Example 2: 2-way Set Associative (4/2 = 2 sets)

Block address | Cache set
0             | 0 mod 2 = 0
6             | 6 mod 2 = 0
8             | 8 mod 2 = 0
(so all map to set/location 0)

Access | Hit or miss | Set 0, entry 0 | Set 0, entry 1 | Set 1, entry 0 | Set 1, entry 1
0      | miss        | Mem[0]         |                |                |
8      | miss        | Mem[0]         | Mem[8]         |                |
0      | hit         | Mem[0]         | Mem[8]         |                |
6      | miss        | Mem[0]         | Mem[6]         |                |
8      | miss        | Mem[8]         | Mem[6]         |                |

Replacement: the least recently used block in the set is evicted.
Example 2: Fully Associative (4-way assoc., 4/4 = 1 set)

Access | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0      | miss        | Mem[0]  |         |         |
8      | miss        | Mem[0]  | Mem[8]  |         |
0      | hit         | Mem[0]  | Mem[8]  |         |
6      | miss        | Mem[0]  | Mem[8]  | Mem[6]  |
8      | hit         | Mem[0]  | Mem[8]  | Mem[6]  |
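The three tables above can be reproduced with a small set-associative LRU simulator. A sketch (the trace and the cache geometries come from Example 2; the code structure and names are mine):

```c
#include <assert.h>

/* Minimal set-associative cache simulator with LRU replacement,
 * sized just large enough for Example 2 (4 one-word blocks total). */
#define MAX_SETS 4
#define MAX_WAYS 4

static int tag[MAX_SETS][MAX_WAYS];
static int age[MAX_SETS][MAX_WAYS];

/* Count misses for a trace of block addresses on a cache with
 * `nsets` sets of `assoc` ways each. */
int count_misses(const int *trace, int n, int nsets, int assoc) {
    int misses = 0, clock = 0;
    for (int s = 0; s < nsets; s++)
        for (int w = 0; w < assoc; w++) { tag[s][w] = -1; age[s][w] = 0; }
    for (int i = 0; i < n; i++) {
        int *t = tag[trace[i] % nsets];      /* index = block MOD nsets */
        int *a = age[trace[i] % nsets];
        int hit = -1, victim = 0;
        for (int w = 0; w < assoc; w++) {
            if (t[w] == trace[i]) hit = w;   /* tag match */
            if (a[w] < a[victim]) victim = w;/* least recently used way */
        }
        if (hit < 0) {                       /* miss: replace LRU block */
            misses++;
            hit = victim;
            t[hit] = trace[i];
        }
        a[hit] = ++clock;                    /* mark as most recently used */
    }
    return misses;
}
```

Replaying the trace 0, 8, 0, 6, 8 with (nsets, assoc) = (4,1), (2,2) and (1,4) reproduces the direct-mapped, 2-way and fully associative tables.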
Cache Fundamentals: The "Three C's"
• Compulsory Misses
  – 1st access to a block: never in the cache
• Capacity Misses
  – Cache cannot contain all the blocks
  – Blocks are discarded and retrieved later
  – Avoided by increasing cache size
• Conflict Misses
  – Too many blocks mapped to the same set
  – Avoided by increasing associativity
• Some add a 4th C: Coherence Misses
Compulsory miss example

for(i=0; i<10; i++) A[i] = f(B[i]);

Cache (@ i=2): B[0] A[0] B[1] A[1] B[2] A[2] – –
Cache (@ i=3):
• B[3], A[3] required
• B[3] never loaded before → loaded into cache (compulsory miss)
• A[3] never loaded before → allocates a new line (compulsory miss)
Capacity miss example

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Cache size: 8 blocks of 1 word, fully associative. Cache contents per iteration (shown for N = 8):

i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses
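The miss classification above can be checked by replaying the loop's access stream against a fully associative, 8-block LRU cache. A sketch (the block numbering for A[] and the code structure are my choices; the loop and cache parameters are from the slide):

```c
#include <assert.h>

/* Replay A[i] = B[i+3] + B[i] (i = 0..7) on a fully associative,
 * 8-block, one-word-per-block LRU cache and classify the misses.
 * Block numbering (mine): B[k] -> block k, A[k] -> block 100+k. */
enum { NBLOCKS = 8 };

static int tag[NBLOCKS], age[NBLOCKS], seen[200], clk;
int compulsory, capacity, write_misses;

static void access_block(int block, int is_write) {
    int hit = -1, victim = 0;
    for (int w = 0; w < NBLOCKS; w++) {
        if (tag[w] == block) hit = w;
        if (age[w] < age[victim]) victim = w;   /* LRU entry */
    }
    if (hit < 0) {
        if (is_write) write_misses++;           /* write-allocate miss */
        else if (!seen[block]) compulsory++;    /* first-ever access */
        else capacity++;                        /* re-fetch after eviction */
        hit = victim;
        tag[hit] = block;
    }
    seen[block] = 1;
    age[hit] = ++clk;                           /* mark most recently used */
}

void run_loop(void) {
    compulsory = capacity = write_misses = clk = 0;
    for (int w = 0; w < NBLOCKS; w++) { tag[w] = -1; age[w] = 0; }
    for (int b = 0; b < 200; b++) seen[b] = 0;
    for (int i = 0; i < 8; i++) {
        access_block(i + 3, 0);      /* read  B[i+3] */
        access_block(i, 0);          /* read  B[i]   */
        access_block(100 + i, 1);    /* write A[i]   */
    }
}
```

The counts match the slide: 11 compulsory read misses (B[0]..B[10]), 5 capacity misses (B[3]..B[7] re-fetched after eviction), and 8 write misses for A[0]..A[7].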
Conflict miss example

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: an 8-line direct-mapped cache and the main-memory layout of A[0..3] and B[0..3][0..9]. A occupies addresses 0..3; B is stored column-major right after A (B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], ...), so column j starts at address 4+4j. For even j the column maps to cache lines 4..7, but for odd j it maps to lines 0..3, on top of A. Although A[i] is read 10 times, A[0] is repeatedly flushed in favour of B[0][j] and reloaded: conflict misses, even though the working set fits the cache.]
"Three C's" vs Cache size [Gee93]

[Figure: relative/absolute misses (0.00–0.15) versus cache size (1–64 KB), with curves for total misses, compulsory misses, capacity misses and conflict misses.]

Data layout may reduce cache misses
Example 1: Capacity & Compulsory miss reduction

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Recall the baseline behaviour (fully associative cache of 8 one-word blocks, trace shown earlier):
• 11 compulsory misses (+8 write misses)
• 5 capacity misses
Fit data in cache with in-place mapping

for(i=0; i<12; i++) A[i] = B[i+3]+B[i];

• Traditional analysis: A[] (12 words) and B[] (15 words) together need max = 27 words, which does not fit the 16-word cache.
• Detailed analysis: with in-place mapping, A[i] can reuse the storage of B[i] (a single array AB[new]), so at most 15 words are needed, which fits the 16-word cache.
Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++) AB[i] = AB[i+3]+AB[i];

Cache contents per iteration:

i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]

• 11 compulsory misses
• 5 cache hits (+8 write hits)
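The transformation above can be sketched in C. The legality argument: AB[i] (the old B[i]) is read for the last time in iteration i, immediately before it is overwritten with A[i], and AB[i+3] is read three iterations before iteration i+3 overwrites it. A minimal sketch (function names are mine) comparing the baseline and the in-place version:

```c
#include <assert.h>

#define N 8

/* Baseline: separate arrays, A[i] = B[i+3] + B[i]. */
void compute_separate(const int B[N + 3], int A[N]) {
    for (int i = 0; i < N; i++)
        A[i] = B[i + 3] + B[i];
}

/* In-place mapping: A[i] overwrites B[i]'s storage in a single
 * array AB. Legal because AB[i] is last read in iteration i, just
 * before the write, and AB[i+3] is read before iteration i+3
 * overwrites it. The working set shrinks from A plus B to one
 * array that fits the cache. */
void compute_in_place(int AB[N + 3]) {
    for (int i = 0; i < N; i++)
        AB[i] = AB[i + 3] + AB[i];
}
```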
Example 2: Conflict miss reduction

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: the same 8-line direct-mapped cache and memory layout as in the conflict miss example: A[i] is read 10 times, but for odd j the column B[0..3][j] maps onto the same cache lines as A, so A[0] is flushed in favour of B[0][j] and each reload is a conflict miss.]
Avoid conflict miss with main memory data layout

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: the same loop, but with gaps left in the main-memory layout of B. With the gaps, A[0] and B[0][j] no longer map to the same cache line for any j: A[i] can be read multiple times without being flushed, and there is no conflict for any j.]

© imec 2001
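One plausible reconstruction of the gap trick, as plain address arithmetic (the concrete word addresses, the column-major placement of B and the gap size of 4 words are my assumptions, not read off the slide): A[0..3] occupies cache lines 0..3, and inserting a 4-word gap after each 4-word column of B keeps every column on lines 4..7, so A and the current column of B never collide.

```c
#include <assert.h>

/* Conflict analysis for a direct-mapped cache of 8 one-word lines.
 * Assumed layout: A[0..3] at word addresses 0..3; B stored
 * column-major right after A, with `gap` padding words after each
 * 4-word column. */
enum { NLINES = 8 };

int line_of(int word_addr) { return word_addr % NLINES; }

/* Word address of B[i][j] for a given inter-column gap. */
int b_addr(int i, int j, int gap) { return 4 + j * (4 + gap) + i; }

/* Count cache-line collisions between A[0..3] and column j of B. */
int conflicts(int j, int gap) {
    int count = 0;
    for (int i = 0; i < 4; i++)
        for (int a = 0; a < 4; a++)
            if (line_of(b_addr(i, j, gap)) == line_of(a))
                count++;
    return count;
}
```

Without a gap, even columns land on lines 4..7 (no conflict) but odd columns land on lines 0..3, colliding with A; with a 4-word gap, no column ever collides.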
Data Layout Organization for Direct Mapped Caches

[Figure: miss rate (0–16%) versus cache size (512 bytes, 1 KB, 2 KB) for three configurations: Initial – Direct Mapped, Data Layout Org – Direct Mapped, and Initial – Fully Associative.]
Conclusions on Data Management
• In multimedia applications, data transfer and storage issues should be explored at the source code level
• DMM method:
  – Reducing the number of external memory accesses
  – Reducing external memory size
  – Trade-offs between internal memory complexity and speed
  – Platform-independent high-level transformations
  – Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
  – Substantial energy reduction
• Although caches are hardware controlled, data layout can largely influence the miss rate