Embedded Computer Architecture
5KK73 TU/e Henk Corporaal
Bart Mesman
Data Memory Management
Part d: Data Layout for Caches
@H.C. Embedded Computer Architecture 2
Data layout for caches
• Caches are hardware controlled
• Therefore: no explicit reuse-copy code is needed in your code!
• What can we still do to improve performance?
• Topics:
  – Cache principles
  – The 3 C's: Compulsory, Capacity and Conflict misses
  – Data layout examples reducing misses
Cache operation (direct mapped cache)
[Figure: a direct-mapped cache (higher level) holding blocks/lines, each with a tag and data, mapped onto main memory (lower level).]
Why does a cache work?
• Principle of Locality
  – Temporal locality: an accessed item has a high probability of being accessed again in the near future
  – Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next
• Check yourself why there is temporal and spatial locality for instruction accesses and for data accesses
  – Regular programs have high instruction and data locality
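The effect of spatial locality can be seen in how a 2-D array is traversed. A minimal sketch (the array size, element type and function names are illustrative, not from the slides): both functions compute the same sum, but the row-major version visits consecutive addresses and uses every fetched cache line fully, while the column-major version jumps a whole row's worth of bytes between accesses.

```c
#include <assert.h>

#define ROWS 64
#define COLS 64

/* Row-major traversal: consecutive accesses touch adjacent addresses,
 * so each fetched cache line is fully used (good spatial locality). */
long sum_row_major(int m[ROWS][COLS]) {
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same row-major array: successive
 * accesses are COLS*sizeof(int) bytes apart, so each access may touch
 * a different cache line. Same result, worse locality. */
long sum_col_major(int m[ROWS][COLS]) {
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Both functions return the same value; only the memory access order, and hence the miss rate, differs.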
Direct mapped cache
[Figure: a direct-mapped cache with 1024 entries, each holding a valid bit, a 20-bit tag and 32-bit data. The 32-bit address splits into tag (bits 31..12), index (bits 11..2) and byte offset (bits 1..0); the index selects an entry, and a hit is signalled when the stored tag matches the address tag and the valid bit is set.]
Direct mapped cache: larger blocks
• Taking advantage of spatial locality:
[Figure: a direct-mapped cache with 4K entries and 16-byte (128-bit, 4-word) blocks. The address splits into a 16-bit tag (bits 31..16), a 12-bit index (bits 15..4), a 2-bit block offset (bits 3..2) and a byte offset (bits 1..0); the block offset drives a 4-to-1 multiplexer selecting one of the four 32-bit words of the block.]
Performance: effect of block size
• Increasing the block (or line) size tends to decrease the miss rate
[Figure: miss rate (0–40%) versus block size (4 to 256 bytes), with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]
Cache principles
[Figure: the CPU issues a virtual or physical address of p bits, split into a tag (p-k-m bits), a k-bit index and an m-bit byte address. The cache holds 2^k lines (blocks) of 2^m bytes; the stored tag of the indexed line is compared against the address tag to produce the hit signal, and misses go to main memory.]
4 Cache Architecture Fundamentals
1. Block placement – Where in the cache will a new block be placed?
2. Block identification – How is a block found in the cache?
3. Block replacement policy – Which block is evicted from the cache?
4. Updating policy – When is a block written from cache to memory?
   – Write-Through vs. Write-Back caches
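The difference between the two updating policies can be sketched with a toy one-block cache (the struct, counter and function names below are hypothetical, not the lecture's notation):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-block cache, just enough to contrast the two policies. */
typedef struct {
    int tag;      /* which memory word the block holds (-1 = empty) */
    int data;
    bool dirty;   /* used only by write-back */
} Block;

int mem_writes;   /* counts writes that reach main memory */

/* Write-through: every store is forwarded to memory immediately. */
void write_through(Block *b, int addr, int value, int *memory) {
    b->tag = addr;
    b->data = value;
    memory[addr] = value;
    mem_writes++;
}

/* Write-back: the store only marks the block dirty; memory is
 * updated once, when the block is evicted. */
void write_back(Block *b, int addr, int value) {
    b->tag = addr;
    b->data = value;
    b->dirty = true;
}

void evict(Block *b, int *memory) {
    if (b->tag >= 0 && b->dirty) {
        memory[b->tag] = b->data;   /* single deferred memory write */
        mem_writes++;
        b->dirty = false;
    }
}
```

Write-through keeps memory up to date at the cost of one memory write per store; write-back coalesces repeated stores to the same block into a single write at eviction time.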
Block placement policies
[Figure: a 16-block memory (blocks 0–15) mapped onto an 8-block cache.
– Fully associative (one-to-many): a memory block may be placed anywhere in the cache.
– Direct mapped (one-to-one): each memory block maps to exactly one cache location ("here only!").]
4-way set-associative cache
• 4 ways
• 256 sets
[Figure: the 32-bit address splits into a 22-bit tag (bits 31..10), an 8-bit index (bits 9..2) and a byte offset. The index selects one of 256 sets; the tag is compared in parallel against all four ways (each with a valid bit, tag and data), and a 4-to-1 multiplexer delivers the data of the hitting way.]
Performance: effect of associativity
[Figure: miss rate (0–15%) versus associativity (one-way, two-way, four-way, eight-way), with one curve per cache size from 1 KB to 128 KB.]
Cache Basics
• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size_in_bytes
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently

Address layout:  | tag | index | block offset |  (tag and index together form the block address)
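With power-of-two parameters the DIV and MOD above reduce to shifts and masks. A small sketch, using illustrative parameters (16-byte blocks, 1024 sets; the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Address decomposition for a cache whose block size and number of
 * sets are powers of two: DIV and MOD become shifts and masks. */
enum { BLOCK_SIZE = 16, NSETS = 1024 };   /* example parameters */

uint32_t block_address(uint32_t byte_address) {
    return byte_address / BLOCK_SIZE;        /* compiles to >> 4 */
}

uint32_t cache_index(uint32_t byte_address) {
    return block_address(byte_address) % NSETS;   /* low 10 bits */
}

uint32_t cache_tag(uint32_t byte_address) {
    return block_address(byte_address) / NSETS;   /* remaining high bits */
}
```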
Example 1
• Assume
  – Cache of 4K blocks, with 4-word block size
  – 32-bit addresses
• Direct mapped (associativity = 1):
  – 16 bytes per block = 2^4 : 4 (2+2) bits for byte and word offsets
  – 32-bit address: 32-4 = 28 bits for index and tag
  – #sets = #blocks / associativity; log2 of 4K = 12 : 12 bits for index
  – Total number of tag bits: (28-12) * 4K = 64 Kbits
• 2-way associative:
  – #sets = #blocks / associativity : 2K sets
  – 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
  – Tag bits: (28-11) * 2 * 2K = 68 Kbits
• 4-way associative:
  – #sets = #blocks / associativity : 1K sets
  – 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
  – Tag bits: (28-10) * 4 * 1K = 72 Kbits
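The three tag-storage figures can be checked mechanically. A sketch (the function names are mine; the parameters are those of Example 1):

```c
#include <assert.h>

/* Integer log2 for exact powers of two. */
int log2i(int x) { int n = 0; while (x > 1) { x /= 2; n++; } return n; }

/* Total tag storage in bits: one tag per block, where
 * tag_bits = addr_bits - offset_bits - index_bits. */
int total_tag_bits(int nblocks, int assoc, int block_size, int addr_bits) {
    int nsets = nblocks / assoc;
    int offset_bits = log2i(block_size);          /* byte + word offset */
    int index_bits = log2i(nsets);
    int tag_bits = addr_bits - offset_bits - index_bits;
    return tag_bits * nblocks;                    /* one tag per block */
}
```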
Example 2

3 caches, each consisting of 4 one-word blocks:
• Cache 1: fully associative
• Cache 2: two-way set associative
• Cache 3: direct mapped

Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Example 2: Direct Mapped

Block address | Cache block
0             | 0 mod 4 = 0
6             | 6 mod 4 = 2
8             | 8 mod 4 = 0

Access | Hit or miss | Location 0 | Location 1 | Location 2 | Location 3
0      | miss        | Mem[0]     |            |            |
8      | miss        | Mem[8]     |            |            |
0      | miss        | Mem[0]     |            |            |
6      | miss        | Mem[0]     |            | Mem[6]     |
8      | miss        | Mem[8]     |            | Mem[6]     |

(Coloured = new entry = miss)
Example 2: 2-way Set Associative (4/2 = 2 sets)

Block address | Cache set
0             | 0 mod 2 = 0
6             | 6 mod 2 = 0
8             | 8 mod 2 = 0
(so all map to set/location 0)

Access | Hit or miss | Set 0, entry 0 | Set 0, entry 1 | Set 1, entry 0 | Set 1, entry 1
0      | miss        | Mem[0]         |                |                |
8      | miss        | Mem[0]         | Mem[8]         |                |
0      | hit         | Mem[0]         | Mem[8]         |                |
6      | miss        | Mem[0]         | Mem[6]         |                |
8      | miss        | Mem[8]         | Mem[6]         |                |

Replacement: the least recently used block in the set is evicted.
Example 2: Fully Associative (4-way assoc., 4/4 = 1 set)

Access | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0      | miss        | Mem[0]  |         |         |
8      | miss        | Mem[0]  | Mem[8]  |         |
0      | hit         | Mem[0]  | Mem[8]  |         |
6      | miss        | Mem[0]  | Mem[8]  | Mem[6]  |
8      | hit         | Mem[0]  | Mem[8]  | Mem[6]  |
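The three tables above can be reproduced with a small set-associative LRU simulator. A sketch (the trace and the cache geometries come from Example 2; the code structure and names are mine):

```c
#include <assert.h>

/* Minimal set-associative cache simulator with LRU replacement,
 * sized just large enough for Example 2 (4 one-word blocks total). */
#define MAX_SETS 4
#define MAX_WAYS 4

static int tag[MAX_SETS][MAX_WAYS];
static int age[MAX_SETS][MAX_WAYS];

/* Count misses for a trace of block addresses on a cache with
 * `nsets` sets of `assoc` ways each. */
int count_misses(const int *trace, int n, int nsets, int assoc) {
    int misses = 0, clock = 0;
    for (int s = 0; s < nsets; s++)
        for (int w = 0; w < assoc; w++) { tag[s][w] = -1; age[s][w] = 0; }
    for (int i = 0; i < n; i++) {
        int *t = tag[trace[i] % nsets];      /* index = block MOD nsets */
        int *a = age[trace[i] % nsets];
        int hit = -1, victim = 0;
        for (int w = 0; w < assoc; w++) {
            if (t[w] == trace[i]) hit = w;   /* tag match */
            if (a[w] < a[victim]) victim = w;/* least recently used way */
        }
        if (hit < 0) {                       /* miss: replace LRU block */
            misses++;
            hit = victim;
            t[hit] = trace[i];
        }
        a[hit] = ++clock;                    /* mark as most recently used */
    }
    return misses;
}
```

Replaying the trace 0, 8, 0, 6, 8 with (nsets, assoc) = (4,1), (2,2) and (1,4) reproduces the direct-mapped, 2-way and fully associative tables.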
Cache Fundamentals: The "Three C's"
• Compulsory Misses
  – 1st access to a block: never in the cache
• Capacity Misses
  – Cache cannot contain all the blocks
  – Blocks are discarded and retrieved later
  – Avoided by increasing cache size
• Conflict Misses
  – Too many blocks mapped to the same set
  – Avoided by increasing associativity
• Some add a 4th C: Coherence Misses
Compulsory miss example

for(i=0; i<10; i++) A[i] = f(B[i]);

Cache (@ i=2): B[0] A[0] B[1] A[1] B[2] A[2] – –
Cache (@ i=3):
• B[3], A[3] required
• B[3] never loaded before → loaded into cache (compulsory miss)
• A[3] never loaded before → allocates a new line (compulsory miss)
Capacity miss example

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Cache size: 8 blocks of 1 word, fully associative. Cache contents per iteration (shown for N = 8):

i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses
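The miss classification above can be checked by replaying the loop's access stream against a fully associative, 8-block LRU cache. A sketch (the block numbering for A[] and the code structure are my choices; the loop and cache parameters are from the slide):

```c
#include <assert.h>

/* Replay A[i] = B[i+3] + B[i] (i = 0..7) on a fully associative,
 * 8-block, one-word-per-block LRU cache and classify the misses.
 * Block numbering (mine): B[k] -> block k, A[k] -> block 100+k. */
enum { NBLOCKS = 8 };

static int tag[NBLOCKS], age[NBLOCKS], seen[200], clk;
int compulsory, capacity, write_misses;

static void access_block(int block, int is_write) {
    int hit = -1, victim = 0;
    for (int w = 0; w < NBLOCKS; w++) {
        if (tag[w] == block) hit = w;
        if (age[w] < age[victim]) victim = w;   /* LRU entry */
    }
    if (hit < 0) {
        if (is_write) write_misses++;           /* write-allocate miss */
        else if (!seen[block]) compulsory++;    /* first-ever access */
        else capacity++;                        /* re-fetch after eviction */
        hit = victim;
        tag[hit] = block;
    }
    seen[block] = 1;
    age[hit] = ++clk;                           /* mark most recently used */
}

void run_loop(void) {
    compulsory = capacity = write_misses = clk = 0;
    for (int w = 0; w < NBLOCKS; w++) { tag[w] = -1; age[w] = 0; }
    for (int b = 0; b < 200; b++) seen[b] = 0;
    for (int i = 0; i < 8; i++) {
        access_block(i + 3, 0);      /* read  B[i+3] */
        access_block(i, 0);          /* read  B[i]   */
        access_block(100 + i, 1);    /* write A[i]   */
    }
}
```

The counts match the slide: 11 compulsory read misses (B[0]..B[10]), 5 capacity misses (B[3]..B[7] re-fetched after eviction), and 8 write misses for A[0]..A[7].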
Conflict miss example

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: an 8-line direct-mapped cache and the main-memory layout of A[0..3] and B[0..3][0..9]. A occupies addresses 0..3; B is stored column-major right after A (B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], ...), so column j starts at address 4+4j. For even j the column maps to cache lines 4..7, but for odd j it maps to lines 0..3, on top of A. Although A[i] is read 10 times, A[0] is repeatedly flushed in favour of B[0][j] and reloaded: conflict misses, even though the working set fits the cache.]
"Three C's" vs Cache size [Gee93]

[Figure: relative/absolute misses (0.00–0.15) versus cache size (1–64 KB), with curves for total misses, compulsory misses, capacity misses and conflict misses.]

Data layout may reduce cache misses
Example 1: Capacity & Compulsory miss reduction

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Recall the baseline behaviour (fully associative cache of 8 one-word blocks, trace shown earlier):
• 11 compulsory misses (+8 write misses)
• 5 capacity misses
Fit data in cache with in-place mapping

for(i=0; i<12; i++) A[i] = B[i+3]+B[i];

• Traditional analysis: A[] (12 words) and B[] (15 words) together need max = 27 words, which does not fit the 16-word cache.
• Detailed analysis: with in-place mapping, A[i] can reuse the storage of B[i] (a single array AB[new]), so at most 15 words are needed, which fits the 16-word cache.
Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++) AB[i] = AB[i+3]+AB[i];

Cache contents per iteration:

i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]

• 11 compulsory misses
• 5 cache hits (+8 write hits)
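The transformation above can be sketched in C. The legality argument: AB[i] (the old B[i]) is read for the last time in iteration i, immediately before it is overwritten with A[i], and AB[i+3] is read three iterations before iteration i+3 overwrites it. A minimal sketch (function names are mine) comparing the baseline and the in-place version:

```c
#include <assert.h>

#define N 8

/* Baseline: separate arrays, A[i] = B[i+3] + B[i]. */
void compute_separate(const int B[N + 3], int A[N]) {
    for (int i = 0; i < N; i++)
        A[i] = B[i + 3] + B[i];
}

/* In-place mapping: A[i] overwrites B[i]'s storage in a single
 * array AB. Legal because AB[i] is last read in iteration i, just
 * before the write, and AB[i+3] is read before iteration i+3
 * overwrites it. The working set shrinks from A plus B to one
 * array that fits the cache. */
void compute_in_place(int AB[N + 3]) {
    for (int i = 0; i < N; i++)
        AB[i] = AB[i + 3] + AB[i];
}
```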
Example 2: Conflict miss reduction

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: the same 8-line direct-mapped cache and memory layout as in the conflict miss example: A[i] is read 10 times, but for odd j the column B[0..3][j] maps onto the same cache lines as A, so A[0] is flushed in favour of B[0][j] and each reload is a conflict miss.]
Avoid conflict miss with main memory data layout

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: the same loop, but with gaps left in the main-memory layout of B. With the gaps, A[0] and B[0][j] no longer map to the same cache line for any j: A[i] can be read multiple times without being flushed, and there is no conflict for any j.]

© imec 2001
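One plausible reconstruction of the gap trick, as plain address arithmetic (the concrete word addresses, the column-major placement of B and the gap size of 4 words are my assumptions, not read off the slide): A[0..3] occupies cache lines 0..3, and inserting a 4-word gap after each 4-word column of B keeps every column on lines 4..7, so A and the current column of B never collide.

```c
#include <assert.h>

/* Conflict analysis for a direct-mapped cache of 8 one-word lines.
 * Assumed layout: A[0..3] at word addresses 0..3; B stored
 * column-major right after A, with `gap` padding words after each
 * 4-word column. */
enum { NLINES = 8 };

int line_of(int word_addr) { return word_addr % NLINES; }

/* Word address of B[i][j] for a given inter-column gap. */
int b_addr(int i, int j, int gap) { return 4 + j * (4 + gap) + i; }

/* Count cache-line collisions between A[0..3] and column j of B. */
int conflicts(int j, int gap) {
    int count = 0;
    for (int i = 0; i < 4; i++)
        for (int a = 0; a < 4; a++)
            if (line_of(b_addr(i, j, gap)) == line_of(a))
                count++;
    return count;
}
```

Without a gap, even columns land on lines 4..7 (no conflict) but odd columns land on lines 0..3, colliding with A; with a 4-word gap, no column ever collides.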
Data Layout Organization for Direct Mapped Caches

[Figure: miss rate (0–16%) versus cache size (512 bytes, 1 KB, 2 KB) for three configurations: Initial – Direct Mapped, Data Layout Org – Direct Mapped, and Initial – Fully Associative.]
Conclusions on Data Management
• In multimedia applications, data transfer and storage issues should be explored at the source code level
• DMM method:
  – Reducing the number of external memory accesses
  – Reducing external memory size
  – Trade-offs between internal memory complexity and speed
  – Platform-independent high-level transformations
  – Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
  – Substantial energy reduction
• Although caches are hardware controlled, data layout can largely influence the miss rate