DARK2 in a nutshell - Uppsala University
Welcome to DARK2 (IT, MN and PhD) - Erik Hagersten
Transcript

Welcome to DARK2 (IT, MN and PhD)

Erik Hagersten, Uppsala University

Dept of Information Technology | www.it.uu.se | © Erik Hagersten | www.docs.uu.se/~eh

DARK2 2004

DARK2 on the web

DARK2, Autumn 2004: Welcome! News, Forms, Schedule, Slides, Papers, Assignments, Reading instructions, Exam, Links

www.it.uu.se/edu/course/homepage/dark2/ht04

Literature: Computer Architecture: A Quantitative Approach (3rd edition)

Lecturers: Erik Hagersten gives most lectures and is responsible for the course. Håkan Zeffer is responsible for the labs and the hand-ins. Jakob Engblom is guest lecturer in embedded systems, Jakob Carlström in network processors, and Jukka Rantakokko in parallel programming.

Mandatory assignments: There are two lab assignments that all participants have to complete before a hard deadline (plus a Microprocessor Report if you are doing the MN2 version).

Optional assignments: There are three optional hand-in assignments: Memory, CPU, Multiprocessors. They give extra credit on the exam.

Examination: Written exam at the end of the course. No books are allowed.

www.it.uu.se/edu/course/homepage/dark2/ht04

DARK2 in a nutshell

1. Memory Systems: caches, VM, DRAM, microbenchmarks, optimizing SW
2. Multiprocessors, TLP: coherence, interconnects, scalability, clusters, ...
3. CPUs, ILP: pipelines, scheduling, superscalars, VLIWs, embedded, ...
4. Future: technology impact, TLP+ILP in the CPU, ...

Lab 1

Run programs in an architecture simulator modeling cache and memory. Study performance when you:

- change the cache model
- change the program

Lab 2

Write your own "microbenchmark" and find out everything about your favorite computer's memory system.
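As a rough illustration of what such a microbenchmark looks like (this sketch is not part of the course material, and pure Python timing is far too noisy for real measurements; the lab would use C): walk a buffer with different strides and time each walk.

```python
import array
import time

def stride_benchmark(size_bytes=1 << 20, strides=(1, 2, 4, 8, 16, 32)):
    """Time one pass over a buffer at each stride.

    On real hardware, time per access jumps when the stride exceeds
    the cache line size or the buffer exceeds a cache level.
    """
    n = size_bytes // 8                       # 8-byte elements
    buf = array.array('q', range(n))
    results = {}
    for stride in strides:
        t0 = time.perf_counter()
        s = 0
        for i in range(0, n, stride):         # one pass over the buffer
            s += buf[i]
        elapsed = time.perf_counter() - t0
        accesses = len(range(0, n, stride))
        results[stride] = elapsed / accesses  # seconds per access
    return results

times = stride_benchmark(size_bytes=1 << 16)
for stride, t in times.items():
    print(f"stride {stride:3d}: {t * 1e9:7.1f} ns/access")
```

Plotting time-per-access against stride and buffer size is exactly how the lab reverse-engineers cache line size, cache sizes, and miss penalties.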

Introduction to Computer Architecture

Erik Hagersten, Uppsala University

What is computer architecture?

"Bridging the gap between programs and transistors"

"Finding the best model to execute the programs", where best = {fast, cheap, energy-efficient, reliable, predictable, ...}

"Only" 20 years ago: APZ 212, "the AXE supercomputer"

APZ 212 marketing brochure quotes:

- "Very compact": 6 times the performance, 1/6th the size, 1/5th the power consumption
- "A breakthrough in computer science"
- "Why more CPU power?" "All the power needed for future development"
- "...800,000 BHCA, should that ever be needed"
- "SPC computer science at its most elegant"
- Using 64 kbit memory chips
- 1500 W power consumption

CPU Improvements

[Figure: relative performance (log scale, 1 to 1000) vs. year, 2000-2015; the historical rate of improvement has been 55%/year, and whether it continues is marked with question marks]

How do we get good performance?

Creating and exploring:

1) Locality
   a) Spatial locality
   b) Temporal locality
   c) Geographical locality

2) Parallelism
   a) Instruction level
   b) Thread level

Execution in a CPU

[Figure: the CPU reads "machine code" and "data" from memory]

Register-based machine. Example: C := A + B

Machine code:
  LD  R1, [A]
  LD  R7, [B]
  ADD R2, R1, R7
  ST  R2, [C]

Data: A = 12, B = 14, C = 26 after execution

[Figure: the program counter (PC) steps through the four instructions; 12 and 14 are loaded into registers, added to give 26, and the result is stored back to C]

How "long" is a CPU cycle?

1982: 5 MHz -> 200 ns -> 60 m (in vacuum)
2002: 3 GHz clock -> 0.3 ns -> 10 cm (in vacuum), 3 mm (on silicon)
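The slide's distances follow directly from the clock frequency and the speed of light; a quick check (assuming propagation at exactly 3.0e8 m/s in vacuum):

```python
# Distance a signal travels during one clock cycle, assuming
# propagation at the speed of light in vacuum (3.0e8 m/s).
C_VACUUM = 3.0e8  # m/s

def distance_per_cycle(freq_hz, speed=C_VACUUM):
    """Metres travelled during one clock period (1 / freq_hz)."""
    return speed / freq_hz

print(distance_per_cycle(5e6))   # 1982, 5 MHz: 60 m
print(distance_per_cycle(3e9))   # 2002, 3 GHz: 0.1 m = 10 cm
```

On silicon the signal velocity is a small fraction of c, which is why the slide's on-chip figure shrinks to millimetres.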

Lifting the CPU hood (simplified...)

[Figure: instructions A, B, C, D flow from memory (Mem) into the CPU]

Pipeline

[Figure: a four-stage pipeline, I R X W, with the register file (Regs); successive slides step instruction A through the stages]

I = Instruction fetch
R = Read register
X = Execute
W = Write register
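The timing of an ideal pipeline can be sketched directly (an illustrative model, not course code): each instruction enters I one cycle after its predecessor, so instruction i occupies stage s during cycle i + s.

```python
# Ideal 4-stage pipeline: instruction i is in stage s at cycle i + s,
# so one instruction completes per cycle once the pipeline is full.
STAGES = ["I", "R", "X", "W"]

def pipeline_schedule(n_instr):
    """Return {(instr, stage): cycle} for an ideal 4-stage pipeline."""
    return {(i, st): i + s
            for i in range(n_instr)
            for s, st in enumerate(STAGES)}

sched = pipeline_schedule(4)
print(sched[(0, "W")], sched[(3, "W")])  # 3 6
```

Four instructions finish in 7 cycles (4 + 4 - 1) instead of 16, which is the throughput win the next slides quantify.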

Register Operations: ADD R1, R2, R3

[Figure: the PC drives instruction fetch (I); registers 2 and 3 are read in the R stage, added (OP: +) in X, and the result is written to register 1 in W]

Pipelined execution, cycle by cycle ("Initially", then Cycle 1 through Cycle 8)

The program being executed:

A: LD RegA, (100 + RegC)
   RegB := RegA + 1
   RegC := RegC + 1
   IF RegC < 100 GOTO A

[Figure: each slide advances the four instructions A-D one cycle through the I R X W stages; the additions (+) execute in X, the compare (<) resolves the branch, and "Branch Next PC" redirects fetch back to A]

Pipelining: a great idea??

+ Great instruction throughput (one per cycle)!
+ Explores instruction-level parallelism (ILP)!
- Requires "enough" "independent" instructions:
  control dependence, data dependence

Data dependency

[Figure: the same loop in the pipeline; RegB := RegA + 1 cannot read RegA until the preceding LD has written it, so the data dependence stalls the pipeline]

Today: ~10-20 stages and 4-6 pipes

[Figure: issue logic feeds several parallel pipelines, each with stages such as I R B M M W]

+ Shorter cycle time (more MHz)
+ Even more ILP (parallel pipelines)
- Branch delay even more expensive
- Even harder to find "enough" independent instructions

Modern MEM: ~150 CPU cycles

[Figure: the same multi-pipeline CPU, with memory now ~250 cycles away]

+ Shorter cycle time (more MHz)
- Branch delay even more expensive
- Memory access even more expensive
- Even harder to find "enough" independent instructions

Connecting to the Memory System

[Figure: the pipeline stages I R X M W, with the pc feeding the Instruction Memory System and the s1/s2/dest/store-data paths feeding the Data Memory System]

Fix: Use a cache

[Figure: a 64 kB cache with ~10-cycle access sits between the pipeline and a 1 GB memory that is ~250 cycles away]

Caches and more caches, or spam, spam, spam and spam

Erik Hagersten, Uppsala University, Sweden
[email protected]

Webster about "cache"

1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT] 1a: a hiding place esp. for concealing and preserving provisions or implements 1b: a secure place of storage 2: something hidden or stored in a cache

Cache knowledge is useful when...

- Designing a new computer
- Writing an optimized program, compiler, or operating system
- Implementing software caching: web caches, proxies, file systems

Memory/storage

2000: 1 ns / 1 ns / 3 ns / 10 ns / 150 ns / 5,000,000 ns across the SRAM -> DRAM -> disk hierarchy, at sizes 1 kB / 64 kB / 4 MB / 1 GB / 1 TB

(1982: 200 ns / 200 ns / 200 ns / 10,000,000 ns)

Address Book Cache: looking for Tommy's telephone number

[Figure: an address book with one tab per letter; the first letter of the name is the indexing function, the rest of the name ("OMMY") is the "address tag", and the telephone number (12345) is the "data"]

One entry per page => a direct-mapped cache with 28 entries

Address Book Cache: looking for Tommy's number

[Figure: index on "T", then compare the stored tag "OMMY" with the rest of the looked-up name (EQ?); a match is a hit and returns 12345]

Address Book Cache: looking for Tomas' number

[Figure: indexing on "T" finds the stored tag "OMMY", which does not match "OMAS"]

Miss! Look up Tomas' number in the telephone directory.

Address Book Cache: looking for Tomas' number

Replace TOMMY's data with TOMAS' data. There is no other choice (direct mapped).

[Figure: the T page now holds tag "OMAS" and data 23457]
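The address-book analogy maps directly onto code. A minimal sketch (illustrative, not course code): a direct-mapped cache indexed by the first letter, with the rest of the name as the tag.

```python
class AddressBookCache:
    """Direct-mapped cache indexed by the first letter of the name.

    The rest of the name is the address tag; a lookup hits only when
    the stored tag matches. A miss replaces the page's single entry.
    """
    def __init__(self):
        self.pages = {}            # index letter -> (tag, data)

    def lookup(self, name, directory):
        index, tag = name[0], name[1:]
        entry = self.pages.get(index)
        if entry and entry[0] == tag:
            return entry[1], "hit"
        data = directory[name]            # miss: consult the directory
        self.pages[index] = (tag, data)   # no other choice: replace
        return data, "miss"

directory = {"TOMMY": 12345, "TOMAS": 23457}
cache = AddressBookCache()
print(cache.lookup("TOMMY", directory))  # (12345, 'miss')
print(cache.lookup("TOMMY", directory))  # (12345, 'hit')
print(cache.lookup("TOMAS", directory))  # (23457, 'miss'), conflict
print(cache.lookup("TOMMY", directory))  # (12345, 'miss'), evicted
```

The last two lookups show the conflict the slides describe: TOMMY and TOMAS fight over the same T page.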

Cache

[Figure: the CPU sends an address to the cache and gets back data (a word) and a hit signal; on a miss the address goes on to memory]

Cache Organization

[Figure: the looked-up name "TOMAS" is split by the indexing function; the entry holds a valid bit (1), the address tag "OMAS" (4 characters), and the data 23457 (5 digits); hit (1) = valid AND tag-match]

Cache Organization (really): 4 kB, direct mapped

[Figure: a 32-bit address identifying a byte in memory, treated as ordinary memory would treat it; it must be split into an index into 1k entries of 4 bytes each, an address tag, and low-order bits, with each entry holding a valid bit (1), the tag, and 32 bits of data; hit = valid AND tag-match. The bit widths are left as question marks.]

What is a good index function?

Cache Organization: 4 kB, direct mapped

[Figure: the 32-bit address splits into a 20-bit tag, a 10-bit index into 1k entries of 4 bytes each, and 2 bits identifying the byte within the word; hit = valid AND (stored tag == address tag)]

Memory overhead: 21/32 = 66%
Latency = SRAM + CMP + AND
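The tag/index/offset split above is pure bit arithmetic. A small sketch (the example address is made up for illustration):

```python
# Split a 32-bit byte address for a 4 kB direct-mapped cache with
# 4-byte entries: 2 offset bits, 10 index bits, 20 tag bits.
OFFSET_BITS, INDEX_BITS = 2, 10

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)          # byte in word
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)          # the rest
    return tag, index, offset

tag, index, offset = split_address(0x1F48)
print(tag, index, offset)   # 1 978 0
```

Concatenating tag, index, and offset reproduces the original address, which is why storing only the 20-bit tag per entry is enough to identify the cached word.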

Cache

[Figure: CPU, cache and memory connected by address/data/hit signals, as before]

Hit: use the data provided by the cache.
Miss: use the data from memory and also store it in the cache.

Cache performance parameters

- Cache "hit rate" [%]
- Cache "miss rate" [%] (= 1 - hit_rate)
- Hit time [CPU cycles]
- Miss time [CPU cycles]
- Hit bandwidth
- Miss bandwidth
- Write strategy
- ...

How to rate architecture performance?

Marketing: frequency

Architecture goodness: CPI = Cycles Per Instruction; IPC = Instructions Per Cycle

Benchmarking: SPEC-fp, SPEC-int, ...; TPC-C, TPC-D, ...

Cache performance example

Assumptions: infinite bandwidth, a perfect CPI = 1.0 CPU, and a 100% instruction-cache hit rate.

Total number of cycles =
  #Instr * ((1 - mem_ratio) * 1 + mem_ratio * avg_mem_access_time)
= #Instr * ((1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time))

CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time)

Example Numbers

CPI = (1 - mem_ratio) + mem_ratio * hit_rate * hit_time + mem_ratio * (1 - hit_rate) * miss_time
       (CPU)            (HIT)                             (MISS)

mem_ratio = 0.25, hit_rate = 0.85, hit_time = 3, miss_time = 100

CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100 = 0.75 + 0.64 + 3.75 = 5.14

What if ...

- Twice as fast CPU        ==> 0.37 + 0.64 + 3.75 = 4.76
- Faster memory (70c)      ==> 0.75 + 0.64 + 2.62 = 4.01
- Improved hit_rate (0.95) ==> 0.75 + 0.71 + 1.25 = 2.71

How to get more effective caches:

- Larger cache (more capacity)
- Cache block size (larger cache lines)
- More placement choice (more associativity)
- Innovative caches (victim, skewed, ...)
- Cache hierarchies (L1, L2, L3, CMR)
- Latency hiding (weaker memory models)
- Latency avoiding (prefetching)
- Cache avoiding (cache bypass)
- Optimized application/compiler
- ...

Why do you miss in a cache?

Mark Hill's three "Cs":
- Compulsory miss (touching data for the first time)
- Capacity miss (the cache is too small)
- Conflict miss (imperfect cache implementation)

(Multiprocessors add:)
- Communication (imposed by communication)
- False sharing (side effect of large cache blocks)

Avoiding Capacity Misses: a huge address book

Lots of pages, one entry per page.

[Figure: a new indexing function uses the first two letters of the name (ÖT, ÖU, ..., ÖÖ); each page holds an "address tag" (the rest of the name, e.g. "LING") and "data" (e.g. 12345)]

One entry per page => a direct-mapped cache with 784 (28 x 28) entries

Cache Organization: 1 MB, direct mapped

[Figure: the 32-bit address splits into a 12-bit tag, an 18-bit index into 256k entries, and 2 bits identifying the byte within the word; hit = valid AND tag-match]

Memory overhead: 13/32 = 40%
Latency = SRAM + CMP + AND

Pros/Cons Large Caches

++ The safest way to get improved hit rate
-- SRAMs are very expensive!!
-- Larger size ==> slower speed (more load on "signals", longer distances)
-- (power consumption)
-- (reliability)

Why do you hit in a cache?

Temporal locality: likely to access the same data again soon.
Spatial locality: likely to access nearby data again soon.

Typical access pattern (inner loop stepping through an array):
A, B, C, A+1, B, C, A+2, B, C, ...
(the A accesses are spatial; the B and C accesses are temporal)

Fetch more than a word: cache blocks. 1 MB, direct mapped, cache line = 16 B

[Figure: the 32-bit address splits into a 12-bit tag, a 16-bit index into 64k entries of 128 bits each, a 2-bit select identifying the word within the cache line (via a 4:1 mux), and 2 bits identifying the byte within the word]

Memory overhead: 13/128 = 10%
Latency = SRAM + CMP + AND

Pros/Cons Large Cache Lines

++ Explores spatial locality
++ Fits well with modern DRAMs (first DRAM access slow, subsequent accesses fast: "page mode")
-- Poor usage of SRAM & bandwidth for some access patterns
-- Higher miss penalty (fix: critical word first)
-- (False sharing in multiprocessors)

[Figure: performance as a function of cache line size]

UART research: (data for a fully associative cache). Thanks: Erik Berg

Cache Conflicts

Typical access pattern (inner loop stepping through an array):
A, B, C, A+1, B, C, A+2, B, C, ...

What if B and C index to the same cache location?
Conflict misses, big time! Potential performance loss 10-100x.

Address Book Cache: two names per page. Index first, then search.

[Figure: the T page now holds two entries, tags "OMMY" (12345) and "OMAS" (23457); both tags are compared (EQ?) in parallel with the looked-up name]

Avoiding conflicts: more associativity. 1 MB, 2-way set-associative, CL = 4 B

How should the select signal be produced?

[Figure: a 17-bit index selects one of 128k "sets"; each set holds two entries whose 13-bit tags are compared in parallel; "logic" combines the two matches into the hit signal and the 2:1 mux select]

Latency = SRAM + CMP + AND + LOGIC + MUX

Pros/Cons Associativity

++ Avoids conflict misses
-- Slower access time
-- More complex implementation (comparators, muxes, ...)
-- Requires more pins (for external SRAM...)

Going all the way...! 1 MB, fully associative, CL = 16 B

[Figure: no index bits at all (0); the 28-bit tag of every one of the 64k entries is compared in parallel (64k comparators), and the matching entry drives a 256k:1 mux that selects the word within the hit cache line]

Fully Associative

- Very expensive
- Only used for small caches

CAM = Content-addressable memory: roughly a fully-associative cache storing key+data. Provide a key to the CAM and get the associated data.

A combination thereof: 1 MB, 2-way, CL = 16 B

[Figure: a 15-bit index selects one of 32k "sets"; each set holds two 128-bit cache lines with 13-bit tags; the tag matches drive the hit logic and, together with 2 word-select bits, an 8:1 mux that picks the word]

Who to replace? Picking a "victim"

- Least recently used (LRU): considered the "best" algorithm (which is not always true...); only practical up to ~4-way
- Not most recently used: remember who used it last; 8-way -> 3 bits/CL
- Pseudo-LRU: based on coarse time stamps; used in the VM system
- Random replacement: can't continuously have "bad luck"...

4-way sub-blocked cache: 1 MB, direct mapped, block = 64 B, sub-block = 16 B

[Figure: a 14-bit index selects one of 16k blocks of 512 bits; each block has a 12-bit tag and four valid bits, one per 128-bit sub-block; 2 address bits pick the sub-block within the block (4:1 mux) and 2 more pick the word within it (16:1 mux)]

Memory overhead: 16/512 = 3%

Pros/Cons Sub-blocking

++ Lowers the memory overhead
++ Avoids problems with false sharing
++ Avoids problems with bandwidth waste
-- Will not explore as much spatial locality
-- Still poor utilization of SRAM for sparse "things"

Replacing dirty cache lines

Write-back: write dirty data back to memory (the next level) at replacement. A "dirty bit" indicates an altered cache line.

Write-through: always write through to the next level (as well); data will never be dirty, so there are no write-backs.

Write Buffer / Store Buffer

- You do not need the old value for a store
- Write-around (no write-allocate in caches) is used for lower-level, smaller caches

[Figure: stores go from the CPU into a write buffer (WB) beside the cache; loads compare their address (==) against the buffered stores as well as the cache]

Innovative cache: Victim cache

Victim cache (VC): a small, fairly associative cache (~10s of entries).
- Lookup: search the cache and the VC in parallel
- Cache replacement: move the victim to the VC and replace in the VC
- VC hit: swap the VC data with the corresponding data in the cache

[Figure: the VC sits beside the cache, between the CPU and memory, with its own address, data and hit signals]

Skewed Associative Cache

[Figure: A, B and C have a three-way conflict; they collide in a 2-way cache, all fit in a 4-way cache, and also all fit in a 2-way skewed cache]

It has been shown that 2-way skewed performs roughly the same as 4-way caches.

Skewed-associative cache: different indexing functions

[Figure: the two ways are indexed by different functions f1 and f2, each computed from more than 18 bits of the 32-bit address; the tags are compared as before and "logic" drives the 2:1 mux]

UART: Elbow cache. Increase "associativity" when needed.

[Figure: on a severe conflict between A, B and C, make room by moving an existing entry to its alternative location]

- Performs roughly the same as an 8-way cache
- Slightly faster
- Uses much less power!!

Cache Hierarchy Latency

A ~300:1 latency ratio between on-chip SRAM and DRAM motivates cache hierarchies:

- L1: small on-chip cache; runs in tandem with the pipeline; small VIPT caches add constraints (more later...)
- L2: large SRAM on-chip; communication latency becomes more important
- L3: off-chip SRAM; huge cache, ~10x faster than DRAM

Cache Hierarchy

[Figure: CPU -> L1$ (on-chip) -> L2$ (on-module) -> L3$ (on-board) -> Memory]

Topology of caches: Harvard Architecture

- The CPU needs a new instruction each cycle
- ~25% of instructions are LD/ST
- Data and instructions have different access patterns

==> Separate D and I first-level caches
==> Unified 2nd- and 3rd-level caches

Common Cache Structure for Servers

[Figure: CPU -> split I$/D$ -> L2$ -> L3$ -> ...]

L1: CL=32B, size=32kB, 4-way, 1ns, split I/D
L2: CL=128B, size=1MB, 8-way, 4ns, unified
L3: CL=128B, size=32MB, 2-way, 15ns, unified

Why do you miss in a cache? (recap)

Mark Hill's three "Cs":
- Compulsory miss (touching data for the first time)
- Capacity miss (the cache is too small)
- Conflict miss (imperfect cache implementation)

(Multiprocessors add:)
- Communication (imposed by communication)
- False sharing (side effect of large cache blocks)

How are we doing?

Creating and exploring:
1) Locality: a) spatial, b) temporal, c) geographical
2) Parallelism: a) instruction level, b) thread level

Memory Technology

Erik Hagersten, Uppsala University, Sweden
[email protected]

Main memory characteristics

Performance of main memory:
- Access time: time from when the address is latched until data is available (~50 ns)
- Cycle time: time between requests (~100 ns)
- Total access time: from the ld instruction to REG valid (~150 ns)

Main memory is built from DRAM (dynamic RAM): 1 transistor/bit ==> more error prone and slow; needs refresh and precharge.

Cache memory is built from SRAM (static RAM): about 4-6 transistors/bit.

DRAM organization

[Figure: a 4 Mbit memory array: a 2048x2048 cell matrix with a row decoder (driven by an 11-bit multiplexed address and RAS), a column latch and column decoder (CAS), and a 4-bit data-out path; each one-bit cell is a capacitance on a bit line gated by a word line]

- The address is multiplexed: Row/Column Address Strobe (RAS/CAS)
- "Thin" organizations (between x16 and x1) to decrease pin load
- Refresh of the memory cells decreases bandwidth
- The bit-error rate creates a need for error correction (ECC)

SRAM organization

[Figure: a 512x512x4 cell matrix with row decoder, column decoder, input buffer and differential amplifiers; address pins A0-A17, I/O pins I/O0-I/O3, and CE/WE/OE control signals]

- The address is typically not multiplexed
- Each cell consists of about 4-6 transistors
- Wider organization (x18 or x36), typically few chips
- Often parity protected (ECC becoming more common)

Error Detection and Correction

Error correction and detection (ECC), e.g. 64 bits of data protected by 8 bits of ECC:
- Protects DRAM and high-availability SRAM applications
- Double-bit error detection ("crash and burn")
- Chip-kill detection (all bits of one chip stuck at all-1 or all-0)
- Single-bit correction
- Needs "memory scrubbing" in order to get good coverage

Parity, e.g. 8 bits of data protected by 1 parity bit:
- Protects SRAM and data paths
- Single-bit "crash and burn" detection
- Not sufficient for large SRAMs today!!
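The parity scheme is the simplest of these codes; a sketch of even parity over 8 data bits (illustrative only; real ECC uses Hamming-style codes that also locate the flipped bit):

```python
def parity_bit(byte):
    """Even-parity bit for 8 data bits: chosen so the total number
    of 1s (data + parity) is even; any single-bit flip is detected."""
    return bin(byte).count("1") & 1

def check(byte, stored_parity):
    """True if the byte is consistent with its stored parity bit."""
    return parity_bit(byte) == stored_parity

word = 0b10110010              # four 1s -> parity bit 0
p = parity_bit(word)
print(p, check(word, p), check(word ^ 0b00001000, p))  # 0 True False
```

Parity detects but cannot correct, and a double-bit flip cancels out undetected, which is why the slide calls it insufficient for large SRAMs today.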

Correcting the Error

- Correction on the fly by hardware: no performance glitch; great for cycle-level redundancy; fixes the problem for now...
- Trap to software: correct the data value and write it back to memory
- Memory scrubber: a kernel process that periodically touches all of memory

Improving main memory performance

- Page mode => faster access within a small distance; improves bandwidth per pin, not time to critical word
- A single wide bank improves access time to the complete cache line
- Multiple banks improve bandwidth

Newer kinds of DRAM

- SDRAM: the memory controller provides a strobe for the next sequential access
- DDR DRAM: transfers data on both clock edges
- RAMBUS: fast unidirectional circular bus; split-transaction address/data; each DRAM device implements RAS/CAS/refresh internally
- CPU and DRAM on the same chip?? (IMEM)...

Physical memory: big endian vs. little endian

[Figure: three examples contrast the two layouts in a 64 MB physical memory:
- Store the value 0x5F in a word: big endian places the msb at the lowest address (00 00 00 5f); little endian places the lsb at the lowest address
- Store the string "Hello": big endian reads "H e l l o" in memory order; little endian shows "o l l e H" within the word
- Numbering the bytes of a word: big endian counts 0 1 2 3 4 5 6 7 from the msb; little endian counts 7 6 5 4 3 2 1 0]
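Python's struct module makes the byte-order difference concrete; packing 0x5F as a 4-byte word in both orders:

```python
import struct

# Pack the value 0x5F as a 4-byte unsigned word in both byte orders.
little = struct.pack("<I", 0x5F)   # '<' : lsb at the lowest address
big = struct.pack(">I", 0x5F)      # '>' : msb at the lowest address
print(little.hex())   # 5f000000
print(big.hex())      # 0000005f
```

The same trick with a string shows the "olleH" effect: only multi-byte quantities are affected, single bytes land in the same place either way.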

Virtual Memory System

Erik Hagersten, Uppsala University, Sweden
[email protected]

Physical Memory

[Figure: a program is loaded into a 64 MB physical memory, backed by disk]

Virtual and Physical Memory

[Figure: two contexts A and B each see their own 4 GB virtual address space with text, data, heap and stack segments; these are mapped onto a 64 MB physical memory (and its caches $1, $2), with the rest on disk]

Translation & Protection

[Figure: the same two contexts, with per-segment protection bits checked at translation, e.g. text is R (read-only) while data, heap and stack are RW; virtual pages map to physical memory or disk]

Virtual memory — parameters
Compared to first-level cache parameters:

Parameter         | First-level cache   | Virtual memory
Block (page) size | 16-128 bytes        | 4K-64K bytes
Hit time          | 1-2 clock cycles    | 40-100 clock cycles
Miss penalty      | 8-100 clock cycles  | 700K-6000K clock cycles
 (Access time)    | (6-60 clock cycles) | (500K-4000K clock cycles)
 (Transfer time)  | (2-40 clock cycles) | (200K-2000K clock cycles)
Miss rate         | 0.5%-10%            | 0.00001%-0.001%
Data memory size  | 16 Kbyte - 1 Mbyte  | 16 Mbyte - 8 Gbyte

• Replacement in cache is handled by HW; replacement in VM is handled by SW
• VM hit latency very low (often zero cycles)
• VM miss latency huge (several kinds of misses)
• Allocation size is one "page" (4 kB and up)


VM: Block placement

Where can a block (page) be placed in main memory? What is the organization of the VM?

• The high miss penalty makes SW solutions implementing a fully associative address mapping feasible at page faults
• A page from disk may occupy any page frame in PA
• Some restrictions can be helpful (page coloring)


VM: Block identification

Use a page table stored in main memory:
• Suppose 8 Kbyte pages and a 48-bit virtual address
• The page table then occupies 2^48/2^13 * 4 B = 2^37 B = 128 GB!!!
• Solutions:
  • Only one entry per physical page is needed
  • Multi-level page table (dynamic)
  • Inverted page table (~hashing)


Address translation
Multi-level table: the Alpha 21064

• kseg: kernel segment, used by the OS; does not use virtual memory
• seg1: user segment 1, used for the stack
• seg0: user segment 0, used for instructions, static data, and the heap
• The segment is selected by bits 62 and 63 of the address
• PTE: page table entry (translation & protection)


Protection mechanisms

The address translation mechanism can be used to provide memory protection:
• Use protection attribute bits for each page
• Stored in the page table entry (PTE) (and TLB…)
• Each page gets its own per-process protection
• Violations detected during the address translation cause exceptions (i.e., SW trap)
• Supervisor/user modes are necessary to prevent user processes from changing e.g. PTEs


Fast address translation

How do we avoid three extra memory references for each original memory reference?
• Store the most commonly used address translations in a cache -- the Translation Look-aside Buffer (TLB)
• ==> The caches rear their ugly faces again!

[Figure: the processor's VA goes through a TLB lookup to a PA, which addresses the cache and main memory; the translations themselves live in memory.]


Do we need a fast TLB?

Why do a TLB lookup for every L1 access? Why not cache virtual addresses instead?
• Move the TLB to the other side of the cache
• It is only needed for finding stuff in memory anyhow
• The TLB can be made larger and slower -- or can it?

[Figure: the TLB moved between a virtually addressed cache and main memory.]


Aliasing problem

The same physical page may be accessed using different virtual addresses:
• A virtual cache will cause confusion -- a write by one process may not be observed
• Flushing the cache on each process switch is slow (and may only help partly)
• ==> VIPT (Virtually Indexed, Physically Tagged) is the answer
  • Direct-mapped cache no larger than a page
  • No more sets than there are cache lines on a page + logic
  • Page coloring can be used to guarantee correspondence between more PA and VA bits (e.g., Sun Microsystems)


Virtually Indexed, Physically Tagged = VIPT

• Have to guarantee that all aliases have the same index
• L1_cache_size < (page_size * associativity)
• Page coloring can help further

[Figure: the virtual address indexes the cache while the TLB translates to a PA in parallel; the PA is compared against the cache's address tag to detect a hit.]


What is the capacity of the TLB?

• Typical TLB size = 0.5 - 2 kB
• Each translation entry 4 - 8 B ==> 32 - 500 entries
• Typical page size = 4 kB - 16 kB
• TLB reach = 0.1 MB - 8 MB

FIX:
• Multiple page sizes, e.g., 8 kB and 8 MB
• TSB -- a direct-mapped translation in memory as a "second-level TLB"


VM: Page replacement

Most important: minimize the number of page faults.

Page replacement strategies:
• FIFO -- First-In-First-Out
• LRU -- Least Recently Used
• Approximation to LRU:
  • Each page has a reference bit that is set on a reference
  • The OS periodically resets the reference bits
  • When a page is replaced, a page with a reference bit that is not set is chosen


So far…

[Figure: the CPU accesses instruction and data L1 caches and a unified L2 backed by memory; a TLB (translation cache) is filled from page tables (PT) in memory on a TLB miss, and a page-fault handler brings pages in from disk.]


Adding TSB (software TLB cache)

[Figure: as before, but a TLB miss is first looked up in the TSB in memory; only a TSB miss invokes the software TLB-fill/page-fault handler that walks the page tables.]


VM: Write strategy

Write back or write through?

Write back! Write through is impossible to use:
• Too long access time to disk
• The write buffer would need to be prohibitively large
• The I/O system would need an extremely high bandwidth


VM dictionary

Virtual memory system | The "cache" language
Virtual address       | ~Cache address
Physical address      | ~Cache location
Page                  | ~Huge cache block
Page fault            | ~Extremely painful $ miss
Page-fault handler    | ~The software filling the $
Page-out              | ~Write-back if dirty


Putting it all together

[Figure: the full hierarchy -- CPU with instruction/data L1 caches and I/D TLBs, a unified L2, memory, and disk; TLB misses are served via the TSB and page tables, page faults from disk. Access times shown range from 1-2 ns and 2-4 ns (L1/TLB) through 10-20 ns (L2), 150-500 ns (memory and in-memory translations), to 2-10 ms (disk).]


Summary

Cache memories:
• HW management
• Separate instruction and data caches permit simultaneous instruction fetch and data access
• Four questions: block placement, block identification, block replacement, write strategy

Virtual memory:
• Software management
• Very high miss penalty => miss rate must be very low
• Also supports: memory protection, multiprogramming


Caches Everywhere…

• D cache
• I cache
• L2 cache
• L3 cache
• ITLB
• DTLB
• TSB
• Virtual memory system
• Branch predictors
• Directory cache
• …

Exploring the Memory of a Computer System

Erik Hagersten
Uppsala University, Sweden

[email protected]


Micro Benchmark Signature

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */


Micro Benchmark Signature

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */

[Figure: measured signature -- average time per access (ns, 0-700) vs. stride (4 B to 4 M), one curve per array size from 16 K to 8 M.]


Stepping through the array

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */

[Figure: which array elements are touched, starting at 0, for ArraySize = 16, Stride = 4; ArraySize = 32, Stride = 4; ArraySize = 16, Stride = 8; ArraySize = 32, Stride = 8.]


Micro Benchmark Signature

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */

[Figure: the same signature with the curves labeled by array size: ArraySize = 16 kB, 32-256 kB, 512 kB, and 8 MB.]


Micro Benchmark Signature

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */

[Figure: the signature annotated with what each region reveals:]
• L1$ hit
• L2$ hit = 40 ns
• Mem = 300 ns
• Mem + TLB miss
• L2$ + TLB miss
• L1$ block size = 16 B
• L2$ block size = 64 B
• Page size = 8 k ==> #TLB entries = 32-64 (56 normal + 8 large)


Twice as large L2 cache…

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */

[Figure: the signature with a twice-as-large L2 -- the ArraySize = 1 M curve now behaves like an L2 hit, moving the L2/memory boundary from 512 kB to 1 MB.]


Twice as large TLB…

for (times = 0; times < Max; times++)        /* many times */
  for (i = 0; i < ArraySize; i = i + Stride)
    dummy = A[i];                            /* touch an item in the array */

[Figure: the same signature, for reasoning about a twice-as-large TLB -- the TLB-miss regions shift, since twice as many pages can be mapped at once.]


SGI Challenge Memory System Performance

[Figure: measured average access time (ns, 0-1,500) vs. stride (4 B to 4 M) for array sizes 16 K to 8 M on the SGI Challenge, with the L2, MEM, and TLB regions marked.]


Can software help us?

How are we doing? Creating and exploring:
1) Locality
  a) Spatial locality
  b) Temporal locality
  c) Geographical locality
2) Parallelism
  a) Instruction level
  b) Thread level

Optimizing for cache performance

Erik Hagersten
Uppsala University, Sweden

[email protected]


What is the potential gain?

• Latency difference between L1$ and mem: ~50x
• Bandwidth difference between L1$ and mem: ~20x
• Repeated TLB misses add a factor of ~2-3x
• Executing from L1$ instead of from mem ==> 50-150x improvement
• At least a factor of 2-4x is within reach


Optimizing for cache performance

• Keep the active footprint small
• Use the entire cache line once it has been brought into the cache
• Fetch a cache line prior to its usage
• Let the CPU that already has the data in its cache do the job...


Merging arrays (align)

/* Unoptimized */
int info1[MAX];
int info2[MAX];
int key[MAX];

/* Optimized */
struct merged_a {
  align ...                  /* alignment directive, compiler-specific */
  int key, info1, info2;
};
struct merged_a merge_array[size];

/* AltOptimized: pad the struct to a power-of-two size instead */
struct merged_a {
  int key, info1, info2, dummy;
};
struct merged_a merge_array[size];


Loop Interchange (for C)

/* Unoptimized */
for (j = 0; j < N; j = j + 1)
  for (i = 0; i < N; i = i + 1)
    x[i][j] = 2 * x[i][j];

/* Optimized */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    x[i][j] = 2 * x[i][j];

(FORTRAN: the other way around!)


Merging arrays

/* Unoptimized */
int record[MAX];
int key[MAX];

/* Optimized */
struct merge {
  int record;
  int key;
};
struct merge merge_array[size];


Loop Merging

/* Unoptimized */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    a[i][j] = 2 * b[i][j];
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    c[i][j] = K * b[i][j] + d[i][j]/2;

/* Optimized */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    a[i][j] = 2 * b[i][j];
    c[i][j] = K * b[i][j] + d[i][j]/2;
  }


Padding of data structures

[Figure: a 1024 × 1024 array, indexed by i and j.]


Padding of data structures

[Figure: the same array with its leading dimension padded to 1024 + padding, so successive rows no longer map to the same cache sets.]


Blocking

/* Unoptimized ARRAY: x = y * z */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    r = 0;
    for (k = 0; k < N; k = k + 1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  };

[Figure: access patterns -- X indexed by (i, j), Y by (i, k), Z by (k, j).]


Blocking

/* Optimized ARRAY: X = Y * Z */
for (jj = 0; jj < N; jj = jj + B)
  for (kk = 0; kk < N; kk = kk + B)
    for (i = 0; i < N; i = i + 1)
      for (j = jj; j < min(jj+B,N); j = j + 1) {
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k + 1)
          r = r + y[i][k] * z[k][j];
        x[i][j] += r;
      };

[Figure: the first and second blocks of Y and Z combine into a partial solution in X.]


Prefetching

/* Unoptimized */
for (j = 0; j < N; j = j + 1)
  for (i = 0; i < N; i = i + 1)
    x[i][j] = 2 * x[i][j];

/* Optimized (assuming 4 elements per cache line) */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 4) {
    PREFETCH x[i][j+8];        /* fetch two cache lines ahead */
    x[i][j+0] = 2 * x[i][j+0];
    x[i][j+1] = 2 * x[i][j+1];
    x[i][j+2] = 2 * x[i][j+2];
    x[i][j+3] = 2 * x[i][j+3];
  }


Cache Affinity

• Schedule the process on the processor it last ran on
• Its caches are already warmed up...


How are we doing?

Creating and exploring:
1) Locality
  a) Spatial locality
  b) Temporal locality
  c) Geographical locality
2) Parallelism
  a) Instruction level
  b) Thread level


Lab 1

• Run programs in an architecture simulator modelling the cache and memory
• Study performance when you:
  • change the cache model
  • change the program


Lab 2

Write your own "microbenchmark" and find out everything about your favorite computer's memory system.

