Topics
14:332:331 - The Memory Hierarchy
A Typical Memory Hierarchy
Architectures need to create the illusion of unlimited and fast memory. To do so, we rely on the fact that applications do not access all of program memory or data memory at once; rather, programs access a relatively small portion of their address space at any moment in time.
Temporal locality - a referenced item in memory tends to be referenced again soon (loops).
Spatial locality - if an item is referenced, items adjacent to it tend to be referenced soon (arrays).
[Figure: on-chip components (RegFile, Instr Cache, Data Cache, ITLB, DTLB, Control, Datapath), backed by a Second-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk).]
Speed: 0.5 ns (RegFile), 1 ns (L1 caches), 5 ns (L2 cache), 50-70 ns (DRAM), 5,000,000 ns (disk)
Size: 128 B, 64 KB, 256 KB, 4 GB, TBs
Cost/GB: highest, $10,000, $100, $1, lowest
These principles allow memory to be organized as a hierarchy of multiple levels with different access speeds and sizes.
A Typical Memory Hierarchy
Since main memory is much slower, to avoid pipeline stalls the data and instructions that the CPU will need soon should be moved into the cache(s). Even though memory consists of multiple levels, data is copied only between two adjacent levels at a time. Within each level, the unit of information that is present or not is called a block. Blocks can be either single-word (32 bits wide) or multiple-word.
How is the Hierarchy Managed?
When the data that the CPU needs is present in the cache, it is a hit. When the needed data is not present, it is a miss, and misses carry penalties: on a miss, the pipeline is frozen until the data is fetched from main memory, which hurts performance. Data must always be present in the lowest level of the hierarchy.
Temporal locality - keep the most recently accessed data items closer to the processor.
Spatial locality - move blocks consisting of contiguous words to the upper levels.
How do we apply the principle of locality?
[Figure: the processor exchanges words with an upper-level memory holding Blk X; the upper level exchanges blocks with a lower-level memory holding Blk Y. Data cannot be present in the upper level if it is not present in the lower level.]
Hit: data needed by the CPU appears in some block in the upper level (Blk X).
Hit Rate: the fraction of accesses found in the upper level.
Hit Time: time to access the upper level = RAM access time + time to determine whether the access is a hit or a miss.
Miss: data needs to be retrieved from a block in the lower level (Blk Y).
Miss Rate = 1 - Hit Rate.
Miss Penalty: time to replace a block in the upper level with a block from the lower level + time to deliver this block to the processor.
Hit Time << Miss Penalty.
In general, Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty.
Recall that CPU time = CPU execution time (including hits) + memory-stall time, where
Memory-stall time = (read-stall cycles + write-stall cycles) × clock cycle time
Memory-stall cycles = Reads/program × read miss rate × miss penalty + Writes/program × write miss rate × miss penalty (ignoring write-buffer stalls), or simply
Memory-stall cycles = Misses/program × miss penalty.
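These formulas translate directly into code; a minimal sketch (the 5% miss rate, 100-cycle penalty, and 0.03 misses/instruction are assumed example values, not from the slides):

```python
# Average Memory Access Time and memory-stall cycles, in clock cycles.
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

def mem_stall_cycles(misses_per_instr, miss_penalty):
    """Memory-stall cycles = Misses/program x Miss penalty (per instruction here)."""
    return misses_per_instr * miss_penalty

# Assumed example: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))            # 6.0 cycles per access on average
print(mem_stall_cycles(0.03, 100))   # 3.0 stall cycles per instruction
```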
Miss Penalty
What is the degrading influence of misses (stalls)?

Performance(perfect cache) / Performance(cache with stalls)
= CPU time with stalls / CPU time without stalls
= (CPU exec time + memory-stall time) / CPU exec time
= (#Instr × CPI_withstall × clock cycle time) / (#Instr × CPI_perfect × clock cycle time)
= CPI_withstall / CPI_perfect

CPI_withstall depends on the application:
CPI_withstall = CPI_perfect + CPI_misses = CPI_perfect + CPI_miss.instr + CPI_miss.data

If the miss penalty is 100 cycles, the instruction miss rate is 2%, the data-cache miss rate is 4%, the data memory access rate is 36% (gcc example), and CPI_perfect is 2, then
CPI_misses = (2% + 36% × 4%) × 100 = 3.44
CPI_withstall / CPI_perfect = (2 + 3.44) / 2 = 2.72, i.e., a 172% degradation.
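The gcc numbers above can be checked with a few lines:

```python
# Recomputing the gcc example: CPI_perfect = 2, 100-cycle miss penalty,
# 2% instruction miss rate, 4% data miss rate, 36% of instructions access data.
miss_penalty = 100
cpi_perfect = 2
cpi_misses = (0.02 + 0.36 * 0.04) * miss_penalty   # extra cycles per instruction
cpi_with_stall = cpi_perfect + cpi_misses
degradation = cpi_with_stall / cpi_perfect         # slowdown vs. a perfect cache
print(cpi_misses, degradation)                     # 3.44 and 2.72
```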
Miss Penalty
Increasing the clock rate will not solve the problem - doubling the clock rate, for example, means the miss penalty goes from 100 to 200 cycles. In that case:

Performance(fast clock, with stalls) / Performance(slow clock, with stalls)
= (IC × CPI_stall,slow × clock cycle time) / (IC × CPI_stall,fast × clock cycle time / 2)
= [(2% + 36% × 4%) × 100 + 2] × 2 / [(2% + 36% × 4%) × 200 + 2]
= 10.88 / 8.88 = 1.23

Performance grows 23%, not 100%! We need a way to reduce both the miss rate (%) and the miss penalty (100-200 cycles). The miss penalty is reduced by using a multi-level cache: a miss in the primary cache means data is retrieved from the secondary cache (faster - fewer cycles) instead of main memory. Reducing the miss rate depends on the cache architecture, as we will see shortly.
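A rough sketch of why a second-level cache shrinks the effective penalty; the L1/L2 numbers here (10-cycle L2 access, 25% L2 miss rate) are assumed for illustration only:

```python
# Two-level cache effect on Average Memory Access Time (illustrative numbers).
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    # On an L1 miss we pay the L2 access time; only an L2 miss goes to DRAM.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)

one_level = 1 + 0.05 * 100                          # L1 only: 6.0 cycles
two_level = amat_two_level(1, 0.05, 10, 0.25, 100)  # 1 + 0.05*(10 + 25) = 2.75
print(one_level, two_level)
```

With these assumed rates, most L1 misses are caught by the faster L2 instead of paying the full 100-cycle DRAM penalty.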
The Simplest Cache
The simplest cache has blocks of one-word size. Direct-mapped cache - each memory location (32-bit memory address) maps to exactly one cache location. But main memory is much larger than the cache, so several memory addresses map to one cache location.
Question 1: How do we know whether the requested word is in the cache or not?
Question 2: How do we know if the data in the cache corresponds to the requested word or not?
Question 3: How do we know if the data found in the cache is valid?
Each block in the cache is indexed, allowing each block to be addressed when the CPU is looking for instructions or data stored in the cache. A tag identifies which memory location corresponds to that particular block in the cache. The tag contains the upper portion of the address, while the lower portion is used as the index. Bits 0 and 1 (the byte offset) are not used. The first bit in each cache block is a valid bit, which tells the cache controller whether the data in that block is valid or not.
Cache - example
The cache is initially empty; all valid bits are off.
CPU requests address 10110 - miss
CPU requests address 11010 - miss
CPU requests address 10110 - hit
CPU requests address 11010 - hit
CPU requests address 10000 - miss
CPU requests address 00011 - miss
CPU requests address 10000 - hit
CPU requests address 10010 - miss
Temporal locality - recently accessed words replace less recently accessed words.
On an instruction miss, the control unit stalls the pipeline, PC-4 (the address of the instruction that missed) is sent to memory, memory performs the read, the data is placed in the cache slot selected by the lower (index) bits, the tag field is written with the upper address bits, the valid bit is set to 1, and the instruction is re-fetched.
MIPS Direct Mapped Cache Example
[Figure: the memory address (bits 31-0) is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 cache entries, each holding a valid bit, a 20-bit tag, and a 32-bit data word. The stored tag is compared with the address tag; a match with valid = 1 is a hit, and the 32-bit data word is returned.]
Example: the series of memory address references, given as word addresses, is 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assume a direct-mapped cache with 16 one-word blocks that is initially empty. Label each reference in the list as a hit or a miss and show the final contents of the cache.

Reference  Binary   Hit/Miss
1          00 0001  Miss
4          00 0100  Miss
8          00 1000  Miss
5          00 0101  Miss
20         01 0100  Miss
17         01 0001  Miss
19         01 0011  Miss
56         11 1000  Miss
9          00 1001  Miss
11         00 1011  Miss
4          00 0100  Miss
43         10 1011  Miss
5          00 0101  Hit
6          00 0110  Miss
9          00 1001  Hit
17         01 0001  Hit

Final contents by index (showing replacements):
0001: 1, then 17
0011: 19
0100: 4, then 20, then 4
0101: 5
0110: 6
1000: 8, then 56
1001: 9
1011: 11, then 43
All other blocks remain invalid.
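A small direct-mapped cache simulator (a hypothetical helper; word addresses, with the full address standing in for the tag) can verify the table above:

```python
# Direct-mapped cache simulation: 16 one-word blocks, index = address mod 16.
def simulate(refs, num_blocks=16):
    cache = {}   # index -> word address currently stored in that block
    results = []
    for addr in refs:
        index = addr % num_blocks
        if cache.get(index) == addr:   # same address -> tag match -> hit
            results.append("Hit")
        else:
            results.append("Miss")
            cache[index] = addr        # replace the block on a miss
    return results, cache

refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
results, final = simulate(refs)
print(results.count("Hit"), results.count("Miss"))  # 3 hits, 13 misses
```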
Cache size
How large does the cache need to be? Each row in the cache consists of the valid bit + tag bits + data bits. Assuming n index bits, and thus 2^n rows in the cache,
cache size = 2^n × [32 + (32 - n - 2) + 1] = 2^n × (63 - n) bits.
For 64 KB of data in a cache with 1-word blocks: since 1 word = 4 bytes, the cache must hold 16K words = 16K blocks; 2^14 = 16,384, so n = 14. The cache size is 2^14 × (63 - 14) = 16,384 × 49 = 802,816 bits ≈ 100 KB.
For 256 KB of data with 1-word blocks: 64K words, n = 16, so the total size is 2^16 × (63 - 16) = 65,536 × 47 = 3,080,192 bits ≈ 385 KB.
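The 2^n × (63 - n) formula is easy to check:

```python
# Cache size in bits for a 32-bit machine with 1-word blocks and n index bits:
# 2^n rows of (1 valid bit + (32 - n - 2) tag bits + 32 data bits).
def cache_bits(n):
    return 2**n * (32 + (32 - n - 2) + 1)   # = 2^n * (63 - n)

print(cache_bits(14))  # 802816 bits for 64 KB of data
print(cache_bits(16))  # 3080192 bits for 256 KB of data
```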
[Figure: a 64 KB cache with 16-byte (4-word) blocks. The address splits into a 16-bit tag, a 12-bit index, a 2-bit block offset selecting one of the four words, and a 2-bit byte offset. A tag match with valid = 1 produces a hit and a 32-bit data word.]
[Figure: a 16 KB cache with 16-word (64-byte) blocks. The address splits into an 18-bit tag, an 8-bit index, a 4-bit block offset selecting one of the 16 words, and a 2-bit byte offset. A tag match with valid = 1 produces a hit and a 32-bit data word.]
Cache size
The same quantity of data can be accommodated by a cache that has larger blocks (4 words each) but fewer index bits. Using several words per block improves the miss rate vs. single-word blocks - spatial locality. For the same application, gcc's 5.4% miss rate (1-word blocks) goes down to 1.9% (4-word blocks); for spice the miss rate goes from 1.2% (1-word blocks) to 0.4% (4-word blocks). However, the miss penalty increases, since now 4 words must be loaded at once (more cycles spent in a stall). Early restart - deliver the requested word first (works for the instruction cache if memory can deliver one instruction per cycle).
Cache size
If blocks become too large relative to the cache size, the number of blocks for a given cache size becomes too small and the miss rate goes back up.
[Figure: miss rate vs. block size for 4 KB, 16 KB, 64 KB, and 256 KB caches, based on SPEC92.]
Handling writes
When the CPU writes data to the cache, it should also be written to main memory to keep the two consistent (called write-through). Writing to main memory is slow and reduces performance (e.g., a write to main memory may take 100 cycles; if 10% of instructions are sw, CPI becomes 1 + 100 × 0.1 = 11 cycles/instruction vs. 1). Thus we use a write buffer, which should be several words deep to absorb write "bursts". To avoid buffer overflow (which corrupts data), the rate at which main memory drains the buffer must exceed the rate at which the CPU fills it. An alternative is write-back: write to memory only when a cache block is being replaced.
Memory systems that support caches
DRAMs are designed to increase density, not access time. To reduce the miss penalty we need to change the memory access design to increase throughput. Three organizations:
- One-word-wide memory: sequential access to the words in a block.
- Wide memory: parallel access to all words in a block.
- Interleaved memory: simultaneous memory reads across banks.
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
Memory Systems that Support Caches
One-word-wide organization (one-word-wide bus and one-word-wide memory): the CPU and cache are on-chip, connected to the DRAM memory by a bus carrying 32-bit data and a 32-bit address per cycle. Assume:
- 1 clock cycle (2 ns) to send the address
- 25 clock cycles (50 ns) for the DRAM cycle time
- 1 clock cycle (2 ns) to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.
One Word Wide Memory Organization
If the block size is one word, then on a cache miss the pipeline stalls for the number of cycles required to return one data word from memory:
1 cycle to send the address + 25 cycles to read the DRAM + 1 cycle to return the data = 27 clock cycles of miss penalty.
Bandwidth for a single miss: 4 bytes / 27 cycles = 0.148 bytes per clock cycle.
One Word Wide Memory Organization
What if the block size were four words? With a one-word-wide memory, each word needs its own 25-cycle DRAM read:
1 cycle to send the first address + 4 × 25 = 100 cycles to read the DRAM + 1 cycle to return the last data word = 102 clock cycles of miss penalty.
Bandwidth for a single miss: (4 × 4 bytes) / 102 cycles = 0.157 bytes per clock cycle.
Interleaved Memory Organization
With four memory banks (bank 0 through bank 3) on the bus, the four 25-cycle DRAM reads overlap. For a block size of four words:
1 cycle to send the first address + 25 + 3 = 28 cycles to read the DRAMs + 1 cycle to return the last data word = 30 clock cycles of miss penalty.
Bandwidth for a single miss: (4 × 4 bytes) / 30 cycles = 0.533 bytes per clock cycle (4.264 bits/clock cycle).
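The three miss penalties can be recomputed, following the slides' accounting (1 cycle for the address, 25-cycle DRAM reads, 1 cycle to return the last word):

```python
# Miss penalties for the memory organizations above, in clock cycles.
def one_word_wide(words):
    return 1 + words * 25 + 1        # DRAM reads fully serialized on a 1-word bus

def interleaved(words):
    return 1 + 25 + (words - 1) + 1  # 4 banks overlap; later words arrive 1 cycle apart

print(one_word_wide(1))   # 27 cycles: 1-word blocks
print(one_word_wide(4))   # 102 cycles: 4-word blocks, one-word-wide memory
print(interleaved(4))     # 30 cycles: 4-word blocks, interleaved banks
print(round(16 / interleaved(4), 3))  # 0.533 bytes per clock cycle
```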
Further Improvements to Memory Organization (DDR SDRAMs)
An external clock (e.g., 300 MHz) synchronizes memory addresses. Example: a 4 Mbit DRAM outputs one bit from the array, using 2048 column latches and one multiplexor. An SDRAM is given the starting address and a burst length of 2/4/8, so successive addresses need not be provided. DDR (double data rate) transfers data on both the rising and falling edges of the external clock. In 1980, DRAMs were 64 Kbit with a 150 ns column access to an existing row; in 2004, DRAMs were 1024 Mbit with a 3 ns column access to an existing row.
Further Improvements - two-level cache
The figure shows the AMD Athlon and Duron processor architecture. Two-level caches allow the L1 cache to be smaller, which improves the hit time, since smaller caches are faster. The L2 cache is larger and its access time is less critical, so it uses a larger block size. L2 is accessed whenever a miss occurs in L1, which reduces the L1 miss penalty dramatically. L2 also stores the contents of the "victim buffer" - data evicted from the L1 cache when an L1 miss occurs.
Reducing Cache Misses through Associativity
Recall that a direct-mapped cache allows one memory location to map to only one block in the cache (using tags) - it needs only one comparator. In a fully associative cache, a block in memory can map to any block in the cache, so all entries must be searched; this is done in parallel, with one comparator per cache block - expensive in hardware, so it works only for small caches. In between the two extremes are set-associative caches: a block in memory maps to exactly one set of blocks, but can occupy any position within that set.
Reducing Cache Misses through Associativity
An n-way set-associative cache has sets of n blocks each. All blocks in the set have to be searched, which reduces the number of comparators to n.
- One-way set-associative (same as direct-mapped)
- Two-way set-associative
- Four-way set-associative
- Eight-way set-associative (the same as fully associative for this eight-block example)
As associativity increases the miss rate decreases (1-way: 10.3%, 8-way: 8.1% data miss rate), but the hit time increases.
For set-associative caches, any doubling of associativity decreases the number of index bits by one and increases the number of tag bits by one. For a fully associative cache there are no index bits, since there is only one set.
[Figure: a 4-way set-associative cache with a 22-bit tag and an 8-bit index; the four blocks of the selected set are checked in parallel by four comparators, and a multiplexor selects the 32-bit data word of the matching way.]
Recall the Direct Mapped Cache
[Figure: the earlier direct-mapped cache - 20-bit tag, 10-bit index (1024 entries), 2-bit byte offset, one comparator, 32-bit data.]
It had 20 tag bits vs. 22 for the 4-way set-associative cache, and 10 index bits vs. 8 for the 4-way set-associative cache. How many tag and index bits for an 8-way set-associative cache? Answer: 23 tag bits and 7 index bits.
Which block to replace in an associative cache?
The basic principle is "least recently used" (LRU) - replace the block that is oldest. Keeping track of a block's "age" is done in hardware. It is practical for small set-associativity (2-way or 4-way); for higher associativity, LRU is either approximated or replacement is random. For a 2-way set-associative cache, random replacement has a 10% higher miss rate than LRU. As caches become larger, the miss rates of both strategies fall and the difference between the two shrinks.
Exercise
Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.
A 2-way cache has half the number of sets of a same-size direct-mapped cache. Pick three addresses A, B, C that all map to the same set of the 2-way cache, with A and C (but not B) mapping to the same block of the direct-mapped cache.
Direct-mapped: the sequence A, B, C, A, B, C generates Miss, miss, miss, miss, Hit, miss - B survives in its own block.
2-way with LRU: the same sequence generates Miss, miss, miss, miss, miss, miss - each reference evicts the block that will be needed next.
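A quick simulation (with assumed addresses A=0, B=2, C=4 and a 4-block cache) confirms the counter-example:

```python
# Compare a 4-block direct-mapped cache with a same-size 2-set 2-way LRU cache.
from collections import OrderedDict

def direct_mapped(refs, num_blocks):
    cache, misses = {}, 0
    for a in refs:
        idx = a % num_blocks
        if cache.get(idx) != a:
            misses += 1
            cache[idx] = a
    return misses

def two_way_lru(refs, num_sets):
    sets, misses = {s: OrderedDict() for s in range(num_sets)}, 0
    for a in refs:
        s = sets[a % num_sets]
        if a in s:
            s.move_to_end(a)            # refresh LRU order on a hit
        else:
            misses += 1
            if len(s) == 2:
                s.popitem(last=False)   # evict the least recently used block
            s[a] = True
    return misses

# A=0, B=2, C=4: all map to set 0 of the 2-way cache, while A and C collide in
# block 0 of the direct-mapped cache and B has block 2 to itself.
refs = [0, 2, 4, 0, 2, 4]
print(direct_mapped(refs, 4), two_way_lru(refs, 2))  # 5 vs. 6 misses
```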
Exercise
Suppose a computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of 2, so B = 2^b. Figure out the following quantities:
- the number of sets in the cache
- the number of index bits in the address
- the number of bits needed to implement the cache
Given: address size = k bits; cache size = S bytes; block size = B = 2^b bytes/block; associativity = A blocks/set.
Sets/cache = (bytes/cache) / (bytes/set) = (bytes/cache) / (blocks/set × bytes/block) = S / (A × B)
Exercise - continued
Index bits: 2^(#index bits) = sets/cache = S / (A × B), so
#index bits = log2(S / (A × B)) = log2(S / (A × 2^b)) = log2(S/A) - log2(2^b) = log2(S/A) - b
Tag address bits = total address bits - index bits - block offset bits
= k - [log2(S/A) - b] - b = k - log2(S/A)
Bits in tag memory/cache = tag address bits/block × blocks/set × sets/cache
= [k - log2(S/A)] × A × S/(A × B) = (S/B) × [k - log2(S/A)]
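The formulas above can be bundled into one hypothetical helper and checked on an example configuration (32-bit addresses, 16 KB cache, 16-byte blocks, 4-way - assumed values for illustration):

```python
# Geometry of an A-way set-associative cache: k-bit byte addresses,
# S-byte cache, B = 2^b byte blocks.
from math import log2

def cache_geometry(k, S, B, A):
    b = int(log2(B))                        # block offset bits
    sets = S // (A * B)                     # sets/cache = S / (A*B)
    index_bits = int(log2(sets))            # = log2(S/A) - b
    tag_bits = k - index_bits - b           # = k - log2(S/A)
    tag_memory_bits = tag_bits * (S // B)   # tag bits per block x total blocks
    return sets, index_bits, tag_bits, tag_memory_bits

print(cache_geometry(32, 16384, 16, 4))  # (256, 8, 20, 20480)
```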
Virtual Memory
When multiple applications (processes) run at the same time, main memory (DRAM) becomes too small. Virtual memory extends the memory hierarchy to the hard disk and treats the RAM as a "cache" for the disk. Each process is allocated a portion of the RAM, and each program has its own range of physical memory addresses. Virtual memory "translates" (maps) the virtual addresses of each program to physical addresses in main memory. Protections have to be in place in case of data sharing.
Virtual Memory
The CPU generates virtual addresses, while memory is accessed with physical addresses. The memory is treated as a fully associative cache and divided into pages. Translation eliminates the need to find a contiguous block of memory to allocate to a program, and it also lets pages be shared between processes (shared memory).
Virtual Memory - continued
The translation mechanism maps the CPU's 32-bit virtual address to the real physical address using a virtual page number and a page offset. The virtual address space is much larger than the physical address space (2^20 vs. 2^18 pages, i.e., 4 GB vs. 1 GB of RAM) - the illusion of infinite memory.
Virtual Memory - continued
The number of page offset bits determines the page size (typically 4 KB to 16 KB), which should be large enough to reduce the chance of a page fault. A page fault costs millions of clock cycles and is handled in software through the exception mechanism. Software can reduce page faults by cleverly deciding which pages to replace in DRAM (older pages). Pages always exist on the hard disk but are loaded into DRAM only when needed. A write-back mechanism ensures that pages that were altered (written to in RAM) are saved to disk before being discarded.
Virtual Memory - continued
The translation mechanism is provided by a page table. Each program has its own page table, which contains the physical addresses of its pages and is indexed by the virtual page number. When a program takes possession of the CPU, the OS loads its page table pointer, and its page table is read. Since each process has its own page table, programs can have the same virtual address space, because the page tables hold different mappings for different programs (protection).
Page Table
- The size of the page table has to be limited, so that no one process gobbles up the whole physical memory.
- A register points to the location of the first address of the page table of the active process.
- A valid bit indicates whether the page is in DRAM.
- The page table of a process is not fixed: it is altered by the OS to ensure different processes do not collide, and on page faults.
Page Faults
If the valid bit is 0, we have a page fault - the address points to a page on the hard disk. The page must be loaded into DRAM by the OS, the page table updated to map to the new address in physical memory, and the valid bit set to 1. If physical memory is full, an existing page must be discarded before a new page is loaded from disk. The OS uses a least-recently-used scheme: a reference bit for each physical page is set whenever that page is accessed; the OS inspects these bits and also periodically resets them (a statistical LRU). A dirty bit is added to the page table to indicate whether the page was altered - if so, it must be saved to disk before being discarded.
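The reference-bit scheme can be sketched as follows; this is a simplified illustration of the idea (function names and the tiny 3-frame memory are invented here), not the actual OS algorithm:

```python
# Statistical LRU via reference bits: pages get their bit set on access; the
# OS periodically clears all bits and prefers to evict pages whose bit is 0.
def access(frames, capacity, page):
    """Access `page`; on a fault, evict a not-recently-referenced page if full."""
    if page in frames:
        frames[page] = 1              # set the reference bit on use
        return "hit"
    if len(frames) >= capacity:
        victim = next((p for p, ref in frames.items() if ref == 0),
                      next(iter(frames)))   # fall back to any page
        del frames[victim]
    frames[page] = 1
    return "fault"

def clear_reference_bits(frames):
    """Done periodically by the OS - the 'statistical' part of the LRU."""
    for p in frames:
        frames[p] = 0

frames = {}
for p in [1, 2, 3]:
    access(frames, 3, p)       # three faults fill the 3-frame memory
clear_reference_bits(frames)
access(frames, 3, 2)           # hit: page 2's reference bit is set again
print(access(frames, 3, 4))    # fault: evicts a ref-bit-0 page, not page 2
```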
Translation-Lookaside Buffer (TLB)
To optimize the translation process and reduce memory access time, the TLB is a cache that holds recently used page table mappings. TLB tags hold the virtual page number, and the TLB data holds the corresponding physical page number. The TLB also holds the reference bit, valid bit, and dirty bit. On a TLB miss, either the page is in the page table and the mapping is loaded into the TLB (much more frequent), or the page is not in the page table, which raises a page fault exception. On a miss the CPU selects which TLB entry to replace; its reference and dirty bits are then written back into the page table. Miss rates for the TLB are 0.01-1%, and the penalty is 10-100 clock cycles - much smaller than a page fault!
Example: consider a virtual memory system with 40-bit virtual byte addresses, 16 KB pages, and 36-bit physical byte addresses. What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all virtual pages are in use? Assume that disk addresses are not stored in the page table.
Page table size = #entries × entry size.
#entries = #virtual pages = 2^40 bytes / 16 KB per page = 2^40 / 2^14 = 2^26 entries.
The width of each entry is 4 + 36 = 40 bits, so the page table size is 2^26 × 40 bits = 5 × 2^26 bytes ≈ 335 MB.
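Checking the arithmetic:

```python
# Page-table size: 40-bit virtual addresses, 16 KB (2^14 byte) pages,
# 40-bit entries (4 status bits + a 36-bit physical address).
entries = 2**40 // 2**14        # one entry per virtual page -> 2^26
entry_bits = 4 + 36
size_bytes = entries * entry_bits // 8
print(entries, size_bytes)      # 67108864 entries, 335544320 bytes (~335 MB)
```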
TLB and cache working together (Intrinsity FastMATH processor)
4 KB pages; the TLB has 16 entries and is fully associative, so all entries must be compared. Each entry is 64 bits: 20 tag bits (the virtual page number), 20 data bits (the physical page number), plus valid, reference, dirty bits, etc. One of the extra bits is a write access bit, which prevents programs from writing to pages for which they have only read access - part of the protection mechanism. There can be three kinds of misses: a cache miss, a TLB miss, and a page fault. A TLB miss in this case takes 16 cycles on average. On a page fault, the CPU saves the process state, gives control of the CPU to another process, and then brings the page in from disk.
How are TLB misses and Page Faults handled?
TLB miss - no entry in the TLB matches the virtual address. In that case, if the page is in memory (as indicated by the page table), its translation is placed in the TLB; the TLB miss is thus handled by the OS in software. Once the TLB holds the translation, the instruction that caused the TLB miss is re-executed. If the valid bit of the retrieved page table entry is 0, we have a page fault. When a page fault occurs, the OS takes control and stores the state of the process that caused the fault, with the address of the faulting instruction saved in the EPC.
How are TLB misses and Page Faults handled?
The OS then finds a place for the page by discarding an old one (if it was dirty, it first has to be saved to disk). After that, the OS starts the transfer of the needed page from the hard disk and gives control of the CPU to another process (millions of cycles). Once the page has been transferred, the OS reads the EPC and returns control to the offending process so the faulting instruction can complete. If the instruction that caused the page fault was a sw, the write control line for data memory is de-asserted to prevent the sw from completing. When an exception occurs, the processor sets a bit that disables further exceptions, so that a subsequent exception cannot overwrite the EPC.
The influence of Block Size
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty - it takes longer to fill up the block.
- If the block size is too big relative to the cache size, the miss rate will go up - too few cache blocks compromises temporal locality.
In general, Average Access Time = Hit Time × (1 - Miss Rate) + Miss Penalty × Miss Rate.
[Figures: miss penalty grows with block size; miss rate first falls with block size (exploiting spatial locality) and then rises again when blocks become too few; average access time therefore has a minimum at an intermediate block size.]
The Influence of Associativity
Every change that improves the miss rate can also negatively affect overall performance. For example, we can reduce the miss rate by increasing associativity (a 30% gain for small caches going from direct-mapped to two-way set-associative). But high associativity does not make sense for modern caches, which are large: the hardware costs more (more comparators) and the access time grows. While full associativity does not pay for caches, for paged memory it is good, because misses are very expensive; a large page size also keeps the page table small.
The influence of associativity (SPEC2000)
[Figure: miss rate vs. associativity, shown for small caches and for large caches.]
Memory write options
There are two options: write-through (for caches) and write-back (for paged memory). With write-back, pages are written to disk only if they were modified before being replaced. The advantages of write-back are that multiple writes to a given page require only one write to the disk, performed at high bandwidth rather than one word at a time, and that individual words can be written into a page much faster (at cache rate) than if they were written through to disk. The advantage of write-through is that misses are simpler to handle and easier to implement (using a write buffer). In the future, more caches will use write-back because of the CPU-memory gap.
Processor-DRAM Memory Gap (latency)
[Figure: processor performance grows far faster than DRAM performance, which improves only about 7% annually, so the gap widens every year.]
Solutions to reduce the gap:
- L3 caches
- Have the L2 and L3 caches do something useful while idle
Sources of (Cache) Misses
- Compulsory (cold start or process migration; first reference): the first access to a block. A "cold" fact of life - not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant.
- Conflict (collision): multiple memory locations (blocks) mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
- Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size.
- Invalidation: another process (e.g., I/O) updates memory.
Total Miss Rate vs. Cache Type and Size
[Figure: total miss rate broken down by cause. Additional conflict misses appear when going from two-way to one-way associativity, and from four-way to two-way; capacity misses shrink for larger caches.]
Design alternatives
- Increase cache size: decreases capacity misses; may increase access time.
- Increase associativity: decreases the conflict miss rate; may increase access time.
- Increase block size: decreases the miss rate due to spatial locality, but increases the miss penalty; very large blocks may increase the miss rate for small caches.
So the design of memory hierarchies is interesting.
Processor-DRAM Memory Gap for Multi-cores
[Figure: as the number of cores grows, the gap causes performance degradation for memory-intensive applications.]
A solution is "3-D chips", which stack the DRAM on the processor chip.