Topics
14:332:331 - The Memory Hierarchy
A Typical Memory Hierarchy
Architectures need to create the illusion of unlimited and fast memory. To do so, we rely on the fact that applications do not access all of program memory or data memory at once; rather, programs access a relatively small portion of their address space at any moment in time.
Temporal locality - a referenced item in memory tends to be referenced again soon (loops).
Spatial locality - if an item is referenced, items adjacent to it tend to be referenced soon (arrays).
[Figure: on-chip components (RegFile, Instr Cache, Data Cache, ITLB, DTLB, Control, Datapath), backed by a Second-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk).]
Speed: 0.5 ns (RegFile), 1 ns (L1 caches), 5 ns (L2 cache), 50-70 ns (DRAM), 5,000,000 ns (disk)
Size: 128 B, 64 KB, 256 KB, 4 GB, TBs
Cost/GB: highest, $10,000, $100, $1, lowest
These principles allow memory to be organized as a hierarchy of multiple levels with different access speeds and sizes.
A Typical Memory Hierarchy
Since main memory is much slower, to avoid pipeline stalls the data and instructions that the CPU will need soon should be moved into the cache(s). Even though memory consists of multiple levels, data is copied only between two adjacent levels at a time. Within each level, the unit of information that is present or not is called a block. Blocks can be either single-word (32 bits wide) or multiple-word.
How is the Hierarchy Managed?
When the data that the CPU needs is present in the cache, it is a hit. When the needed data is not present, it is a miss, and misses carry penalties: on a miss, the pipeline is frozen until the data is fetched from main memory, which hurts performance. Data must always be present in the lowest level of the hierarchy.
Temporal locality - keep the most recently accessed data items closer to the processor.
Spatial locality - move blocks consisting of contiguous words to the upper levels.
How do we apply the principle of locality?
[Figure: the processor exchanges words with an upper-level memory holding Blk X; the upper level exchanges blocks with a lower-level memory holding Blk Y. Data cannot be present in the upper level if it is not present in the lower level.]
Hit: data needed by the CPU appears in some block in the upper level (Blk X).
Hit Rate: the fraction of accesses found in the upper level.
Hit Time: time to access the upper level = RAM access time + time to determine whether the access is a hit or a miss.
Miss: data needs to be retrieved from a block in the lower level (Blk Y).
Miss Rate = 1 - Hit Rate.
Miss Penalty: time to replace a block in the upper level with a block from the lower level + time to deliver this block to the processor.
Hit Time << Miss Penalty.
In general, Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty.
Recall that CPU time = CPU execution time (including hits) + memory-stall time, where
Memory-stall time = (read-stall cycles + write-stall cycles) × clock cycle time
Memory-stall cycles = Reads/program × read miss rate × miss penalty + Writes/program × write miss rate × miss penalty (ignoring write-buffer stalls), or simply
Memory-stall cycles = Misses/program × miss penalty.
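These formulas translate directly into code; a minimal sketch (the 5% miss rate, 100-cycle penalty, and 0.03 misses/instruction are assumed example values, not from the slides):

```python
# Average Memory Access Time and memory-stall cycles, in clock cycles.
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

def mem_stall_cycles(misses_per_instr, miss_penalty):
    """Memory-stall cycles = Misses/program x Miss penalty (per instruction here)."""
    return misses_per_instr * miss_penalty

# Assumed example: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))            # 6.0 cycles per access on average
print(mem_stall_cycles(0.03, 100))   # 3.0 stall cycles per instruction
```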
Miss Penalty
What is the degrading influence of misses (stalls)?

Performance(perfect cache) / Performance(cache with stalls)
= CPU time with stalls / CPU time without stalls
= (CPU exec time + memory-stall time) / CPU exec time
= (#Instr × CPI_withstall × clock cycle time) / (#Instr × CPI_perfect × clock cycle time)
= CPI_withstall / CPI_perfect

CPI_withstall depends on the application:
CPI_withstall = CPI_perfect + CPI_misses = CPI_perfect + CPI_miss.instr + CPI_miss.data

If the miss penalty is 100 cycles, the instruction miss rate is 2%, the data-cache miss rate is 4%, the data memory access rate is 36% (gcc example), and CPI_perfect is 2, then
CPI_misses = (2% + 36% × 4%) × 100 = 3.44
CPI_withstall / CPI_perfect = (2 + 3.44) / 2 = 2.72, i.e., a 172% degradation.
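The gcc numbers above can be checked with a few lines:

```python
# Recomputing the gcc example: CPI_perfect = 2, 100-cycle miss penalty,
# 2% instruction miss rate, 4% data miss rate, 36% of instructions access data.
miss_penalty = 100
cpi_perfect = 2
cpi_misses = (0.02 + 0.36 * 0.04) * miss_penalty   # extra cycles per instruction
cpi_with_stall = cpi_perfect + cpi_misses
degradation = cpi_with_stall / cpi_perfect         # slowdown vs. a perfect cache
print(cpi_misses, degradation)                     # 3.44 and 2.72
```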
Miss Penalty
Increasing the clock rate will not solve the problem - doubling the clock rate, for example, means the miss penalty goes from 100 to 200 cycles. In that case:

Performance(fast clock, with stalls) / Performance(slow clock, with stalls)
= (IC × CPI_stall,slow × clock cycle time) / (IC × CPI_stall,fast × clock cycle time / 2)
= [(2% + 36% × 4%) × 100 + 2] × 2 / [(2% + 36% × 4%) × 200 + 2]
= 10.88 / 8.88 = 1.23

Performance grows 23%, not 100%! We need a way to reduce both the miss rate (%) and the miss penalty (100-200 cycles). The miss penalty is reduced by using a multi-level cache: a miss in the primary cache means data is retrieved from the secondary cache (faster - fewer cycles) instead of main memory. Reducing the miss rate depends on the cache architecture, as we will see shortly.
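A rough sketch of why a second-level cache shrinks the effective penalty; the L1/L2 numbers here (10-cycle L2 access, 25% L2 miss rate) are assumed for illustration only:

```python
# Two-level cache effect on Average Memory Access Time (illustrative numbers).
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    # On an L1 miss we pay the L2 access time; only an L2 miss goes to DRAM.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)

one_level = 1 + 0.05 * 100                          # L1 only: 6.0 cycles
two_level = amat_two_level(1, 0.05, 10, 0.25, 100)  # 1 + 0.05*(10 + 25) = 2.75
print(one_level, two_level)
```

With these assumed rates, most L1 misses are caught by the faster L2 instead of paying the full 100-cycle DRAM penalty.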
The Simplest Cache
The simplest cache has blocks of one-word size. Direct-mapped cache - each memory location (32-bit memory address) maps to exactly one cache location. But main memory is much larger than the cache, so several memory addresses map to one cache location.
Question 1: How do we know whether the requested word is in the cache or not?
Question 2: How do we know if the data in the cache corresponds to the requested word or not?
Question 3: How do we know if the data found in the cache is valid?
Each block in the cache is indexed, allowing each block to be addressed when the CPU is looking for instructions or data stored in the cache. A tag identifies which memory location corresponds to that particular block in the cache. The tag contains the upper portion of the address, while the lower portion is used as the index. Bits 0 and 1 (the byte offset) are not used. The first bit in each cache block is a valid bit, which tells the cache controller whether the data in that block is valid or not.
Cache - example
The cache is initially empty; all valid bits are off.
CPU requests address 10110 - miss
CPU requests address 11010 - miss
CPU requests address 10110 - hit
CPU requests address 11010 - hit
CPU requests address 10000 - miss
CPU requests address 00011 - miss
CPU requests address 10000 - hit
CPU requests address 10010 - miss
Temporal locality - recently accessed words replace less recently accessed words.
On an instruction miss, the control unit stalls the pipeline, PC-4 (the address of the instruction that missed) is sent to memory, memory performs the read, the data is placed in the cache slot selected by the lower (index) bits, the tag field is written with the upper address bits, the valid bit is set to 1, and the instruction is re-fetched.
MIPS Direct Mapped Cache Example
[Figure: the memory address (bits 31-0) is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 cache entries, each holding a valid bit, a 20-bit tag, and a 32-bit data word. The stored tag is compared with the address tag; a match with valid = 1 is a hit, and the 32-bit data word is returned.]
Example: the series of memory address references, given as word addresses, is 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assume a direct-mapped cache with 16 one-word blocks that is initially empty. Label each reference in the list as a hit or a miss and show the final contents of the cache.

Reference  Binary   Hit/Miss
1          00 0001  Miss
4          00 0100  Miss
8          00 1000  Miss
5          00 0101  Miss
20         01 0100  Miss
17         01 0001  Miss
19         01 0011  Miss
56         11 1000  Miss
9          00 1001  Miss
11         00 1011  Miss
4          00 0100  Miss
43         10 1011  Miss
5          00 0101  Hit
6          00 0110  Miss
9          00 1001  Hit
17         01 0001  Hit

Final contents by index (showing replacements):
0001: 1, then 17
0011: 19
0100: 4, then 20, then 4
0101: 5
0110: 6
1000: 8, then 56
1001: 9
1011: 11, then 43
All other blocks remain invalid.
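A small direct-mapped cache simulator (a hypothetical helper; word addresses, with the full address standing in for the tag) can verify the table above:

```python
# Direct-mapped cache simulation: 16 one-word blocks, index = address mod 16.
def simulate(refs, num_blocks=16):
    cache = {}   # index -> word address currently stored in that block
    results = []
    for addr in refs:
        index = addr % num_blocks
        if cache.get(index) == addr:   # same address -> tag match -> hit
            results.append("Hit")
        else:
            results.append("Miss")
            cache[index] = addr        # replace the block on a miss
    return results, cache

refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
results, final = simulate(refs)
print(results.count("Hit"), results.count("Miss"))  # 3 hits, 13 misses
```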
Cache size
How large does the cache need to be? Each row in the cache consists of the valid bit + tag bits + data bits. Assuming n index bits, and thus 2^n rows in the cache,
cache size = 2^n × [32 + (32 - n - 2) + 1] = 2^n × (63 - n) bits.
For 64 KB of data in a cache with 1-word blocks: since 1 word = 4 bytes, the cache must hold 16K words = 16K blocks; 2^14 = 16,384, so n = 14. The cache size is 2^14 × (63 - 14) = 16,384 × 49 = 802,816 bits ≈ 100 KB.
For 256 KB of data with 1-word blocks: 64K words, n = 16, so the total size is 2^16 × (63 - 16) = 65,536 × 47 = 3,080,192 bits ≈ 385 KB.
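The 2^n × (63 - n) formula is easy to check:

```python
# Cache size in bits for a 32-bit machine with 1-word blocks and n index bits:
# 2^n rows of (1 valid bit + (32 - n - 2) tag bits + 32 data bits).
def cache_bits(n):
    return 2**n * (32 + (32 - n - 2) + 1)   # = 2^n * (63 - n)

print(cache_bits(14))  # 802816 bits for 64 KB of data
print(cache_bits(16))  # 3080192 bits for 256 KB of data
```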
[Figure: a 64 KB cache with 16-byte (4-word) blocks. The address splits into a 16-bit tag, a 12-bit index, a 2-bit block offset selecting one of the four words, and a 2-bit byte offset. A tag match with valid = 1 produces a hit and a 32-bit data word.]
[Figure: a 16 KB cache with 16-word (64-byte) blocks. The address splits into an 18-bit tag, an 8-bit index, a 4-bit block offset selecting one of the 16 words, and a 2-bit byte offset. A tag match with valid = 1 produces a hit and a 32-bit data word.]
Cache size
The same quantity of data can be accommodated by a cache that has larger blocks (4 words each) but fewer index bits. Using several words per block improves the miss rate vs. single-word blocks - spatial locality. For the same application, gcc's 5.4% miss rate (1-word blocks) goes down to 1.9% (4-word blocks); for spice the miss rate goes from 1.2% (1-word blocks) to 0.4% (4-word blocks). However, the miss penalty increases, since now 4 words must be loaded at once (more cycles spent in a stall). Early restart - deliver the requested word first (works for the instruction cache if memory can deliver one instruction per cycle).
Cache size
If blocks become too large relative to the cache size, the number of blocks for a given cache size becomes too small and the miss rate goes back up.
[Figure: miss rate vs. block size for 4 KB, 16 KB, 64 KB, and 256 KB caches, based on SPEC92.]
Handling writes
When the CPU writes data to the cache, it should also be written to main memory to keep the two consistent (called write-through). Writing to main memory is slow and reduces performance (e.g., a write to main memory may take 100 cycles; if 10% of instructions are sw, CPI becomes 1 + 100 × 0.1 = 11 cycles/instruction vs. 1). Thus we use a write buffer, which should be several words deep to absorb write "bursts". To avoid buffer overflow (which corrupts data), the rate at which main memory drains the buffer must exceed the rate at which the CPU fills it. An alternative is write-back: write to memory only when a cache block is being replaced.
Memory systems that support caches
DRAMs are designed to increase density, not access time. To reduce the miss penalty we need to change the memory access design to increase throughput. Three organizations:
- One-word-wide memory: sequential access to the words in a block.
- Wide memory: parallel access to all words in a block.
- Interleaved memory: simultaneous memory reads across banks.
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
Memory Systems that Support Caches
One-word-wide organization (one-word-wide bus and one-word-wide memory): the CPU and cache are on-chip, connected to the DRAM memory by a bus carrying 32-bit data and a 32-bit address per cycle. Assume:
- 1 clock cycle (2 ns) to send the address
- 25 clock cycles (50 ns) for the DRAM cycle time
- 1 clock cycle (2 ns) to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.
One Word Wide Memory Organization
If the block size is one word, then on a cache miss the pipeline stalls for the number of cycles required to return one data word from memory:
1 cycle to send the address + 25 cycles to read the DRAM + 1 cycle to return the data = 27 clock cycles of miss penalty.
Bandwidth for a single miss: 4 bytes / 27 cycles = 0.148 bytes per clock cycle.
One Word Wide Memory Organization
What if the block size were four words? With a one-word-wide memory, each word needs its own 25-cycle DRAM read:
1 cycle to send the first address + 4 × 25 = 100 cycles to read the DRAM + 1 cycle to return the last data word = 102 clock cycles of miss penalty.
Bandwidth for a single miss: (4 × 4 bytes) / 102 cycles = 0.157 bytes per clock cycle.
Interleaved Memory Organization
With four memory banks (bank 0 through bank 3) on the bus, the four 25-cycle DRAM reads overlap. For a block size of four words:
1 cycle to send the first address + 25 + 3 = 28 cycles to read the DRAMs + 1 cycle to return the last data word = 30 clock cycles of miss penalty.
Bandwidth for a single miss: (4 × 4 bytes) / 30 cycles = 0.533 bytes per clock cycle (4.264 bits/clock cycle).
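The three miss penalties can be recomputed, following the slides' accounting (1 cycle for the address, 25-cycle DRAM reads, 1 cycle to return the last word):

```python
# Miss penalties for the memory organizations above, in clock cycles.
def one_word_wide(words):
    return 1 + words * 25 + 1        # DRAM reads fully serialized on a 1-word bus

def interleaved(words):
    return 1 + 25 + (words - 1) + 1  # 4 banks overlap; later words arrive 1 cycle apart

print(one_word_wide(1))   # 27 cycles: 1-word blocks
print(one_word_wide(4))   # 102 cycles: 4-word blocks, one-word-wide memory
print(interleaved(4))     # 30 cycles: 4-word blocks, interleaved banks
print(round(16 / interleaved(4), 3))  # 0.533 bytes per clock cycle
```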
Further Improvements to Memory Organization (DDR SDRAMs)
An external clock (e.g., 300 MHz) synchronizes memory addresses. Example: a 4 Mbit DRAM outputs one bit from the array, using 2048 column latches and one multiplexor. An SDRAM is given the starting address and a burst length of 2/4/8, so successive addresses need not be provided. DDR (double data rate) transfers data on both the rising and falling edges of the external clock. In 1980, DRAMs were 64 Kbit with a 150 ns column access to an existing row; in 2004, DRAMs were 1024 Mbit with a 3 ns column access to an existing row.
Further Improvements - two-level cache
The figure shows the AMD Athlon and Duron processor architecture. Two-level caches allow the L1 cache to be smaller, which improves the hit time, since smaller caches are faster. The L2 cache is larger and its access time is less critical, so it uses a larger block size. L2 is accessed whenever a miss occurs in L1, which reduces the L1 miss penalty dramatically. L2 also stores the contents of the "victim buffer" - data evicted from the L1 cache when an L1 miss occurs.
Reducing Cache Misses through Associativity
Recall that a direct-mapped cache allows one memory location to map to only one block in the cache (using tags) - it needs only one comparator. In a fully associative cache, a block in memory can map to any block in the cache, so all entries must be searched; this is done in parallel, with one comparator per cache block - expensive in hardware, so it works only for small caches. In between the two extremes are set-associative caches: a block in memory maps to exactly one set of blocks, but can occupy any position within that set.
Reducing Cache Misses through Associativity
An n-way set-associative cache has sets of n blocks each. All blocks in the set have to be searched, which reduces the number of comparators to n.
- One-way set-associative (same as direct-mapped)
- Two-way set-associative
- Four-way set-associative
- Eight-way set-associative (the same as fully associative for this eight-block example)
As associativity increases the miss rate decreases (1-way: 10.3%, 8-way: 8.1% data miss rate), but the hit time increases.
For set-associative caches, any doubling of associativity decreases the number of index bits by one and increases the number of tag bits by one. For a fully associative cache there are no index bits, since there is only one set.
[Figure: a 4-way set-associative cache with a 22-bit tag and an 8-bit index; the four blocks of the selected set are checked in parallel by four comparators, and a multiplexor selects the 32-bit data word of the matching way.]
Recall the Direct Mapped Cache
[Figure: the earlier direct-mapped cache - 20-bit tag, 10-bit index (1024 entries), 2-bit byte offset, one comparator, 32-bit data.]
It had 20 tag bits vs. 22 for the 4-way set-associative cache, and 10 index bits vs. 8 for the 4-way set-associative cache. How many tag and index bits for an 8-way set-associative cache? Answer: 23 tag bits and 7 index bits.
Which block to replace in an associative cache?
The basic principle is "least recently used" (LRU) - replace the block that is oldest. Keeping track of a block's "age" is done in hardware. It is practical for small set-associativity (2-way or 4-way); for higher associativity, LRU is either approximated or replacement is random. For a 2-way set-associative cache, random replacement has a 10% higher miss rate than LRU. As caches become larger, the miss rates of both strategies fall and the difference between the two shrinks.
Exercise
Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.
A 2-way cache has half the number of sets of a same-size direct-mapped cache. Pick three addresses A, B, C that all map to the same set of the 2-way cache, with A and C (but not B) mapping to the same block of the direct-mapped cache.
Direct-mapped: the sequence A, B, C, A, B, C generates Miss, miss, miss, miss, Hit, miss - B survives in its own block.
2-way with LRU: the same sequence generates Miss, miss, miss, miss, miss, miss - each reference evicts the block that will be needed next.
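A quick simulation (with assumed addresses A=0, B=2, C=4 and a 4-block cache) confirms the counter-example:

```python
# Compare a 4-block direct-mapped cache with a same-size 2-set 2-way LRU cache.
from collections import OrderedDict

def direct_mapped(refs, num_blocks):
    cache, misses = {}, 0
    for a in refs:
        idx = a % num_blocks
        if cache.get(idx) != a:
            misses += 1
            cache[idx] = a
    return misses

def two_way_lru(refs, num_sets):
    sets, misses = {s: OrderedDict() for s in range(num_sets)}, 0
    for a in refs:
        s = sets[a % num_sets]
        if a in s:
            s.move_to_end(a)            # refresh LRU order on a hit
        else:
            misses += 1
            if len(s) == 2:
                s.popitem(last=False)   # evict the least recently used block
            s[a] = True
    return misses

# A=0, B=2, C=4: all map to set 0 of the 2-way cache, while A and C collide in
# block 0 of the direct-mapped cache and B has block 2 to itself.
refs = [0, 2, 4, 0, 2, 4]
print(direct_mapped(refs, 4), two_way_lru(refs, 2))  # 5 vs. 6 misses
```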
Exercise
Suppose a computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of 2, so B = 2^b. Figure out the following quantities:
- the number of sets in the cache
- the number of index bits in the address
- the number of bits needed to implement the cache
Given: address size = k bits; cache size = S bytes; block size = B = 2^b bytes/block; associativity = A blocks/set.
Sets/cache = (bytes/cache) / (bytes/set) = (bytes/cache) / (blocks/set × bytes/block) = S / (A × B)
Exercise - continued
Index bits: 2^(#index bits) = sets/cache = S / (A × B), so
#index bits = log2(S / (A × B)) = log2(S / (A × 2^b)) = log2(S/A) - log2(2^b) = log2(S/A) - b
Tag address bits = total address bits - index bits - block offset bits
= k - [log2(S/A) - b] - b = k - log2(S/A)
Bits in tag memory/cache = tag address bits/block × blocks/set × sets/cache
= [k - log2(S/A)] × A × S/(A × B) = (S/B) × [k - log2(S/A)]
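The formulas above can be bundled into one hypothetical helper and checked on an example configuration (32-bit addresses, 16 KB cache, 16-byte blocks, 4-way - assumed values for illustration):

```python
# Geometry of an A-way set-associative cache: k-bit byte addresses,
# S-byte cache, B = 2^b byte blocks.
from math import log2

def cache_geometry(k, S, B, A):
    b = int(log2(B))                        # block offset bits
    sets = S // (A * B)                     # sets/cache = S / (A*B)
    index_bits = int(log2(sets))            # = log2(S/A) - b
    tag_bits = k - index_bits - b           # = k - log2(S/A)
    tag_memory_bits = tag_bits * (S // B)   # tag bits per block x total blocks
    return sets, index_bits, tag_bits, tag_memory_bits

print(cache_geometry(32, 16384, 16, 4))  # (256, 8, 20, 20480)
```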
Virtual Memory
When multiple applications (processes) run at the same time, main memory (DRAM) becomes too small. Virtual memory extends the memory hierarchy to the hard disk and treats the RAM as a "cache" for the disk. Each process is allocated a portion of the RAM, and each program has its own range of physical memory addresses. Virtual memory "translates" (maps) the virtual addresses of each program to physical addresses in main memory. Protections have to be in place in case of data sharing.
Virtual Memory
The CPU generates virtual addresses, while memory is accessed with physical addresses. The memory is treated as a fully associative cache and divided into pages. Translation eliminates the need to find a contiguous block of memory to allocate to a program, and it also lets pages be shared between processes (shared memory).
Virtual Memory - continued
The translation mechanism maps the CPU's 32-bit virtual address to the real physical address using a virtual page number and a page offset. The virtual address space is much larger than the physical address space (2^20 vs. 2^18 pages, i.e., 4 GB vs. 1 GB of RAM) - the illusion of infinite memory.
Virtual Memory - continued
The number of page offset bits determines the page size (typically 4 KB to 16 KB), which should be large enough to reduce the chance of a page fault. A page fault costs millions of clock cycles and is handled in software through the exception mechanism. Software can reduce page faults by cleverly deciding which pages to replace in DRAM (older pages). Pages always exist on the hard disk but are loaded into DRAM only when needed. A write-back mechanism ensures that pages that were altered (written to in RAM) are saved to disk before being discarded.
Virtual Memory - continued
The translation mechanism is provided by a page table. Each program has its own page table, which contains the physical addresses of its pages and is indexed by the virtual page number. When a program takes possession of the CPU, the OS loads its page table pointer, and its page table is read. Since each process has its own page table, programs can have the same virtual address space, because the page tables hold different mappings for different programs (protection).
Page Table
- The size of the page table has to be limited, so that no one process gobbles up the whole physical memory.
- A register points to the location of the first address of the page table of the active process.
- A valid bit indicates whether the page is in DRAM.
- The page table of a process is not fixed: it is altered by the OS to ensure different processes do not collide, and on page faults.
Page Faults
If the valid bit is 0, we have a page fault - the address points to a page on the hard disk. The page must be loaded into DRAM by the OS, the page table updated to map to the new address in physical memory, and the valid bit set to 1. If physical memory is full, an existing page must be discarded before a new page is loaded from disk. The OS uses a least-recently-used scheme: a reference bit for each physical page is set whenever that page is accessed; the OS inspects these bits and also periodically resets them (a statistical LRU). A dirty bit is added to the page table to indicate whether the page was altered - if so, it must be saved to disk before being discarded.
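The reference-bit scheme can be sketched as follows; this is a simplified illustration of the idea (function names and the tiny 3-frame memory are invented here), not the actual OS algorithm:

```python
# Statistical LRU via reference bits: pages get their bit set on access; the
# OS periodically clears all bits and prefers to evict pages whose bit is 0.
def access(frames, capacity, page):
    """Access `page`; on a fault, evict a not-recently-referenced page if full."""
    if page in frames:
        frames[page] = 1              # set the reference bit on use
        return "hit"
    if len(frames) >= capacity:
        victim = next((p for p, ref in frames.items() if ref == 0),
                      next(iter(frames)))   # fall back to any page
        del frames[victim]
    frames[page] = 1
    return "fault"

def clear_reference_bits(frames):
    """Done periodically by the OS - the 'statistical' part of the LRU."""
    for p in frames:
        frames[p] = 0

frames = {}
for p in [1, 2, 3]:
    access(frames, 3, p)       # three faults fill the 3-frame memory
clear_reference_bits(frames)
access(frames, 3, 2)           # hit: page 2's reference bit is set again
print(access(frames, 3, 4))    # fault: evicts a ref-bit-0 page, not page 2
```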
Translation-Lookaside Buffer (TLB)
To optimize the translation process and reduce memory access time, the TLB is a cache that holds recently used page table mappings. TLB tags hold the virtual page number, and the TLB data holds the corresponding physical page number. The TLB also holds the reference bit, valid bit, and dirty bit. On a TLB miss, either the page is in the page table and the mapping is loaded into the TLB (much more frequent), or the page is not in the page table, which raises a page fault exception. On a miss the CPU selects which TLB entry to replace; its reference and dirty bits are then written back into the page table. Miss rates for the TLB are 0.01-1%, and the penalty is 10-100 clock cycles - much smaller than a page fault!
Example: consider a virtual memory system with 40-bit virtual byte addresses, 16 KB pages, and 36-bit physical byte addresses. What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all virtual pages are in use? Assume that disk addresses are not stored in the page table.
Page table size = #entries × entry size.
#entries = #virtual pages = 2^40 bytes / 16 KB per page = 2^40 / 2^14 = 2^26 entries.
The width of each entry is 4 + 36 = 40 bits, so the page table size is 2^26 × 40 bits = 5 × 2^26 bytes ≈ 335 MB.
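Checking the arithmetic:

```python
# Page-table size: 40-bit virtual addresses, 16 KB (2^14 byte) pages,
# 40-bit entries (4 status bits + a 36-bit physical address).
entries = 2**40 // 2**14        # one entry per virtual page -> 2^26
entry_bits = 4 + 36
size_bytes = entries * entry_bits // 8
print(entries, size_bytes)      # 67108864 entries, 335544320 bytes (~335 MB)
```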
TLB and cache working together (Intrinsity FastMATH processor)
4 KB pages; the TLB has 16 entries and is fully associative, so all entries must be compared. Each entry is 64 bits: 20 tag bits (the virtual page number), 20 data bits (the physical page number), plus valid, reference, dirty bits, etc. One of the extra bits is a write access bit, which prevents programs from writing to pages for which they have only read access - part of the protection mechanism. There can be three kinds of misses: a cache miss, a TLB miss, and a page fault. A TLB miss in this case takes 16 cycles on average. On a page fault, the CPU saves the process state, gives control of the CPU to another process, and then brings the page in from disk.
How are TLB misses and Page Faults handled?
TLB miss - no entry in the TLB matches the virtual address. In that case, if the page is in memory (as indicated by the page table), its translation is placed in the TLB; the TLB miss is thus handled by the OS in software. Once the TLB holds the translation, the instruction that caused the TLB miss is re-executed. If the valid bit of the retrieved page table entry is 0, we have a page fault. When a page fault occurs, the OS takes control and stores the state of the process that caused the fault, with the address of the faulting instruction saved in the EPC.
How are TLB misses and Page Faults handled?
The OS then finds a place for the page by discarding an old one (if it was dirty, it first has to be saved to disk). After that, the OS starts the transfer of the needed page from the hard disk and gives control of the CPU to another process (millions of cycles). Once the page has been transferred, the OS reads the EPC and returns control to the offending process so the faulting instruction can complete. If the instruction that caused the page fault was a sw, the write control line for data memory is de-asserted to prevent the sw from completing. When an exception occurs, the processor sets a bit that disables further exceptions, so that a subsequent exception cannot overwrite the EPC.
The influence of Block Size
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty - it takes longer to fill up the block.
- If the block size is too big relative to the cache size, the miss rate will go up - too few cache blocks compromises temporal locality.
In general, Average Access Time = Hit Time × (1 - Miss Rate) + Miss Penalty × Miss Rate.
[Figures: miss penalty grows with block size; miss rate first falls with block size (exploiting spatial locality) and then rises again when blocks become too few; average access time therefore has a minimum at an intermediate block size.]
The Influence of Associativity
Every change that improves the miss rate can also negatively affect overall performance. For example, we can reduce the miss rate by increasing associativity (a 30% gain for small caches going from direct-mapped to two-way set-associative). But high associativity does not make sense for modern caches, which are large: the hardware costs more (more comparators) and the access time grows. While full associativity does not pay for caches, for paged memory it is good, because misses are very expensive; a large page size also keeps the page table small.
The influence of associativity (SPEC2000)
[Figure: miss rate vs. associativity, shown for small caches and for large caches.]
Memory write options
There are two options: write-through (for caches) and write-back (for paged memory). With write-back, pages are written to disk only if they were modified before being replaced. The advantages of write-back are that multiple writes to a given page require only one write to the disk, performed at high bandwidth rather than one word at a time, and that individual words can be written into a page much faster (at cache rate) than if they were written through to disk. The advantage of write-through is that misses are simpler to handle and easier to implement (using a write buffer). In the future, more caches will use write-back because of the CPU-memory gap.
Processor-DRAM Memory Gap (latency)
[Figure: processor performance grows far faster than DRAM performance, which improves only about 7% annually, so the gap widens every year.]
Solutions to reduce the gap:
- L3 caches
- Have the L2 and L3 caches do something useful while idle
Sources of (Cache) Misses
- Compulsory (cold start or process migration; first reference): the first access to a block. A "cold" fact of life - not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant.
- Conflict (collision): multiple memory locations (blocks) mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
- Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size.
- Invalidation: another process (e.g., I/O) updates memory.
Total Miss Rate vs. Cache Type and Size
[Figure: total miss rate broken down by cause. Additional conflict misses appear when going from two-way to one-way associativity, and from four-way to two-way; capacity misses shrink for larger caches.]
Design alternatives
- Increase cache size: decreases capacity misses; may increase access time.
- Increase associativity: decreases the conflict miss rate; may increase access time.
- Increase block size: decreases the miss rate due to spatial locality, but increases the miss penalty; very large blocks may increase the miss rate for small caches.
So the design of memory hierarchies is interesting.
Processor-DRAM Memory Gap for Multi-cores
[Figure: as the number of cores grows, the gap causes performance degradation for memory-intensive applications.]
A solution is "3-D chips", which stack the DRAM on the processor chip.