Review of Mem. HierarchyCSCE430/830
Review of Memory Hierarchy & Storage
CSCE430/830 Computer Architecture
Lecturer: Prof. Hong Jiang
Fall, 2008
Portions of these slides are derived from: Dave Patterson © UCB
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed
Locality is a property of programs that is exploited in machine design.
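A minimal sketch of why locality matters, assuming a toy direct-mapped cache of 8 blocks with 16 bytes per block (both sizes, the function name `hit_rate`, and the three traces are illustrative choices, not taken from the slides). It replays a byte-address trace and reports the fraction of accesses that hit.

```python
def hit_rate(addresses, num_blocks=8, block_size=16):
    """Hit rate of a toy direct-mapped cache over a trace of byte addresses."""
    stored = [None] * num_blocks          # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_size        # memory block containing this byte
        line = block % num_blocks         # direct-mapped placement
        if stored[line] == block:
            hits += 1
        else:
            stored[line] = block          # miss: bring the block in
    return hits / len(addresses)

# Hypothetical traces, 64 accesses each.
sequential = list(range(0, 256, 4))       # consecutive 4-byte words: spatial locality
looped = [0, 4, 8, 12] * 16               # the same four words again and again: temporal locality
strided = list(range(0, 64 * 128, 128))   # 128-byte jumps with no reuse: no locality

print(hit_rate(sequential), hit_rate(looped), hit_rate(strided))
```

The sequential walk misses once per 16-byte block and hits on the other three words, the loop misses only on its first access, and the strided walk never touches a block twice.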
Memory Hierarchy - the Big Picture
• Problem: memory is too slow and too small
• Solution: memory hierarchy
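[Figure: the processor (control, datapath, registers) backed by an L1 on-chip cache, an L2 off-chip cache, main memory (DRAM), and secondary storage (disk)]
Level                             Speed (ns)          Size (bytes)
Registers                         0.25-0.5            <1K
Cache (L1 on-chip, L2 off-chip)   0.5-25              <16M
Main memory (DRAM)                80-250              <16G
Secondary storage (disk)          5,000,000 (5 ms)    >100G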
Fundamental Cache Questions
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed in the upper level?
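• Block 12 placed in an 8-block cache:
– Fully associative, direct mapped, 2-way set associative
– S.A. Mapping = (Block Number) Modulo (Number of Sets)
[Figure: an 8-block cache (frames 0-7) under each organization, with memory blocks 0-31 below]
– Fully associative: block 12 can go in any of the 8 frames
– Direct mapped: (12 mod 8) = 4, so block 12 can go only in frame 4
– 2-way set associative: (12 mod 4) = 0, so block 12 can go in either frame of set 0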
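The mapping rule (Block Number) modulo (Number of Sets) can be sketched as follows. The function name `candidate_frames` and the assumption that the frames of a set are numbered consecutively are illustrative, not from the slides.

```python
def candidate_frames(block_number, num_frames, ways):
    """Cache frames in which a memory block may be placed.

    ways == 1           -> direct mapped
    ways == num_frames  -> fully associative
    Assumes the frames of a set are numbered consecutively.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets   # (Block Number) mod (Number of Sets)
    first = set_index * ways
    return list(range(first, first + ways))

print(candidate_frames(12, 8, 1))   # direct mapped
print(candidate_frames(12, 8, 2))   # 2-way set associative
print(candidate_frames(12, 8, 8))   # fully associative
```

Direct mapped and fully associative fall out as the two extremes of the same rule: one frame per set, or one set holding every frame.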
Q2: How is a block found if it is in the upper level?
• Tag on each block
– No need to check index or block offset
• Increasing associativity shrinks index, expands tag
[Figure: address fields, left to right: Tag | Index | Block Offset; Tag and Index together form the Block Address]
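A minimal sketch of splitting an address into tag, index, and block offset, assuming the block size and the number of sets are powers of two. The name `split_address` and the sample address are hypothetical.

```python
def split_address(addr, block_size, num_sets):
    """Return (tag, index, block_offset) for a byte address.

    Assumes block_size and num_sets are powers of two.
    """
    offset_bits = block_size.bit_length() - 1    # log2(block_size)
    index_bits = num_sets.bit_length() - 1       # log2(num_sets)
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Same cache capacity, 16-byte blocks: doubling the associativity halves the
# number of sets, so the index loses one bit and the tag gains one.
print([hex(f) for f in split_address(0x12345678, 16, 256)])
print([hex(f) for f in split_address(0x12345678, 16, 128)])
```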
Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
– Random
– LRU (Least Recently Used)
Miss rates, LRU vs. Random (Ran):
Assoc:      2-way            4-way            8-way
Size        LRU     Ran      LRU     Ran      LRU     Ran
16 KB       5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB       1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
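LRU replacement within a single set can be sketched with an ordered dictionary. This is an illustrative software model (the name `lru_misses` and the trace are made up); real hardware typically tracks or approximates recency with a few bits per set.

```python
from collections import OrderedDict

def lru_misses(block_trace, ways):
    """Count misses for one cache set managed with LRU replacement."""
    resident = OrderedDict()                  # least recently used entry first
    misses = 0
    for block in block_trace:
        if block in resident:
            resident.move_to_end(block)       # hit: now the most recently used
        else:
            misses += 1
            if len(resident) == ways:
                resident.popitem(last=False)  # evict the least recently used block
            resident[block] = True
    return misses

trace = [1, 2, 1, 3, 1, 2]                    # hypothetical blocks mapping to one set
for ways in (1, 2, 3):
    print(ways, lru_misses(trace, ways))
```

With one way the set behaves like a direct-mapped line, which is why replacement is "easy" there: the incoming block has only one possible victim.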
Q4: What happens on a write?
• Write-Through
– Policy: data written to the cache block is also written to lower-level memory
– Debug: easy
– Do read misses produce writes? No
– Do repeated writes make it to the lower level? Yes
• Write-Back
– Policy: write data only to the cache; update the lower level when a block falls out of the cache
– Debug: hard
– Do read misses produce writes? Yes
– Do repeated writes make it to the lower level? No
• Additional option (on miss): let writes to an un-cached address allocate a new cache line (“write-allocate”).
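A minimal sketch of the difference in lower-level write traffic, assuming a cache that holds a single block, write-allocate on a miss, and a final flush of any dirty block. The function name and the trace are hypothetical.

```python
def lower_level_writes(write_trace, policy):
    """Count writes that reach the lower level for a one-block cache.

    write_trace: block numbers written by the CPU, in order.
    policy: "through" or "back".
    Assumes write-allocate and a flush of any dirty block at the end.
    """
    cached, dirty, writes = None, False, 0
    for block in write_trace:
        if policy == "through":
            writes += 1                  # every write also goes to the lower level
            cached = block
        else:
            if cached != block:
                if dirty:
                    writes += 1          # dirty block falls out of the cache
                cached = block
            dirty = True                 # update only the cached copy
    if policy == "back" and dirty:
        writes += 1                      # final flush
    return writes

trace = [7, 7, 7, 7, 9]                  # four writes to block 7, then one to block 9
print(lower_level_writes(trace, "through"), lower_level_writes(trace, "back"))
```

Repeated writes to block 7 all reach the lower level under write-through, but collapse into a single write-back when the block is evicted.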
Set Associative Cache Design
• Key idea:
– Divide cache into sets
– Allow block anywhere in a set
• Advantages:
– Better hit rate
• Disadvantages:
– More tag bits
– More hardware
– Higher access time
[Figure: A Four-Way Set-Associative Cache. The 32-bit address (bits 31-0) supplies a 22-bit tag and an 8-bit index that selects one of 256 sets (0-255); each of the four ways holds V, Tag, and Data, and the ways feed a 4-to-1 multiplexor that produces Hit and the 32-bit Data.]
Cache Performance Measures
• Hit rate: fraction of accesses found in the cache
– So high that we usually talk about Miss rate = 1 - Hit rate
• Hit time: time to access the cache
• Miss penalty: time to replace a block from lower level, including time to replace in CPU
– access time: time to access lower level
– transfer time: time to transfer block
• Average memory-access time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)
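A worked instance of the AMAT formula. The hit time (1 clock) and miss penalty (100 clocks) are assumed values chosen only for illustration; the two miss rates are the 16 KB 2-way LRU and Random entries from the replacement table.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed: 1-clock hit time and 100-clock miss penalty.
print(amat(1, 0.052, 100))   # 16 KB, 2-way, LRU
print(amat(1, 0.057, 100))   # 16 KB, 2-way, Random
```

Under these assumed latencies, half a percentage point of miss rate costs about half a clock per access.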
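Cache performance
• Miss-oriented Approach to Memory Access:
$$\text{CPUtime} = IC \times \left( CPI_{Execution} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}$$
$$\text{CPUtime} = IC \times \left( CPI_{Execution} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right) \times \text{CycleTime}$$
– $CPI_{Execution}$ includes ALU and Memory instructions
• Separating out Memory component entirely
– AMAT = Average Memory Access Time
– $CPI_{AluOps}$ does not include memory instructions
$$\text{CPUtime} = IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{AluOps} + \frac{\text{MemAccess}}{\text{Inst}} \times AMAT \right) \times \text{CycleTime}$$
$$AMAT = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$
$$= \left( \text{HitTime}_{Inst} + \text{MissRate}_{Inst} \times \text{MissPenalty}_{Inst} \right) + \left( \text{HitTime}_{Data} + \text{MissRate}_{Data} \times \text{MissPenalty}_{Data} \right)$$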
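A small worked example of the miss-oriented form, CPU time = IC x (CPI_Execution + memory accesses per instruction x miss rate x miss penalty) x cycle time. All numeric inputs below are assumed for illustration and do not come from the slides.

```python
def cpu_time(ic, cpi_execution, mem_accesses_per_inst,
             miss_rate, miss_penalty, cycle_time):
    """Miss-oriented CPU time; miss_penalty is in clock cycles."""
    memory_stall_cpi = mem_accesses_per_inst * miss_rate * miss_penalty
    return ic * (cpi_execution + memory_stall_cpi) * cycle_time

# Assumed workload: 1M instructions, CPI_Execution = 1.5, 1.3 memory accesses
# per instruction, 2% miss rate, 50-cycle miss penalty, 1 ns cycle time.
t = cpu_time(1_000_000, 1.5, 1.3, 0.02, 50, 1e-9)
print(t)   # seconds
```

With these inputs the memory stalls add 1.3 cycles per instruction, nearly doubling the effective CPI.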
Details of Page Table
• Page table maps virtual page numbers to physical frames (“PTE” = Page Table Entry)
• Virtual memory => treat memory as a cache for disk
[Figure: the virtual address (virtual page no. | 12-bit offset) is translated through the page table into the physical address (physical page no. | 12-bit offset). The Page Table Base Reg and the virtual page no. form the index into the page table, which is located in physical memory; each entry holds V, Access Rights, and PA and points to a frame in the physical memory space.]
Two-level Page Tables
• Page tables may not fit in memory!
– A table for 4KB pages for a 32-bit address space has 1M entries
– Each process needs its own address space!
• 32-bit virtual address: P1 index (bits 31-22) | P2 index (bits 21-12) | Page Offset (bits 11-0)
• Top-level table wired in main memory
• Subset of 1024 second-level tables in main memory; rest are on disk or unallocated
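A minimal sketch of a two-level lookup for the 32-bit layout on this slide: a 10-bit P1 index, a 10-bit P2 index, and a 12-bit page offset. Modelling the tables as nested dictionaries and returning `None` for a missing entry are simplifications made for illustration; the function names are hypothetical.

```python
def split_virtual_address(va):
    """Split a 32-bit virtual address into (P1 index, P2 index, page offset)."""
    p1 = (va >> 22) & 0x3FF      # bits 31-22 select the top-level entry
    p2 = (va >> 12) & 0x3FF      # bits 21-12 select the second-level entry
    offset = va & 0xFFF          # bits 11-0 are the offset within a 4KB page
    return p1, p2, offset

def translate(va, top_level):
    """Walk a two-level table stored as {p1: {p2: frame}}.

    Returns the physical address, or None when the mapping is missing
    (second-level table or page on disk or unallocated).
    """
    p1, p2, offset = split_virtual_address(va)
    second_level = top_level.get(p1)
    if second_level is None or p2 not in second_level:
        return None
    return (second_level[p2] << 12) | offset

tables = {1: {3: 0x2A}}          # hypothetical mapping: P1=1, P2=3 -> frame 0x2A
print(split_virtual_address(0x00403004))
print(translate(0x00403004, tables))
print(translate(0x00000000, tables))
```

Only the second-level tables that are actually in use need to exist, which is the space saving over a flat table with 1M entries.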
The TLB caches page table entries
• V=0 pages either reside on disk or have not yet been allocated.
– OS handles V=0: “Page fault”
• Physical and virtual pages must be the same size!
• TLB caches page table entries.
• MIPS handles TLB misses in software (random replacement). Other machines use hardware.
[Figure: a virtual address (page | off) is looked up in the TLB, whose entries hold the physical frame address for an ASID, with the Page Table behind it; the result is the physical address (frame | off).]
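A minimal sketch of a TLB sitting in front of a page table, using random replacement as in the software-managed MIPS case mentioned above. The class layout, the dictionary page table, the 12-bit page offset, and raising `LookupError` for a page fault are all illustrative assumptions; ASIDs are ignored.

```python
import random

class TLB:
    """Toy TLB caching entries of a page table given as {virtual page: frame}."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def translate(self, va, page_bits=12):
        vpn = va >> page_bits
        offset = va & ((1 << page_bits) - 1)
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise LookupError("page fault")          # the OS would handle this
            if len(self.entries) >= self.capacity:
                victim = random.choice(list(self.entries))
                del self.entries[victim]                 # random replacement
            self.entries[vpn] = self.page_table[vpn]     # refill from the page table
        return (self.entries[vpn] << page_bits) | offset

tlb = TLB(capacity=2, page_table={0: 5, 1: 9})
print(hex(tlb.translate(0x0010)), hex(tlb.translate(0x0014)), tlb.hits, tlb.misses)
```

The second access falls in the same page as the first, so it is translated from the TLB without consulting the page table.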
Virtually Indexed, Physically Tagged Cache
What motivation?
• Fast cache hit by parallel TLB access
• No virtual cache shortcomings
How could it be correct?
• Require cache way size <= page size; now physical index is from page offset
• Then virtual and physical indices are identical ⇒ works like a physically indexed cache!
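The way-size condition can be checked numerically. In this sketch the function names, the 64-byte block size, and the two example addresses (which share their low 12 bits, as a virtual address and its translation do with 4KB pages) are assumptions made for illustration.

```python
def vipt_is_safe(cache_size, associativity, page_size):
    """True when one cache way fits in a page, so index bits lie in the page offset."""
    way_size = cache_size // associativity
    return way_size <= page_size

def cache_index(addr, way_size, block_size=64):
    """Set index selected by an address for a given way size."""
    return (addr % way_size) // block_size

va, pa = 0x12345678, 0xABCDE678      # hypothetical pair with the same 12-bit page offset

print(vipt_is_safe(32 * 1024, 8, 4096))                 # 4KB ways
print(cache_index(va, 4096), cache_index(pa, 4096))     # indices agree
print(vipt_is_safe(32 * 1024, 2, 4096))                 # 16KB ways
print(cache_index(va, 16384), cache_index(pa, 16384))   # indices can differ
```

When a way is larger than a page, some index bits come from the translated part of the address, so the virtual and physical indices are no longer guaranteed to match.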
Summary #1/3: The Cache Design Space
• Several interacting dimensions
– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back
– write allocation
• The optimal choice is a compromise
– depends on access characteristics
» workload
» use (I-cache, D-cache, TLB)
– depends on technology / cost
• Simplicity often wins
[Figure: the design space drawn along Associativity, Cache Size, and Block Size axes, with a Good/Bad trade-off sketch as Factor A and Factor B go from Less to More]
Summary #2/3: Caches
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
» Temporal Locality: Locality in Time
» Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
– Compulsory Misses: sad facts of life. Example: cold start misses.
– Capacity Misses: increase cache size
– Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
• Write Policy: Write Through vs. Write Back
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): affects Compilers, Data structures, and Algorithms
Summary #3/3: TLB, Virtual Memory
• Page tables map virtual address to physical address
• TLBs are important for fast translation
• TLB misses are significant in processor performance
– funny times, as most systems can’t access all of 2nd level cache without TLB misses!
• Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:
1) Where can block be placed?
2) How is block found?
3) What block is replaced on miss?
4) How are writes handled?
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy benefits, but computers are still insecure
• Prepare for debate + quiz on Wednesday
Summary of Virtual Machine Monitor
• Virtual Machine Revival
– Overcome security flaws of modern OSes
– Processor performance no longer highest priority
– Manage Software, Manage Hardware
• “… VMMs give OS developers another opportunity to develop functionality no longer practical in today’s complex and ossified operating systems, where innovation moves at geologic pace.” [Rosenblum and Garfinkel, 2005]
• Virtualization challenges for processor, virtual memory, I/O
– Paravirtualization, ISA upgrades to cope with those difficulties
• Xen as example VMM using paravirtualization
– 2005 performance on non-I/O bound, I/O intensive apps: 80% of native Linux without driver VM, 34% with driver VM
• Opteron memory hierarchy still critical to performance
Disk Device Performance
[Figure: disk anatomy: platter, spindle, arm, actuator, head, controller; sector, inner track, outer track]
• Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead
• Seek Time? Depends on the number of tracks the arm must move and the seek speed of the disk
• Rotation Time? Depends on how fast the disk rotates and how far the sector is from the head
• Transfer Time? Depends on the data rate (bandwidth) of the disk (bit density) and the size of the request
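The disk latency sum can be worked through numerically. This sketch assumes the average rotation time is half a revolution, a common approximation that the slide does not state, and every parameter value below is made up for illustration.

```python
def disk_latency_ms(seek_ms, rpm, request_bytes, transfer_mb_per_s, controller_ms):
    """Seek + rotation + transfer + controller overhead, in milliseconds.

    Assumes the average rotational delay is half a revolution.
    """
    rotation_ms = 0.5 * 60_000 / rpm                                  # half a turn
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotation_ms + transfer_ms + controller_ms

# Assumed disk: 5 ms seek, 10,000 RPM, 40 MB/s, 0.2 ms controller overhead,
# servicing a 4096-byte request.
print(disk_latency_ms(5.0, 10_000, 4096, 40, 0.2))
```

For a small request the mechanical terms (seek and rotation) dominate and the transfer time is almost negligible.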
Redundant Arrays of (Inexpensive) Disks
• Files are "striped" across multiple disks
• Redundancy yields high data availability
– Availability: service still provided to user, even if some components failed
• Disks will still fail
• Contents reconstructed from data redundantly stored in the array
– Capacity penalty to store redundant info
– Bandwidth penalty to update redundant info
Summary: RAID Techniques: Goal was performance, popularity due to reliability of storage
• Disk Mirroring, Shadowing (RAID 1)
– Each disk is fully duplicated onto its "shadow"
– Logical write = two physical writes
– 100% capacity overhead
• Parity Data Bandwidth Array (RAID 3)
– Parity computed horizontally
– Logically a single high data bw disk
• High I/O Rate Parity Array (RAID 5)
– Interleaved parity blocks
– Independent reads and writes
– Logical write = 2 reads + 2 writes
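A minimal sketch of XOR parity, the mechanism behind the parity arrays listed above, using one byte per disk. The byte values and function names are illustrative. It shows reconstruction of a lost block and why a RAID 5 logical write costs 2 reads + 2 writes.

```python
from functools import reduce

def parity(blocks):
    """XOR of all blocks in a stripe."""
    return reduce(lambda a, b: a ^ b, blocks)

def reconstruct(surviving_blocks, parity_block):
    """Recover the one missing data block from the survivors and the parity."""
    return parity(surviving_blocks + [parity_block])

def small_write_parity(old_data, old_parity, new_data):
    """New parity for a RAID 5 small write.

    Reads: old data, old parity. Writes: new data, new parity.
    """
    return old_parity ^ old_data ^ new_data

stripe = [0b10010011, 0b11001101, 0b10010011, 0b00110010]   # hypothetical data bytes
p = parity(stripe)

# The disk holding stripe[1] fails: rebuild it from the other three plus parity.
print(bin(reconstruct([stripe[0], stripe[2], stripe[3]], p)))

# Overwrite stripe[0] without touching the other data disks.
new_p = small_write_parity(stripe[0], p, 0b00001111)
print(new_p == parity([0b00001111] + stripe[1:]))
```

The small-write shortcut avoids reading the whole stripe, at the cost of reading the old data and old parity before each update.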