THE MEMORY HIERARCHY
Jehan-François Pâris (jfparis@uh.edu)
Chapter Organization
• Technology overview
• Caches
  – Cache associativity, write through and write back, …
• Virtual memory
  – Page table organization, the translation lookaside buffer (TLB), page fault handling, memory protection
• Virtual machines
• Cache consistency
TECHNOLOGY OVERVIEW
Dynamic RAM
• Standard solution for main memory since the 70's
  – Replaced magnetic core memory
• Bits are stored on capacitors
  – Charged state represents a one
• Capacitors discharge
  – Must be dynamically refreshed
  – Achieved by accessing each cell several thousand times each second
Dynamic RAM
[Figure: a DRAM cell, consisting of an nMOS transistor and a capacitor to ground; the row select line drives the transistor gate and the column select line is connected to its source]
The role of the nMOS transistor
[Figure: the row select line is the gate, the column select line is the source, and the capacitor sits on the drain side]
• When the gate is positive with respect to the ground, electrons are attracted to the gate (the "field effect") and current can go through
• Normally, no current can go from the source to the drain
Not on the exam
Magnetic disks
[Figure: a disk drive with platters, read/write heads mounted on an arm, and a servo]
Magnetic disk (I)
• Data are stored into circular tracks
• Tracks are partitioned into a variable number of fixed-size sectors
• If the disk drive has more than one platter, all tracks corresponding to the same position of the R/W head form a cylinder
Magnetic disk (II)
• Disk spins at a speed varying between
  – 5,400 rpm (laptops) and
  – 15,000 rpm (Seagate Cheetah X15, …)
• Accessing data requires
  – Positioning the head on the right track: seek time
  – Waiting for the data to reach the R/W head: on the average half a rotation
Disk access times
• Dominated by seek time and rotational delay
• We try to reduce seek times by placing all data that are likely to be accessed together on nearby tracks or same cylinder
• Cannot do as much for rotational delay– On the average half a rotation
Average rotational delay
RPM      Delay (ms)
5,400    5.6
7,200    4.2
10,000   3.0
15,000   2.0
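As a sanity check, the average rotational delay is the time for half a revolution; a minimal sketch in Python (the values above, rounded to one decimal):

    # average rotational delay = time for half a revolution, in milliseconds
    def avg_rotational_delay_ms(rpm):
        return 60.0 / rpm / 2 * 1000

    for rpm in (5_400, 7_200, 10_000, 15_000):
        print(f"{rpm:>6} rpm -> {avg_rotational_delay_ms(rpm):.1f} ms")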
Overall performance
• Disk access times are still dominated by rotational latency
  – Were 8-10 ms in the late 70's, when rotational speeds were 3,000 to 3,600 RPM
• Disk capacities and maximum transfer rates have done much better
  – Pack many more tracks per platter
  – Pack many more bits per track
The internal disk controller
• Printed circuit board attached to the disk drive
  – As powerful as the CPU of a personal computer of the early 80's
• Functions include
  – Speed buffering
  – Disk scheduling
  – …
Reliability issues
• Disk drives have more reliability issues than most other computer components
  – Moving parts eventually wear
  – Infant mortality
  – It would be too costly to produce perfect magnetic surfaces
• Disks have bad blocks
Disk failure rates
• Failure rates follow a bathtub curve
  – High infantile mortality
  – Low failure rate during useful life
  – Higher failure rates as disks wear out
Disk failure rates (II)
[Figure: bathtub curve of failure rate versus time, showing the infantile mortality, useful life, and wearout phases]
Disk failure rates (III)
• Infant mortality effect can last for months for disk drives
• Cheap ATA disk drives seem to age less gracefully than SCSI drives
MTTF
• Disk manufacturers advertise very high Mean Times To Fail (MTTF) for their products
  – 500,000 to 1,000,000 hours, that is, 57 to 114 years
• Does not mean that a disk will last that long!
• Means that disks will fail at an average rate of one failure per 500,000 to 1,000,000 hours during their useful life
More MTTF Issues (I)
• Manufacturers' claims are not supported by solid experimental evidence
• Obtained by submitting disks to a stress test at high temperature and extrapolating the results to ideal conditions
  – Procedure raises many issues
More MTTF Issues (II)
• Failure rates observed in the field are much higher
  – Can go up to 8 to 9 percent per year
• Corresponding MTTFs are 11 to 12.5 years
• If we have 100 disks and an MTTF of 12.5 years, we can expect an average of 8 disk failures per year
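A minimal sketch of that arithmetic, assuming failures arrive at a constant rate of 1/MTTF during the useful-life period:

    # expected failures per year for a population of disks
    def expected_failures_per_year(n_disks, mttf_years):
        return n_disks / mttf_years

    print(expected_failures_per_year(100, 12.5))              # field MTTF      -> 8.0 per year
    print(expected_failures_per_year(100, 1_000_000 / 8760))  # advertised MTTF -> ~0.9 per year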
Bad blocks (I)
• Also known as
  – Irrecoverable read errors
  – Latent sector errors
• Can be caused by
  – Defects in the magnetic substrate
  – Problems during the last write
Bad blocks (II)
• The disk controller uses a redundant encoding that can detect and correct many errors
• When the internal disk controller detects a bad block, it
  – Marks it as unusable
  – Remaps the logical block address of the bad block to spare sectors
• Each disk is extensively tested during a burn-in period before being released
The memory hierarchy (I)
Level   Device                           Access time
1       Fastest registers (2 GHz CPU)    0.5 ns
2       Main memory                      10-60 ns
3       Secondary storage (disk)         7 ms
4       Mass storage (CD-ROM library)    a few s
The memory hierarchy (II)
• To make sense of these numbers, let us consider an analogy
Writing a paper

Level   Resource            Access time
1       Open book on desk   1 s
2       Book on desk        20-140 s
3       Book in library     162 days
4       Book far away       63 years
Major issues
• Huge gaps between
  – CPU speeds and SDRAM access times
  – SDRAM access times and disk access times
• The two problems have very different solutions
  – Gap between CPU speeds and SDRAM access times handled by hardware
  – Gap between SDRAM access times and disk access times handled by a combination of software and hardware
Why?
• Having hardware handle an issue
  – Complicates hardware design
  – Offers a very fast solution
  – Standard approach for very frequent actions
• Letting software handle an issue
  – Cheaper
  – Has a much higher overhead
  – Standard approach for less frequent actions
Will the problem go away?
• It will become worse
  – RAM access times are not improving as fast as CPU power
  – Disk access times are limited by the rotational speed of the disk drive
What are the solutions?
• To bridge the CPU/DRAM gap:
  – Interpose between the CPU and the DRAM smaller, faster memories that cache the data the CPU currently needs
    • Cache memories
    • Managed by the hardware and invisible to the software (OS included)
What are the solutions?
• To bridge the DRAM/disk drive gap:
  – Store in main memory the data blocks that are currently accessed (I/O buffer)
  – Manage memory space and disk space as a single resource (virtual memory)
• The I/O buffer and virtual memory are managed by the OS and invisible to the user processes
Why do these solutions work?
• Locality principle:
  – Spatial locality: at any time a process only accesses a small portion of its address space
  – Temporal locality: this subset does not change too frequently
Can we think of examples?
• The way we write programs• The way we act in everyday life
– …
CACHING
The technology
• Caches use faster static RAM (SRAM)
  – Similar organization as that of D flip-flops
• Can have
  – Separate caches for instructions and data
    • Great for pipelining
  – A unified cache
A little story (I)
• Consider a closed-stack library
  – Customers bring book requests to the circulation desk
  – Librarians go to the stacks to fetch the requested book
• This solution is used in national libraries
  – Costlier than the open-stack approach
  – Much better control of assets
A little story (II)
• Librarians have noted that some books get asked for again and again
  – Want to put them closer to the circulation desk
    • Would result in much faster service
• The problem is how to locate these books
  – They will not be at the right location!
A little story (III)
• Librarians come up with a great solution
  – They put behind the circulation desk shelves with 100 book slots numbered from 00 to 99
  – Each slot is a home for the most recently requested book that has a call number whose last two digits match the slot number
    • 3141593 can only go in slot 93
    • 1234567 can only go in slot 67
A little story (IV)
The call number of the book I need is 3141593
Let me see if it's in bin 93
A little story (V)
• To let the librarian do her job, each slot must contain either
  – Nothing, or
  – A book and its call number
• There are many books whose call number ends in 93, or in any two given digits
A little story (VI)
Could I get this time the book whose call number is 4444493?
Sure
A little story (VII)
• This time the librarian will
  – Go to bin 93
  – Find it contains a book with a different call number
• She will
  – Bring back that book to the stacks
  – Fetch the new book
Basic principles
• Assume we want to store in a faster memory the 2^n words that are currently accessed by the CPU
  – Can be instructions or data or even both
• When the CPU needs to fetch an instruction or load a word into a register
  – It will look first into the cache
  – Can have a hit or a miss
Cache hits
• Occur when the requested word is found in the cache
  – The cache avoided a memory access
  – The CPU can proceed
Cache misses
• Occur when the requested word is not found in the cache
  – Will need to access the main memory
  – Will bring the new word into the cache
    • Must make space for it by expelling one of the cache entries
      – Need to decide which one
Handling writes (I)
• When the CPU has to store the contents of a register into main memory
  – The write will update the cache
• If the modified word is already in the cache
  – Everything is fine
• Otherwise
  – Must make space for it by expelling one of the cache entries
Handling writes (II)
• Two ways to handle writes
  – Write through:
    • Each write updates both the cache and the main memory
  – Write back:
    • Writes are not propagated to the main memory until the updated word is expelled from the cache
Handling writes (III)
[Figure: with write through, the CPU's writes go to both the cache and the RAM; with write back, the CPU's writes go to the cache and the RAM is only updated later]
Pros and cons
• Write through:
  – Ensures that memory is always up to date
    • Expelled cache entries can simply be overwritten
• Write back:
  – Faster writes
  – Complicates the cache expulsion procedure
    • Must write back cache entries that have been modified in the cache
Picking the right solution
• Caches use write through:
  – Provides simpler cache expulsions
  – Can minimize the write-through overhead with additional circuitry
• I/O buffers and virtual memory use write back:
  – The write-through overhead would be too high
A better write through (I)
• Add a small write buffer to speed up the write performance of write-through caches
  – At least four words
• Holds modified data until they are written into main memory
  – The cache can proceed as soon as the data are written into the write buffer
A better write through (II)
[Figure: plain write through sends each write from the CPU to the cache and the RAM; the better write through inserts a write buffer between the cache and the RAM so the CPU can proceed immediately]
A very basic cache
• Has 2^n entries
• Each entry contains
  – A word (4 bytes)
  – Its RAM address
    • Sole way to identify the word
  – A bit indicating whether the cache entry contains something useful
A very basic cache (I)
[Figure: a cache with eight entries, each holding a valid bit (Y/N), a tag (the RAM address), and the contents (the word); actual caches are much bigger]
A very basic cache (II)
[Figure: the same eight-entry cache with its entries indexed 000 to 111]
Comments (I)
• The cache organization we have presented is nothing but the hardware implementation of a hash table
• Each entry has
  – a key: the word address
  – a value: the word contents plus a valid bit
Comments (II)
• The hash function is
    h(k) = (k/4) mod N
  where k is the key (the word address) and N is the number of cache entries
  – Can be computed very fast
• Unlike conventional hash tables, this organization has no provision for handling collisions
  – Uses expulsion to resolve collisions
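To make the hash-table analogy concrete, here is a minimal sketch of a direct-mapped cache of single words; the class and method names are illustrative, not part of the slides:

    class DirectMappedCache:
        def __init__(self, n_entries):
            self.n = n_entries
            self.entries = [(False, None, None)] * n_entries   # (valid, tag, word)

        def _index_and_tag(self, address):
            word_number = address // 4                 # drop the two byte-offset bits
            return word_number % self.n, word_number // self.n

        def read(self, address):
            index, tag = self._index_and_tag(address)
            valid, stored_tag, word = self.entries[index]
            if valid and stored_tag == tag:
                return word                            # hit
            return None                                # miss: fetch from RAM, then fill

        def fill(self, address, word):
            index, tag = self._index_and_tag(address)
            self.entries[index] = (True, tag, word)    # expels whatever was there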
Managing the cache
• Each word fetched into the cache can occupy a single cache location
  – Specified by bits n+1 to 2 of its address
• Two words with the same bits n+1 to 2 cannot be in the cache at the same time
  – Happens whenever the addresses of the two words differ by a multiple of 2^(n+2)
Example
• Assume the cache can contain 8 words
• If word 48 is in the cache it will be stored at cache index (48/4) mod 8 = 12 mod 8 = 4
• In our case 2^(n+2) = 2^(3+2) = 32
• The only possible cache index for word 80 would also be (80/4) mod 8 = 20 mod 8 = 4
• Same for words 112, 144, 176, …
Saving cache space
• We do not need to store the whole address of each word in the cache
  – Bits 1 and 0 will always be zero
  – Bits n+1 to 2 can be inferred from the cache index
    • If the cache has 8 entries, bits 4 to 2
• Will only store in the tag the remaining bits of the address
A very basic cache (III)
[Figure: the cache uses bits 4 to 2 of the word address as the index; each of the eight entries holds a valid bit, a tag made of bits 31:5 of the address, and the word]
Storing a new word in the cache
• The location of the new word's entry is obtained from the LSBs of the word address
  – Discard the 2 LSBs
    • Always zero for a well-aligned word
  – The n next LSBs give the cache index for a cache of size 2^n
  – The remaining MSBs of the word address form the tag
Accessing a word in the cache (I)
• Start with the word address
• Remove the two least significant bits
  – Always zero
Accessing a word in the cache (II)
• Split the remainder of the address into
  – The n least significant bits
    • The word's index in the cache
  – The cache tag (the remaining bits)
Towards a better cache
• Our cache takes into account the temporal locality of accesses
  – Repeated accesses to the same location
• But not their spatial locality
  – Accesses to neighboring locations
• Cache space is poorly used
  – Need 27 + 1 bits of overhead (tag plus valid bit) to store 32 bits of data in our 8-entry cache
Multiword cache (I)
• Each cache entry will contain a block of 2, 4, 8, … words with consecutive addresses
  – Will require words to be well aligned
    • A pair of words should start at an address that is a multiple of 2×4 = 8
    • A group of four words should start at an address that is a multiple of 4×4 = 16
Multiword cache (II)
[Figure: a cache with eight entries indexed 000 to 111; each entry holds a valid bit, a tag made of bits 31:6 of the address, and a block of two words]
Multiword cache (III)
• Has 2^n entries, each containing 2^m words
• Each entry contains
  – 2^m words
  – A tag
  – A bit indicating whether the cache entry contains useful data
Storing a new word in the cache
• The location of the new word's entry is obtained from the LSBs of the word address
  – Discard the 2 + m LSBs
    • Always zero for a well-aligned group of words
  – Take the n next LSBs as the cache index for a cache of size 2^n
  – The remaining MSBs of the address form the tag
Example
• Assume
  – The cache can contain 8 entries
  – Each block contains 2 words
• Words 48 and 52 belong to the same block
  – If word 48 is in the cache it will be stored at cache index (48/8) mod 8 = 6 mod 8 = 6
  – If word 52 is in the cache it will be stored at cache index (52/8) mod 8 = 6 mod 8 = 6
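A minimal sketch of the index and tag computation for this cache (8 entries, 8-byte blocks):

    BLOCK_BYTES, N_ENTRIES = 8, 8     # 2 words of 4 bytes, 8 cache entries

    def index_and_tag(byte_address):
        block_number = byte_address // BLOCK_BYTES
        return block_number % N_ENTRIES, block_number // N_ENTRIES

    print(index_and_tag(48))   # -> (6, 0)
    print(index_and_tag(52))   # -> (6, 0), same block as word 48
    print(index_and_tag(80))   # -> (2, 1)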
Selecting the right block size
• Larger block sizes improve the performance of the cache
  – Allow us to exploit spatial locality
• Three limitations
  – The spatial locality effect is less pronounced if the block size exceeds 128 bytes
  – Too many collisions in very small caches
  – Large blocks take more time to be fetched into the cache
[Figure: miss rate (0% to 40%) versus block size (16 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]
Collision effect in small cache
• Consider a 4 KB cache
  – If the block size is 16 B, that is, 4 words, the cache will have 256 blocks
  – …
  – If the block size is 128 B, that is, 32 words, the cache will have 32 blocks
    • Too many collisions
Problem
• Consider a very small cache with 8 entries and a block size of 8 bytes (2 words)
  – Which words will be fetched into the cache when the CPU accesses the words at addresses 32, 48, 60 and 80?
  – How will these words be stored in the cache?
Solution (I)
• Since the block size is 8 bytes
  – The 3 LSBs of the address are used to address one of the 8 bytes in a block
• Since the cache holds 8 blocks
  – The next 3 LSBs of the address are used as the cache index
• As a result, the tag has 32 – 3 – 3 = 26 bits
Solution (II)
• Consider the word at address 32
• Cache index is (32/2^3) mod 2^3 = (32/8) mod 8 = 4
• Block tag is 32/2^6 = 32/64 = 0

Row 4   Tag=0   32 33 34 35 36 37 38 39
Solution (III)
• Consider the word at address 48
• Cache index is (48/8) mod 8 = 6
• Block tag is 48/64 = 0

Row 6   Tag=0   48 49 50 51 52 53 54 55
Solution (IV)
• Consider the word at address 60
• Cache index is (60/8) mod 8 = 7
• Block tag is 60/64 = 0

Row 7   Tag=0   56 57 58 59 60 61 62 63
Solution (V)
• Consider the word at address 80
• Cache index is (80/8) mod 8 = 10 mod 8 = 2
• Block tag is 80/64 = 1

Row 2   Tag=1   80 81 82 83 84 85 86 87
Set-associative caches (I)
• Can be seen as 2, 4, or 8 direct-mapped caches attached together
• Reduces collisions
Set-associative caches (II)
[Figure: a two-way set-associative cache shown as two direct-mapped banks side by side; each bank has eight entries indexed 000 to 111, and each entry holds a valid bit, a tag (bits 31:5), and a block]
Set-associative caches (III)
• Advantage:
  – We take care of more collisions
    • Like a hash table with a fixed bucket size
  – Results in lower miss rates than direct-mapped caches
• Disadvantage:
  – Slower access
• Best solution when the miss penalty is very big
Fully associative caches
• The dream!
• A block can occupy any position in the cache
• Requires an associative memory
  – Content-addressable
  – Like our brain!
• Remains a dream
Designing RAM to support caches
• RAM is connected to the CPU through a "bus"
  – Its clock rate is much slower than the CPU clock rate
• Assume that a RAM access takes
  – 1 bus clock cycle to send the address
  – 15 bus clock cycles to initiate a read
  – 1 bus clock cycle to send a word of data
Designing RAM to support caches
• Assume
  – The cache block size is 4 words
  – A one-word bank of DRAM
• Fetching a cache block would take 1 + 4×15 + 4×1 = 65 bus clock cycles
  – The transfer rate is 0.25 byte/bus cycle
• Awful!
Designing RAM to support caches
• Could
  – Double the bus width (from 32 to 64 bits)
  – Have a two-word bank of DRAM
• Fetching a cache block would take 1 + 2×15 + 2×1 = 33 bus clock cycles
  – The transfer rate is 0.48 byte/bus cycle
• Much better
• Costly solution
Designing RAM to support caches
• Could
  – Have an interleaved memory organization
  – Four one-word banks of DRAM
  – A 32-bit bus
[Figure: the CPU is connected through a 32-bit bus to four one-word RAM banks, banks 0 to 3]
Designing RAM to support caches
• Can do the 4 accesses in parallel
• Must still transmit the block 32 bits at a time
• Fetching a cache block would take 1 + 15 + 4×1 = 20 bus clock cycles
  – The transfer rate is 0.80 byte/bus cycle
• Even better
• Much cheaper than having a 64-bit bus
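A minimal sketch that reproduces the three calculations above (the cycle counts are the slide's assumptions):

    BLOCK_WORDS, BYTES_PER_WORD = 4, 4
    ADDR, READ, XFER = 1, 15, 1        # bus cycles to send the address, initiate a read, send a word

    def sequential_cost(words_per_access):
        accesses = BLOCK_WORDS // words_per_access
        return ADDR + accesses * READ + accesses * XFER

    def interleaved_cost():
        # the four banks read in parallel; words still cross the 32-bit bus one at a time
        return ADDR + READ + BLOCK_WORDS * XFER

    for name, cycles in [("one-word bank", sequential_cost(1)),
                         ("two-word bank, 64-bit bus", sequential_cost(2)),
                         ("four interleaved banks", interleaved_cost())]:
        print(name, cycles, "cycles,",
              round(BLOCK_WORDS * BYTES_PER_WORD / cycles, 2), "bytes/cycle")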
ANALYZING CACHE PERFORMANCE
Memory stalls
• Can divide CPU time into
  – N_EXEC clock cycles spent executing instructions
  – N_MEM_STALLS cycles spent waiting for memory accesses
• We have
    CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE
Memory stalls
• We assume that
  – Cache access times can be neglected
  – Most CPU cycles spent waiting for memory accesses are caused by cache misses
• Distinguishing between read stalls and write stalls
    N_MEM_STALLS = N_RD_STALLS + N_WR_STALLS
Read stalls
• Fairly simple
    N_RD_STALLS = N_MEM_RD × Read miss rate × Read miss penalty
Write stalls (I)
• Two causes of delays
  – Must fetch missing blocks before updating them
    • We update at most 8 bytes of the block!
  – Must take into account the cost of write through
    • The buffering delay depends on the proximity of the writes, not on the number of cache misses
      – Writes too close to each other
Write stalls (II)
• We have
    N_WR_STALLS = N_WRITES × Write miss rate × Write miss penalty + N_WR_BUFFER_STALLS
• In practice, there are very few buffer stalls if the buffer contains at least four words
Global impact
• We have
    N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
• and also
    N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) × Cache miss penalty
Example
• Miss rate of the instruction cache is 2 percent
• Miss rate of the data cache is 4 percent
• In the absence of memory stalls, each instruction would take 2 cycles
• Miss penalty is 100 cycles
• 36 percent of instructions access the main memory
• How many cycles are lost due to cache misses?
Solution (I)
• Impact of instruction cache misses
    0.02×100 = 2 cycles/instruction
• Impact of data cache misses
    0.36×0.04×100 = 1.44 cycles/instruction
• Total impact of cache misses
    2 + 1.44 = 3.44 cycles/instruction
Solution (II)
• Average number of cycles per instruction
    2 + 3.44 = 5.44 cycles/instruction
• Fraction of time wasted
    3.44/5.44 = 63 percent
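A minimal sketch of this computation; the same function also answers the problem on the next slide (parameter names are mine):

    def miss_impact(i_miss_rate, d_miss_rate, base_cpi, penalty, mem_fraction):
        stalls = i_miss_rate * penalty + mem_fraction * d_miss_rate * penalty
        cpi = base_cpi + stalls
        return stalls, cpi, stalls / cpi     # cycles lost, total CPI, fraction of time wasted

    print(miss_impact(0.02, 0.04, 2, 100, 0.36))   # -> (3.44, 5.44, 0.63…)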
Problem
• Redo the example with the following data
  – Miss rate of the instruction cache is 3 percent
  – Miss rate of the data cache is 5 percent
  – In the absence of memory stalls, each instruction would take 2 cycles
  – Miss penalty is 100 cycles
  – 40 percent of instructions access the main memory
Solution
• The fraction of time wasted to memory stalls is 71 percent
Average memory access time
• Some authors call it AMAT
    T_AVERAGE = T_CACHE + f × T_MISS
  where f is the cache miss rate
• Times can be expressed
  – In nanoseconds
  – In number of cycles
Example
• A cache has a hit rate of 96 percent
• Accessing data
  – In the cache requires one cycle
  – In the memory requires 100 cycles
• What is the average memory access time?
Solution
• Miss rate = 1 – Hit rate = 0.04
• Applying the formula
    T_AVERAGE = 1 + 0.04×100 = 5 cycles
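A one-line sketch of the AMAT formula, checked against both hit rates used in this section:

    def amat(t_cache, miss_rate, t_miss):
        return t_cache + miss_rate * t_miss

    print(amat(1, 0.04, 100))   # 96 percent hit rate -> 5 cycles
    print(amat(1, 0.02, 100))   # 98 percent hit rate -> 3 cycles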
Impact of a better hit rate
• What would be the impact of improving the hit rate of the cache from 96 to 98 percent?
Solution
• New miss rate = 1 – New hit rate = 0.02• Applying the formula
TAVERAGE = 1 + 0.02×100 = 3 cycles
When the hit rate is above 80 percent, small improvements in the hit rate result in much larger relative improvements in the miss rate
Examples
• Old hit rate: 80 percent; new hit rate: 90 percent
  – Miss rate goes from 20 to 10 percent!
• Old hit rate: 94 percent; new hit rate: 98 percent
  – Miss rate goes from 6 to 2 percent!
In other words
It's the miss rate, stupid!
Improving cache hit rate
• Two complementary techniques
  – Using set-associative caches
    • Must check the tags of all blocks with the same index value
      – Slower
    • Have fewer collisions
      – Fewer misses
  – Using a cache hierarchy
A cache hierarchy (I)
[Figure: the CPU is backed by an L1 cache; L1 misses go to L2, L2 misses go to L3, and L3 misses go to RAM]
A cache hierarchy
• Topmost cache– Optimized for speed, not miss rate– Rather small– Uses a small block size
• As we go down the hierarchy– Cache sizes increase– Block sizes increase– Cache associativity level increases
Example
• Cache miss rate per instruction is 2 percent
• In the absence of memory stalls, each instruction would take one cycle
• Cache miss penalty is 100 ns
• Clock rate is 4 GHz
• How many cycles are lost due to cache misses?
Solution (I)
• Duration of a clock cycle
    1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
• Cache miss penalty
    100 ns = 400 cycles
• Total impact of cache misses
    0.02×400 = 8 cycles/instruction
Solution (II)
• Average number of cycles per instruction
    1 + 8 = 9 cycles/instruction
• Fraction of time wasted
    8/9 = 89 percent
Example (cont'd)
• How much faster would the processor be if we added an L2 cache that
  – Has a 5 ns access time
  – Would reduce the miss rate to main memory to 0.5 percent?
• Will see later how to get that
Solution (I)
• L2 cache access time
    5 ns = 20 cycles
• Impact of cache misses per instruction
    L1 cache misses + L2 cache misses = 0.02×20 + 0.005×400 = 0.4 + 2.0 = 2.4 cycles/instruction
• Average number of cycles per instruction
    1 + 2.4 = 3.4 cycles/instruction
Solution (II)
• Fraction of time wasted
    2.4/3.4 = 71 percent
• CPU speedup
    9/3.4 = 2.6
How to get the 0.005 miss rate
• The wanted miss rate corresponds to a combined cache hit rate of 99.5 percent
• Let H1 be the hit rate of the L1 cache and H2 the hit rate of the L2 cache
• The combined hit rate of the cache hierarchy is
    H = H1 + (1 – H1)×H2
How to get the 0.005 miss rate
• We have
    0.995 = 0.98 + 0.02×H2
• H2 = (0.995 – 0.98)/0.02 = 0.75
  – Quite feasible!
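A minimal sketch of that algebra:

    def combined_hit_rate(h1, h2):
        return h1 + (1 - h1) * h2

    def required_h2(h_combined, h1):
        return (h_combined - h1) / (1 - h1)

    print(required_h2(0.995, 0.98))        # -> 0.75
    print(combined_hit_rate(0.98, 0.75))   # -> 0.995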
Can we do better? (I)
• Keep the 98 percent hit rate for the L1 cache
• Raise the hit rate of the L2 cache to 85 percent
  – The L2 cache is now slower: 6 ns, that is, 24 cycles
• Impact of cache misses per instruction
    L1 cache misses + L2 cache misses = 0.02×24 + 0.02×0.15×400 = 0.48 + 1.2 = 1.68 cycles/instruction
The verdict
• Fraction of time wasted
    1.68/2.68 = 63 percent
• CPU speedup
    9/2.68 = 3.36
Would a faster L2 cache help?
• Redo the example assuming
  – The hit rate of the L1 cache is still 98 percent
  – A new, faster L2 cache
    • Access time reduced to 3 ns
    • Hit rate of only 50 percent
The verdict
• Impact of cache misses per instruction
    0.02×12 + 0.02×0.5×400 = 0.24 + 4 = 4.24 cycles/instruction, for a total of 5.24 cycles/instruction
• Fraction of time wasted
    4.24/5.24 = 81 percent
• CPU speedup
    9/5.24 = 1.72
The new L2 cache with a lower access time but a higher miss rate performs much worse than the original L2 cache
Cache replacement policy
• Not an issue in direct mapped caches– We have no choice!
• An issue in set-associative caches
  – The best policy is least recently used (LRU)
    • Expels from the cache a block in the same set as the incoming block
    • Picks the block that has not been accessed for the longest period of time
Implementing LRU policy
• Easy when each set contains two blocks
  – We attach to each block a use bit that is
    • Set to 1 when the block is accessed
    • Reset to 0 when the other block is accessed
  – We expel the block whose use bit is 0
• Much more complicated for higher associativity levels
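A minimal sketch of that two-way scheme; the class and method names are illustrative:

    class TwoWaySet:
        def __init__(self):
            self.tags = [None, None]   # the block (tag) held by each way
            self.use = [0, 0]          # one use bit per way

        def access(self, tag):
            for way in (0, 1):
                if self.tags[way] == tag:                  # hit
                    self.use[way], self.use[1 - way] = 1, 0
                    return "hit"
            victim = self.use.index(0)                     # expel the way whose use bit is 0
            self.tags[victim] = tag
            self.use[victim], self.use[1 - victim] = 1, 0
            return "miss"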
REALIZATIONS
Caching in a multicore organization
• Multicore organizations often involve multiple chips
  – Say four chips with four cores per chip
• Have a cache hierarchy on each chip
  – L1, L2, L3
  – Some caches are private, others are shared
• Accessing a cache on the same chip is much faster than accessing a cache on another chip
AMD 16-core system (I)
• AMD 16-core system
  – Sixteen cores on four chips
• Each core has a 64-KB L1 and a 512-KB L2 cache
• Each chip has a 2-MB shared L3 cache
[Figure: access latencies and bandwidths labeled X/Y, where X is the latency in cycles and Y the bandwidth in bytes/cycle]
AMD 16-core system (II)
• Observe that access times are non-uniform
  – Takes more time to access the L1 or L2 cache of another core than to access the shared L3 cache
  – Takes more time to access caches on another chip than local caches
  – Access times and bandwidths depend on the chip interconnect topology
VIRTUAL MEMORY
Main objective (I)
• To allow programmers to write programs that reside
  – partially in main memory
  – partially on disk
Main objective (II)
[Figure: the process address space is larger than main memory; only part of it is resident at any time]
Motivation
• Most programs do not access their whole address space at the same time
• Compilers go through several phases
  – Lexical analysis
  – Preprocessing (C, C++)
  – Syntactic analysis
  – Semantic analysis
  – …
Advantages (I)
• VM allows programmers to write programs that would not otherwise fit in main memory
  – They will run, although much more slowly
  – Very important in the 70's and 80's
• VM allows the OS to allocate main memory much more efficiently
  – Do not waste precious memory space
  – Still important today
Advantages
• VM lets programmers use
  – Sparsely populated
  – Very large address spaces
Sparsely populated address spaces
• Let programmers put different items apart from each other
  – Code segment
  – Data segment
  – Stack
  – Shared library
  – Mapped files
• Wait until you take 4330 to study this
Big difference with caching
• The miss penalty is much bigger
  – Around 5 ms
  – Assuming a memory access time of 50 ns, 5 ms equals 100,000 memory accesses
  – For caches, the miss penalty was around 100 cycles
Consequences
• Will use much larger block sizes
  – Blocks, here called pages, measure 4 KB, 8 KB, … with 4 KB an unofficial standard
• Will use fully associative mapping to reduce misses, here called page faults
• Will use write back to reduce disk accesses
  – Must keep track of modified (dirty) pages in memory
Virtual memory
• Combines two big ideas
  – Non-contiguous memory allocation: processes are allocated page frames scattered all over the main memory
  – On-demand fetch: process pages are brought into main memory when they are accessed for the first time
• The MMU takes care of almost everything
Main memory
• Divided into fixed-size page frames
  – Allocation units
  – Sizes are powers of 2 (512 B, …, 4 KB, …)
  – Properly aligned
  – Numbered 0, 1, 2, …
[Figure: main memory shown as page frames numbered 0 through 8]
Program address space
• Divided into fixed-size pages
  – Same sizes as page frames
  – Properly aligned
  – Also numbered 0, 1, 2, …
[Figure: the address space shown as pages numbered 0 through 7]
The mapping
• Will allocate non-contiguous page frames to the pages of a process
[Figure: pages 0, 1, 2 of the process mapped to scattered page frames]
The mapping
Page number   Frame number
0             0
1             4
2             2
The mapping
• Assuming 1 KB pages and page frames

Virtual addresses     Physical addresses
0 to 1,023            0 to 1,023
1,024 to 2,047        4,096 to 5,119
2,048 to 3,071        2,048 to 3,071
The mapping
• Observing that 2^10 = 10000000000 in binary (a one followed by ten zeroes)
• We will write 0-0 for ten zeroes and 1-1 for ten ones

Virtual addresses          Physical addresses
000 0-0 to 000 1-1         000 0-0 to 000 1-1
001 0-0 to 001 1-1         100 0-0 to 100 1-1
010 0-0 to 010 1-1         010 0-0 to 010 1-1
The mapping
• The ten least significant bits of the address do not change

Virtual addresses          Physical addresses
000 0-0 to 000 1-1         000 0-0 to 000 1-1
001 0-0 to 001 1-1         100 0-0 to 100 1-1
010 0-0 to 010 1-1         010 0-0 to 010 1-1
The mapping
• Must only map page numbers into page frame numbers

Page number   Page frame number
000           000
001           100
010           010
The mapping
• Same in decimal
Page number   Page frame number
0             0
1             4
2             2
The mapping
• Since page numbers are always in sequence, they are redundant

Page number   Page frame number
0             0
1             4
2             2
(the page-number column can be dropped)
The algorithm
• Assume the page size is 2^p
• Remove the p least significant bits from the virtual address to obtain the page number
• Use the page number to find the corresponding page frame number in the page table
• Append the p least significant bits of the virtual address to the page frame number to get the physical address
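A minimal sketch of that algorithm, using the 1 KB pages and the page table above:

    PAGE_SIZE = 1024                  # 2**10 bytes, so p = 10
    page_table = {0: 0, 1: 4, 2: 2}   # page number -> page frame number

    def translate(virtual_address):
        page_number, offset = divmod(virtual_address, PAGE_SIZE)
        frame_number = page_table[page_number]    # a missing entry would be a page fault
        return frame_number * PAGE_SIZE + offset

    print(translate(1500))   # page 1, offset 476 -> 4*1024 + 476 = 4572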
Realization
[Figure: the virtual address is split into a page number and a 10-bit offset; the page number indexes the page table to obtain a page frame number, which is concatenated with the unchanged offset to form the physical address; for example, page 2 with offset 897 maps through the table to frame 7 with offset 897]
The offset
• The offset contains all the bits that remain unchanged through the address translation process
• It is a function of the page size

Page size   Offset
1 KB        10 bits
2 KB        11 bits
4 KB        12 bits
The page number
• Contains the other bits of the virtual address
• Assuming 32-bit addresses

Page size   Offset     Page number
1 KB        10 bits    22 bits
2 KB        11 bits    21 bits
4 KB        12 bits    20 bits
Internal fragmentation
• Each process now occupies an integer number of pages
• The actual process space is not a round number
  – The last page of a process is rarely full
• On the average, half a page is wasted
  – Not a big issue
  – Called internal fragmentation
On-demand fetch (I)
• Most processes terminate without having accessed their whole address space
  – Code handling rare error conditions, …
• Other processes go through multiple phases during which they access different parts of their address space
  – Compilers
On-demand fetch (II)
• VM systems do not fetch the whole address space of a process when it is brought into memory
• They fetch individual pages on demand, when they are accessed for the first time
  – Page miss or page fault
• When memory is full, they expel from memory pages that are not currently in use
On-demand fetch (III)
• The pages of a process that are not in main memory reside on disk
  – In the executable file for the program being run, for the pages in the code segment
  – In a special swap area, for the data pages that were expelled from main memory
On-demand fetch (IV)
[Figure: code pages in main memory are backed by the executable file on disk; data pages are backed by the swap area]
On-demand fetch (V)
• When a process tries to access data that are not present in main memory
  – The MMU hardware detects that the page is missing and causes an interrupt
  – The interrupt wakes up the page fault handler
  – The page fault handler puts the process in the waiting state and brings the missing page into main memory
Advantages
• VM systems use main memory more efficiently than other memory management schemes
  – Give to each process more or less what it needs
• Process sizes are not limited by the size of main memory
• Greatly simplifies program organization
Sole disadvantage
• Bringing pages from disk is a relatively slow operation
  – Takes milliseconds while memory accesses take nanoseconds
    • Ten thousand to a hundred thousand times slower
The cost of a page fault
• Let
  – T_m be the main memory access time
  – T_d the disk access time
  – f the page fault rate
  – T_a the average access time of the VM
    T_a = (1 – f) T_m + f (T_m + T_d) = T_m + f T_d
Example
• Assume T_m = 50 ns and T_d = 5 ms

f        Mean memory access time
10^-3    50 ns + 5 ms/10^3 = 5,050 ns
10^-4    50 ns + 5 ms/10^4 = 550 ns
10^-5    50 ns + 5 ms/10^5 = 100 ns
10^-6    50 ns + 5 ms/10^6 = 55 ns
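A minimal sketch that reproduces the table (all times in nanoseconds):

    def average_access_time(t_mem, t_disk, fault_rate):
        return t_mem + fault_rate * t_disk

    for f in (1e-3, 1e-4, 1e-5, 1e-6):
        print(f, average_access_time(50, 5_000_000, f), "ns")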
Conclusion
• Virtual memory works best when page fault rate is less than a page fault per 100,000 instructions
Locality principle (I)
• A process that would access its pages in a totally unpredictable fashion would perform very poorly in a VM system unless all its pages are in main memory
Locality principle (II)
• A process P accesses randomly a very large array consisting of n pages
• If m of these n pages are in main memory, the page fault frequency of the process will be (n – m)/n
• Must switch to another algorithm
Tuning considerations
• In order to achieve an acceptable performance,a VM system must ensure that each process has in main memory all the pages it is currently referencing
• When this is not the case, the system performance will quickly collapse
First problem
• A virtual memory system has
  – 32-bit addresses
  – 8 KB pages
• What are the sizes of the
  – Page number field?
  – Offset field?
Solution (I)
• Step 1: convert the page size to a power of 2
    8 KB = 2^13 B
• Step 2: the exponent is the length of the offset field
Solution (II)
• Step 3: size of the page number field = address size – offset size
    Here 32 – 13 = 19 bits
• Answer: 13 bits for the offset and 19 bits for the page number
PAGE TABLE REPRESENTATION
Page table entries
• A page table entry (PTE) contains
  – A page frame number
  – Several special bits
• Assuming 32-bit addresses, all fit into four bytes
[PTE layout: page frame number | special bits]
The special bits (I)
• Valid bit: 1 if the page is in main memory, 0 otherwise
• Missing bit: 1 if the page is not in main memory, 0 otherwise
• The two bits serve the same function; they just use different conventions
The special bits (II)
• Dirty bit: 1 if the page has been modified since it was brought into main memory, 0 otherwise
  – A dirty page must be saved in the process swap area on disk before being expelled from main memory
  – A clean page can be immediately expelled
The special bits (III)
• Page-referenced bit: 1 if the page has been recently accessed, 0 otherwise
  – Often simulated in software
Where to store page tables
• Use a three-level approach
• Store parts of the page table
  – In high-speed registers located in the MMU: the translation lookaside buffer (TLB) (good solution)
  – In main memory (bad solution)
  – On disk (ugly solution)
The translation look aside buffer
• Small high-speed memory
  – Contains a fixed number of PTEs
  – Content-addressable memory
• Entries include the page frame number and the page number
[TLB entry layout: page number | page frame number | special bits]
Realizations (I)
• TLB of the Intrinsity FastMATH
  – 32-bit addresses
  – 4 KB pages
  – Fully associative TLB with 16 entries
  – Each entry occupies 64 bits
    • 20 bits for the page number
    • 20 bits for the page frame number
    • Valid bit, dirty bit, …
Realizations (II)
• TLB of the UltraSPARC III
  – 64-bit addresses
    • Maximum program size is 2^44 bytes, that is, 16 TB
  – Supported page sizes are 4 KB, 16 KB, 64 KB, and 4 MB ("superpages")
Realizations (III)
• TLB of the UltraSPARC III
  – Dual direct-mapped (?) TLB
    • 64 entries for code pages
    • 64 entries for data pages
  – Each entry occupies 64 bits
    • Page number and page frame number
    • Context
    • Valid bit, dirty bit, …
The context (I)
• Conventional TLBs contain the PTEs for a specific address space
  – Must be flushed each time the OS switches from the current process to a new process
    • A frequent action in any modern OS
  – Introduces a significant time penalty
The context (II)
• The UltraSPARC III architecture adds to TLB entries a context identifying a specific address space
  – Page mappings from different address spaces can coexist in the TLB
  – A TLB hit now requires a match on both the page number and the context
  – Eliminates the need to flush the TLB
TLB misses
• When a PTE cannot be found in the TLB, a TLB miss is said to occur
• TLB misses can be handled
  – By the computer firmware:
    • Cost of a miss is one extra memory access
  – By the OS kernel:
    • Cost of a miss is two context switches
Letting SW handle TLB misses
• As for other exceptions, must save the current value of the PC in the EPC register
• Must also assert the exception by the end of the clock cycle during which the memory access occurs
  – In MIPS, must prevent the WB cycle from occurring after the MEM cycle that generated the exception
Example
• Consider the instruction
    lw $1, 0($2)
  – If the translation of the address in $2 is not in the TLB, we must prevent any update of $1
Performance implications
• When TLB misses are handled by the firmware, they are very cheap
  – A TLB hit rate of 99% is very good: the average access cost will be
    T_a = 0.99×T_m + 0.01×2T_m = 1.01 T_m
• Less true if TLB misses are handled by the kernel
Storing the rest of the page table
• PTs are too large to be kept entirely in main memory
  – Will store the active part of the PT in main memory
  – Other entries on disk
• Three solutions
  – Linear page tables
  – Multilevel page tables
  – Hashed page tables
Storing the rest of the page table
• We will review these solutions even though page table organizations are an operating system topic
Linear page tables (I)
• Store the PT in virtual memory (the VMS solution)
• Very large page tables need more than 2 levels (3 levels on the MIPS R3000)
Linear page tables (II)
[Figure: the page table and the other page tables live in virtual memory; only their active parts reside in physical memory]
Linear page tables (III)
• Assuming a page size of 4 KB
  – Each page of virtual memory requires 4 bytes of physical memory (its PTE)
  – Each PT maps 4 GB of virtual addresses
  – A PT will occupy 4 MB
  – Storing these 4 MB in virtual memory will require 4 KB of physical memory
Multi-level page tables (I)
• The PT is divided into
  – A master index that always remains in main memory
  – Sub-indexes that can be expelled
Multi-level page tables (II)
[Figure: the virtual address is split into a primary index, a secondary index, and an offset; the primary index selects an entry of the master index, which points to a sub-index; the secondary index selects the frame number there, and the offset is appended unchanged to form the physical address]
Multi-level page tables (III)
• Especially suited for a page size of 4 KB and 32-bit virtual addresses
• Will allocate
  – 10 bits of the address for the first level,
  – 10 bits for the second level, and
  – 12 bits for the offset
• The master index and the sub-indexes will all have 2^10 entries and occupy 4 KB
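A minimal sketch of a two-level lookup with that 10/10/12 split; the table contents below are made up for illustration:

    OFFSET_BITS, INDEX_BITS = 12, 10
    master_index = {0: {0: 7, 1: 3}}   # primary index -> sub-index -> page frame number

    def translate(virtual_address):
        offset = virtual_address & ((1 << OFFSET_BITS) - 1)
        secondary = (virtual_address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        primary = virtual_address >> (OFFSET_BITS + INDEX_BITS)
        frame = master_index[primary][secondary]   # a missing entry at either level means a page fault
        return (frame << OFFSET_BITS) | offset

    print(hex(translate(0x1ABC)))   # primary 0, secondary 1 -> frame 3 -> 0x3abc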
Hashed page tables (I)
• Only contain the pages that are in main memory
  – The PTs are much smaller
• Also known as inverted page tables
Hashed page table (II)
[Figure: the page number (PN) is hashed to locate an entry holding the PN and the corresponding page frame number (PFN)]
Selecting the right page size
• Increasing the page size
  – Increases the length of the offset
  – Decreases the length of the page number
  – Reduces the size of page tables
    • Fewer entries
  – Increases internal fragmentation
• 4 KB seems to be a good choice
MEMORY PROTECTION
Objective
• Unless we have an isolated single-user system, we must prevent users from– Accessing– Deleting– Modifying
the address spaces of other processes, including the kernel
Historical considerations
• Earlier operating systems for personal computers did not have any protection
  – They were single-user machines
  – They typically ran one program at a time
• Windows 2000, Windows XP, Vista, and Mac OS X are protected
Memory protection (I)
• VM ensures that processes cannot access page frames that are not referenced in their page table.
• Can refine control by distinguishing among
  – Read access
  – Write access
  – Execute access
• Must also prevent processes from modifying their own page tables
Dual-mode CPU
• Requires a dual-mode CPU
• Two CPU modes
  – Privileged mode, or executive mode, that allows the CPU to execute all instructions
  – User mode that allows the CPU to execute only safe unprivileged instructions
• The state of the CPU is determined by a special bit
Switching between states
• User mode will be the default mode for all programs
  – Only the kernel can run in supervisor mode
• Switching from user mode to supervisor mode is done through an interrupt
  – Safe because the jump address is at a well-defined location in main memory
Memory protection (II)
• Has additional advantages:
  – Prevents programs from corrupting the address spaces of other programs
  – Prevents programs from crashing the kernel
    • Not true for device drivers, which are inside the kernel
• Required part of any multiprogramming system
INTEGRATING CACHES AND VM
The problem
• In a VM system, each byte of memory has two addresses– A virtual address– A physical address
• Should cache tags contain virtual addresses or physical addresses?
Discussion
• Using virtual addresses
  – Directly available
  – Bypass the TLB
  – Cache entries are specific to a given address space
  – Must flush the caches when the OS selects another process
• Using physical addresses
  – Must access the TLB first
  – Cache entries are not specific to a given address space
  – Do not have to flush the caches when the OS selects another process
The best solution
• Let the cache use physical addresses
  – No need to flush the cache at each context switch
  – The TLB access delay is tolerable
Processing a memory access (I)
    if virtual_address in TLB:
        get physical address
    else:
        create TLB miss exception
        break
    …

I use Python because it is very compact: hetland.org/writing/instant-python.html
Processing a memory access (II)
    if read_access:
        while data not in cache:
            stall
        deliver data to CPU
    else:  # write_access
        …  # continues on the next page
Processing a memory access (III)

    if write_access_OK:
        while data not in cache:
            stall
        write data into cache
        update dirty bit
        put data and address in write buffer
    else:  # illegal access
        create TLB miss exception
More Problems (I)
• A virtual memory system has a virtual address space of 4 Gigabytes and a page size of 4 Kilobytes. Each page table entry occupies 4 bytes.
More Problems (II)
• How many bits are used for the byte offset?
• Since 4 KB = 2^12 bytes, the byte offset uses 12 bits
More Problems (III)
• How many bits are used for the page number?
• Since 4 GB = 2^32 bytes, we have 32-bit virtual addresses. Since the byte offset occupies 12 of these 32 bits, 20 bits are left for the page number.
More Problems (IV)
• What is the maximum number of page table entries in a page table?
• Address space / page size = 2^32 / 2^12 = 2^20 page table entries
More problems (VI)
• A computer has 32-bit addresses and a page size of one kilobyte.
• How many bits are used to represent the page number?
  – 22 bits
• What is the maximum number of entries in a process page table?
  – 2^22 entries
Answer
• As 1 KB = 2^10 bytes, the byte offset occupies 10 bits
• The page number uses the remaining 22 bits of the address
Some review questions
• Why are TLB entries 64-bit wide while page table entries only require 32 bits?
• What would be the main disadvantage of a virtual memory system lacking a dirty bit?
• What is the big limitation of VM systems that cannot prevent processes from executing the contents of any arbitrary page in their address space?
Answers
• We need extra space for storing the page number
• It would have to write back to disk all pages thatit expels even when they were not modified
• It would make the system less secure
VIRTUAL MACHINES
Key idea
• Let different operating systems run at the same time on a single computer
  – Windows, Linux and Mac OS
  – A real-time OS and a conventional OS
  – A production OS and a new OS being tested
How it is done
• A hypervisor (VM monitor) defines two or more virtual machines
• Each virtual machine has
  – Its own virtual CPU
  – Its own virtual physical memory
  – Its own virtual disk(s)
The virtualization process
[Figure: the hypervisor multiplexes the actual hardware (CPU, memory, disk) into several sets of virtual hardware, each with its own CPU, memory, and disk]
Reminder
• In a conventional OS,
  – The kernel executes in privileged/supervisor mode
    • Can do virtually everything
  – User processes execute in user mode
    • Cannot modify their page tables
    • Cannot execute privileged instructions
[Figure: user processes run in user mode and enter the kernel, which runs in privileged mode, through system calls]
Two virtual machines
[Figure: with two virtual machines, only the hypervisor runs in privileged mode; each VM kernel and its user processes run in user mode]
Explanations (II)
• Whenever the kernel of a VM issues a privileged instruction, an interrupt occurs
  – The hypervisor takes control and does the physical equivalent of what the VM attempted to do:
    • Must convert virtual RAM addresses into physical RAM addresses
    • Must convert virtual disk block addresses into physical block addresses
Translating a block address
[Figure: the VM kernel asks to access block (x, y) of its virtual disk; the hypervisor translates this into block (v, w) of the actual disk and issues the access]
Handling I/Os
• Difficult task because
  – There is a wide variety of devices
  – Some devices may be shared among several VMs
    • Printers
    • Shared disk partitions
      – Want to let Linux and Windows access the same files
Virtual Memory Issues
• Each VM kernel manages its own memory
  – Its page tables map program virtual addresses into what it believes to be physical addresses
The dilemma
[Figure: the VM kernel believes that page 735 of process A is stored in page frame 435; the hypervisor knows it actually sits in page frame 993 of the real RAM]
The solution (I)
• Address translation must remain fast!
  – The hypervisor lets each VM kernel manage its own page tables but does not use them
    • They contain bogus mappings!
  – It maintains instead its own shadow page tables with the correct mappings
    • Used to handle TLB misses
Why it works
• Most memory accesses go through the TLB
• The system can tolerate slower page table updates
The solution (II)
• To keep its shadow page tables up to date, hypervisor must track any changes made by the VM kernels
• Mark the VM page tables read-only
  – Each attempt by a VM kernel to update them results in an interrupt
Nastiest Issue
• The whole VM approach assumes that a kernel executing in user mode will behave exactly like a kernel executing in privileged mode, except that privileged instructions will be trapped
• Not true for all architectures!
  – Intel x86 pop flags (POPF) instruction
  – …
POPF instruction
• Pops the top of the stack into the lower 16 bits of EFLAGS
  – Designed for a 16-bit architecture
• EFLAGS contains the interrupt enable flag (IE)
• When executed in privileged mode, POPF updates all flags
• When executed in user mode, POPF updates all flags but the IE flag
Solutions
1. Modify the instruction set and eliminate instructions like POPF
   • IBM redesigned the instruction set of their 360 series for the 370 series
2. Mask it through clever software
   • Dynamic "binary translation" when direct execution of the code could not work (VMware)
Other Approaches (I)
• Can use the VM approach to let binaries written in a specific machine language run on a machine with a different instruction set
• Called emulators
• Have a huge performance penalty
  – Still work fairly well when the target machine is much faster than the original architecture
  – Lets us run very old binaries
Other Approaches (II)
• Can use the VM approach to let programs written in an arbitrary low-level language run on many different architectures
• Java virtual machine (JVM)
  – Ported to many architectures
  – Allows execution of programs written in "bytecode"
  – Professes to be inherently safe
CP/CMS (I)
• IBM was the dominant computer manufacturer during the 60's and the 70's
  – Machines were designed for batch processing
  – Lacked any decent time-sharing OS
    • Wanted by universities
  – TSS/360 was not a great success
CP/CMS (II)
• IBM Cambridge Scientific Center
  – In Cambridge, MA
  – Developed a combination of
    • A Control Program (CP) supporting virtual machines
    • A time-sharing OS (CMS) for a single user
• Was a great success!
CP/CMS (III)
• How it worked
[Figure: CP acts as the hypervisor; each user runs CMS on a separate virtual machine]
CACHE CONSISTENCY
The problem
• Specific to architectures with
  – Several processors sharing the same main memory
  – Multicore architectures
• Each core/processor has its own private cache
  – Needed for performance
• Problems occur when the same data are present in two or more private caches
An example
[Figure: two CPUs, each with a private cache, both holding x = 0; one CPU increments x, so its cache now holds x = 1, while the other cache still assumes x = 0]
Our Objective
• Single copy serializability
  – All operations on all the variables should have the same effect as if they were executed
    • in sequence, with
    • a single copy of each variable
One-copy serializability rules
1. Whenever a processor accesses a variable it always gets the value stored by the processor that updated that variable last
2. A processor accessing a variable sees all updates applied to that variable in the same order
   – The exact order does not matter as long as everybody agrees on it
An example
[Figure: one CPU sets x to 1 and another resets x to 0; the two remaining CPUs must apply the two updates in the same order, so they end up agreeing on the final value of x]
Big problem
• When a processor updates a cached variable, the new value of the variable is not immediately written into the main memory
  – Perfect one-copy serializability is not feasible
New rules
1. Whenever a processor accesses a variable it always gets the value stored by the processor that updated that variable last, provided the updates are sufficiently separated in time
2. A processor accessing a variable sees all updates applied to that variable in the same order
   – No compromise is possible here
A remark
• Data consistency issues appear in many disguises
  – Cache consistency
  – Distributed shared memory
    • work done in the early to mid 90's
  – Distributed file systems
  – Distributed databases
An example (I)
• UNIX workstations use a distributed file system called NFS (Network File System)
• An NFS setup comprises
  – client workstations
  – a centralized server
• NFS allows client workstations to cache the contents of the files they access
• What happens when two workstations access the same file?
An example (II)
[Figure: clients A and B both cache a file from the server; A writes x' and B writes x'', leading to inconsistent updates]
Possible Approaches (I)
• Always keep a single copy:
  – Guarantees one-copy serializability
  – Would make the system too slow
    • No caching!
• Prevent shared access:
  – Guarantees one-copy serializability
  – Would be very slow and complicated
Possible Approaches (II)
• Replicate and update:
  – Allows multiple processors to cache variables already cached by other processors
  – Whenever a processor updates a cached variable, it propagates the update to all other caches holding a copy of the variable
  – Costly because processors tend to repeatedly update the same variable
    • Temporal locality of accesses
Possible Approaches (III)
• Replicate and invalidate:
  – Allows multiple processors to cache variables already cached by other processors
  – Whenever a processor updates a cached variable, we invalidate all other cached copies of the variable
    • Works well with write-through caches
      – Will get the correct value later from RAM
A realization: Snoopy caches
• All caches are linked to the main memory through a shared bus
  – All caches observe the writes performed by the other caches
• When a cache notices that another cache performs a write on a memory location that it holds, it invalidates the corresponding cache block
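A minimal sketch of that write-invalidate behavior; this is a toy model, not a real bus protocol:

    class SnoopyCache:
        def __init__(self, bus):
            self.data = {}        # address -> cached value
            self.bus = bus
            bus.append(self)

        def write(self, address, value):
            self.data[address] = value               # write through also updates RAM
            for cache in self.bus:                   # every cache sees the write on the bus
                if cache is not self:
                    cache.data.pop(address, None)    # invalidate other copies

    bus = []
    a, b = SnoopyCache(bus), SnoopyCache(bus)
    a.data[0x10] = b.data[0x10] = 2    # both caches hold x = 2
    a.write(0x10, 0)                   # one CPU resets x; the other copy is invalidated
    print(b.data.get(0x10))            # -> None: B will refetch x = 0 from RAM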
An example
[Figure sequence: one CPU fetches x = 2 from RAM; a second CPU also fetches x = 2; the first CPU resets x to 0 and performs a write-through; the second cache detects the write-through on the bus and invalidates its copy of x; when its CPU next accesses x, the cache gets the correct value x = 0 from RAM]
A last correctness condition
• Caches cannot reorder their memory updates
  – The cache-to-RAM write buffer must be FIFO
    • First in, first out
Example
• A CPU performs
    x = 0;
    x++;   // sets x to 1
• The final value of x in the CPU cache is 1
• If the write buffer reorders the write-through requests, the final value of x in RAM, and in the other caches, will be 0
  – Ouch!
Miscellaneous fallacies (I)
• Segmented address spaces
– Address is segment number + offset in segment
– Supposed to let programmers organize their address space into meaningful segments
– Programmers—and compilers—hate them
Miscellaneous fallacies (II)
• Ignoring virtual memory behavior when accessing large two-dimensional arrays
– Must access array in a way that minimizes number of page faults
– Done by all good mathematical software libraries
Miscellaneous fallacies (III)
• Believing that you can virtualize any CPU architecture – Some are much more difficult than others
Concluding remarks
• As before, we have seen how human ingenuity has worked around hardware limitations
  – Cannot increase CPU speed above 3 to 4 GHz
    • Pipelining, multicore architectures
  – RAM is slower than the CPU
    • Caches
  – Hard disks are much slower than RAM
    • Virtual memory, I/O buffering