8/14/2019 Random Access Memory and Problems
Problem 1: Interleaving factor
What is the optimal interleaving factor if the memory cycle is 8 CC (1 + 6 + 1)?
Assume 4 banks
[Timing diagram: 1st, 2nd, and 3rd read requests issued to banks b1-b4 over a 20-CC timeline, with the bank accesses overlapped]
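The timing question can be explored with a small simulation. This is a sketch of my own, not from the slides: the function name and the model (one request issued per clock cycle, consecutive addresses striped round-robin across banks, each bank busy for the full 8 CC memory cycle) are assumptions. Under this model, 8 banks keep the bus fully busy, while 4 banks force the CPU to wait on busy banks:

```python
def read_time(n_requests, n_banks, bank_cycle=8):
    """Cycle at which the last of n_requests consecutive-address reads
    completes. Address i maps to bank i % n_banks; a bank is busy for
    bank_cycle CC per request, and at most one request issues per CC."""
    bank_free = [0] * n_banks   # cycle at which each bank becomes free
    issue = 0                   # earliest cycle the next request may issue
    finish = 0
    for i in range(n_requests):
        bank = i % n_banks
        start = max(issue, bank_free[bank])  # wait for the bank if busy
        bank_free[bank] = start + bank_cycle
        finish = start + bank_cycle
        issue = start + 1
    return finish

# With 8 banks a new request can start every cycle and none ever waits;
# with 4 banks each bank is revisited before its 8 CC cycle has elapsed.
print(read_time(16, 8), read_time(16, 4))
```

The simulation suggests the rule of thumb that the interleaving factor should match the memory cycle (8 here) so that a bank is free again by the time it is re-addressed.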
Problem 2 (page 452 in your textbook)
Block size = 1 word
Memory bus size = 1 word
Miss rate = 3%
Memory access per instruction = 1.2
Cache miss penalty = 64 CC
Average cycles per instruction = 2
[Diagram: CPU → Cache → simple memory; 64 CC miss penalty]
Assume 1000 instructions in your Program
If no miss then execution time is 2000 CC
One instruction needs 1.2 memory accesses, so 1000 instructions make 1200 accesses. If the miss rate is 3%, then the number of misses for 1200 accesses is 36.
Execution time = 2000 + 36 × 64 = 4304 CC
Average cycles per instruction = 4304 / 1000 = 4.3
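The arithmetic above generalizes to a one-line stall model. A minimal sketch (function and parameter names are mine, not the textbook's):

```python
def total_cycles(instructions, base_cpi, accesses_per_instr,
                 miss_rate, miss_penalty):
    """Base execution cycles plus memory-stall cycles."""
    base = instructions * base_cpi
    misses = instructions * accesses_per_instr * miss_rate
    return base + misses * miss_penalty

cycles = total_cycles(1000, 2, 1.2, 0.03, 64)   # Problem 2: 2000 + 36 * 64
cpi = cycles / 1000
print(cycles, cpi)
```

The same function reproduces the wider-bus variant by swapping in its miss rate and penalty, e.g. `total_cycles(1000, 2, 1.2, 0.025, 128)`.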
Problem 2 (wider bus: 2 words)
Block size = 4 word
Memory bus size = 2 word
Miss rate = 2.5%
Memory access per instruction = 1.2
Cache miss penalty = 128 CC
Average cycles per instruction = 2
[Diagram: CPU → Cache → interleaved memory]
Assume 1000 instructions in your Program
If no miss then execution time is 2000 CC
One instruction needs 1.2 memory accesses, so 1000 instructions make 1200 accesses. If the miss rate is 2.5%, then the number of misses for 1200 accesses is 30.
Execution time = 2000 + 30 × 128 = 5840 CC
Average cycles per instruction = 5840 / 1000 = 5.84
Problem 2 (2-way interleaving)
Block size = 4 word
Memory bus size = 1 word
Miss rate = 2.5%
Memory access per instruction = 1.2
Cache miss penalty = 68 × 2 CC
Average cycles per instruction = 2
[Diagram: CPU → Cache → 2-way interleaved memory]
Assume 1000 instructions in your Program
If no miss then execution time is 2000 CC
One instruction needs 1.2 memory accesses, so 1000 instructions make 1200 accesses. If the miss rate is 2.5%, then the number of misses for 1200 accesses is 30.
Execution time = 2000 + 30 × 68 × 2 = 6080 CC
Average cycles per instruction = 6080 / 1000 = 6.08
DRAM (Dynamic Random Access Memory)
Address Multiplexing
RAS and CAS
Dynamic refreshing and its implications: data must be written back after a READ!
Amdahl's rule of thumb:
A 1000 MIPS machine should have 1000 MB of memory
Read your textbook for DRAM, SRAM, RAMBUS, and Flash.
SRAM (Static Random Access Memory)
Uses six transistors per bit.
Static
NO NEED to write data back after a READ
Comparison versus DRAM:
More transistors
Lower capacity (about 1/8 that of DRAM)
8-10 times more expensive
No refresh needed
8-16 times faster
Used in caches instead of main memory
Flash
Similar to EEPROM
Ideal choice for embedded systems: low power requirement, no moving parts
Faster READ time, slower WRITE time
Comparison versus EEPROM:
Finer and simpler control
Better capacity
A WRITE operation requires an entire block to be erased and rewritten. Writes are 50 to 100 times slower than READs.
NOR type flash and NAND type flash
Virtual Memory
[Figure: virtual pages A, B, C, D at virtual addresses 0K-12K; C maps to physical address 4K, A to 16K, B to 24K, and D resides on disk. Virtual addresses are translated to physical addresses.]
Virtual Memory versus First-Level Cache

Parameter         First-Level Cache     Virtual Memory
Block/Page size   16-128 B              4K-64K B
Hit time          1-3 CC                50-150 CC
Miss penalty      8-150 CC              1-10 M CC
  Access time     6-130 CC              0.8-8 M CC
  Transfer time   2-20 CC               0.2-2 M CC
Miss rate         0.1-10%               0.00001-0.001%
Address mapping   25-45 bit physical    32-64 bit virtual
                  address to 14-20 bit  address to 25-45 bit
                  cache address         physical address
Memory Hierarchy Questions
Where can a block be placed?
How is a block found if it is in the upper level?
Which block should be replaced?
What happens on a write?
Segmentation versus Paging
Remark                   Page                          Segment
Words per address        One                           Two (segment and offset)
Programmer visible       Invisible to programmer       Visible to programmer
Replacing a block        Trivial                       Hard
Memory use inefficiency  Internal fragmentation        External fragmentation
Efficient disk traffic   Yes (easy to tune page size)  Not always (small segments)

[Figure: code and data split into fixed-size pages versus variable-size segments]
CPU Execution Time
CPU execution time = (CPU cycles + stall cycles) × cycle time
Stall cycles = misses × penalty
Misses are given either as misses per 1000 instructions or as misses per memory access (AKA miss rate).
Instruction count, cycles per instruction, and misses are all required to compute CPU execution time.
Average Access Time with Cache
Average access time = hit time + miss rate × penalty
Multi-level cache:
Avg access time = hit time(L1) + miss rate(L1) × penalty(L1)
penalty(L1) = hit time(L2) + miss rate(L2) × penalty(L2)
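The two-level formula composes the same expression with itself: the L1 miss penalty is the average access time seen at L2. A small sketch; the numeric values below are illustrative assumptions, not from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in clock cycles."""
    return hit_time + miss_rate * miss_penalty

# The L1 penalty is itself the average access time of the L2 level.
penalty_l1 = amat(hit_time=10, miss_rate=0.05, miss_penalty=100)  # L2 side
overall = amat(hit_time=1, miss_rate=0.02, miss_penalty=penalty_l1)
print(penalty_l1, overall)
```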
I/O Systems
Objective
To understand mainly disk storage technologies.
Storage Systems
Individual Disks
SCSI Subsystem
RAID
Storage Area Networks (NAS & SAN)
Disk Storage
Disk storage is slower than memory.
Disk offers better volume at a lower unit price.
Disks fit nicely into the virtual memory concept.
A critical piece of overall system performance.
Courtesy: www.shortcourses.com/choosing/storage/06.htm
Disk System
Disk pack & cache
Disk embedded controller
SCSI host adapter
OS (kernel/driver)
User program

[Diagram: user program → OS (kernel/driver) → SCSI host adapter on the host I/O bus → SCSI bus → disk controllers → disks]
Individual Disks
Reference: http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time
Seek Time
Rotational Latency
Internal transfer time
Other delays
That is,
Avg access time = avg seek time + avg rotational delay + transfer time + other overhead
Problem (page 684)
Seek time = 5 ms/100 tracks
RPM = 10,000
Transfer rate = 40 MB/sec
Other delays = 0.1 ms
Sector size = 0.5 KB (512 B)
Average access time = average seek time (5 ms) + average rotational delay (time for 1/2 revolution) + transfer time (512 / (40 × 10⁶) s) + other overhead (0.1 ms)
= 5 + 10³ × 60 / (2 × 10000) + 512 × 10³ / (40 × 10⁶) + 0.1 ms
= 5 + 3 + 0.0128 + 0.1 = 8.1128 ms
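The same computation as a function (a sketch of my own; the function and parameter names are assumptions):

```python
def avg_access_ms(seek_ms, rpm, sector_bytes, transfer_bytes_per_s,
                  overhead_ms):
    """Average disk access time in milliseconds."""
    rotational_ms = 0.5 * 60_000 / rpm                    # half a revolution
    transfer_ms = sector_bytes / transfer_bytes_per_s * 1000
    return seek_ms + rotational_ms + transfer_ms + overhead_ms

t = avg_access_ms(seek_ms=5, rpm=10_000, sector_bytes=512,
                  transfer_bytes_per_s=40e6, overhead_ms=0.1)
print(t)   # 5 + 3 + 0.0128 + 0.1
```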
RAID
RAID stands for Redundant Array of Inexpensive/Independent Disks.
Uses a number of small disks instead of one large disk.
Six+ types of RAID (0-5).
Terminologies
Mean Time Between Failures (MTBF)
Mean Time To Data Loss (MTTDL)
Mean Time To Data Inaccessibility (MTTDI)
RAID-0 Performance
Throughput: best case, nearly n × the single-disk value
Utilization: worst case, nearly (1/n) × the single-disk value
Data reliability: rⁿ, where r is the reliability of one disk (r ≤ 1)
Sequential access: fast
Random access: multithreaded random access offers better performance
When r = 0.8 and n = 2, reliability is 0.64
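The reliability claim is easy to check numerically (function name is mine):

```python
def raid0_reliability(r, n):
    """RAID-0 survives only if every one of its n disks survives."""
    return r ** n

print(raid0_reliability(0.8, 2))   # the slide's example: 0.64
```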
RAID-1 Mirroring
Issue addressed: RAID-0's reliability problem.
Shadowing or mirroring is used.
Data is not lost when a disk goes down.
One or more disks can be used to mirror the primary disk.
Writes are posted to the primary and shadowing disks.
Reads can be served from any of them.

[Figure: blocks 01-03 duplicated on the primary disk and its mirror]
RAID-1 Performance
Reliability is improved with mirroring: 1 − (1 − r)(1 − r)
Example: when r is 0.8, the reliability of RAID-1 is 0.96.
Writes are more complex: they must be committed to the primary and all shadowing disks.
Writes are much slower due to the atomicity requirement.
Expensive due to 1-to-1 redundancy.
RAID 0+1 - Striping & Mirroring

[Figure: RAID 1 mirrors two RAID-0 stripes of n disks each; blocks 01-06 are striped identically across both sets]
Performance RAID-0+1
Let the reliability of a RAID-0 sub-tree be R.
Then the reliability of the RAID-1 tree = 1 − (1 − R)²
R = r² (where r is the reliability of a single disk)
Throughput is the same as RAID-0, but with 2 × n disks.
Utilization is lower than RAID-0 due to mirroring.
Writes are marginally slower due to atomicity.
When r = 0.9, R = 0.81, and the reliability is 1 − (0.19)² ≈ 0.96.
RAID 1+0 - Mirroring & Striping

[Figure: RAID 0 stripes across RAID-1 mirrored pairs; blocks 01/03/05 on one mirrored pair, 02/04/06 on the other]
Performance RAID-1+0
Let the reliability of a RAID-1 sub-tree be R.
Then the reliability of the RAID-0 tree = R²
R = 1 − (1 − r)² (where r is the reliability of a single disk)
Throughput is the same as RAID-0, but with 2 × n disks.
Utilization is lower than RAID-0 due to mirroring.
Writes are marginally slower due to atomicity.
When r = 0.9, R = 0.99, and the reliability is (0.99)² ≈ 0.98.
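The two orderings can be compared side by side (a sketch; function names are mine, and n is the number of disks per stripe or pairs per stripe respectively):

```python
def raid01_reliability(r, n):
    """RAID 0+1: two RAID-0 stripes of n disks, mirrored."""
    stripe = r ** n                   # a stripe needs all n of its disks
    return 1 - (1 - stripe) ** 2      # the array needs either stripe

def raid10_reliability(r, n):
    """RAID 1+0: n mirrored pairs, striped."""
    pair = 1 - (1 - r) ** 2           # a pair needs either of its disks
    return pair ** n                  # the array needs every pair

print(raid01_reliability(0.9, 2))    # 1 - 0.19^2, the slide's ~0.96
print(raid10_reliability(0.9, 2))    # 0.99^2, the slide's ~0.98
```

The comparison makes the slides' point concrete: with the same 2 × n disks, mirroring below the stripe (1+0) survives more failure patterns than mirroring above it (0+1).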
RAID-2 Hamming Code Arrays
Low commercial interest due to the complex nature of the Hamming code computation.
RAID-3 Striping with Parity

[Figure: logical blocks 01-12 striped as single bits or words across four data disks (stripe width 4), with parity P0-P2 stored on a dedicated parity disk]
RAID-3 Operation
Based on the principle of a reversible form of parity computation (XOR).
Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn
Missing stripe Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm−1 ⊕ Cm+1 ⊕ … ⊕ Cn
RAID-3 Performance
RAID-1's 1-to-1 redundancy issue is addressed by one parity disk for n data disks. Less expensive than RAID-1.
The rest of the performance is similar to RAID-0.
RAID-3 can withstand the failure of one of its disks.
Reliability = P(all disks working) + P(exactly one disk failed)
= rⁿ + n × rⁿ⁻¹ × (1 − r)
When r = 0.9 and n = 5:
= 0.9⁵ + 5 × 0.9⁴ × (1 − 0.9) = 0.59 + 0.33 ≈ 0.92
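The "at most one failure" reliability formula in code (a sketch; the function name is mine):

```python
from math import comb

def raid3_reliability(r, n):
    """Probability that all n disks work, or exactly one has failed."""
    all_ok = r ** n
    one_failed = comb(n, 1) * r ** (n - 1) * (1 - r)
    return all_ok + one_failed

print(raid3_reliability(0.9, 5))   # ~0.918 for the slide's example
```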
RAID-4 Performance
Similar to RAID-3, but supports larger chunks.
Performance measures are similar to RAID-3.
RAID-5 (Distributed Parity)
[Figure: logical blocks 01-12 striped in chunks across five disks, with parity blocks P0-P2 distributed among the disks rather than on a dedicated parity disk]
Reading Assignment
Optical Disk
RAID study material
Worked-out problems from the textbook: p 452, p 537, p 539, p 561, p 684, p 691, p 704
Amdahl's Speedup
Amdahl's Speedup for Multiprocessors
Speedup = 1 / (Fraction_e / Speedup_e + (1 − Fraction_e))
Speedup_e: the number of processors
Fraction_e: the fraction of the program that runs in parallel on Speedup_e processors
Assumption: the program runs either in fully parallel (enhanced) mode, using all processors, or in non-enhanced (serial) mode.
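The formula in executable form (a sketch; the example values are illustrative, not from the slides):

```python
def amdahl_speedup(fraction_e, speedup_e):
    """Overall speedup when fraction_e of the work runs in parallel on
    speedup_e processors and the remaining (1 - fraction_e) runs serially."""
    return 1 / (fraction_e / speedup_e + (1 - fraction_e))

print(amdahl_speedup(0.8, 16))   # 80% parallel on 16 processors -> ~4x
```

Note how the serial fraction dominates: even with infinitely many processors, an 80%-parallel program cannot exceed a speedup of 5.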
Multiprocessor Architectures
Single Instruction Stream, Single Data Stream (SISD)
Single Instruction Stream, Multiple Data Stream (SIMD)
Multiple Instruction Streams, Single Data Stream (MISD)
Multiple Instruction Streams, Multiple Data Streams (MIMD)
Shared Memory versus Message Passing

#  Shared Memory                                   Message Passing
1  Compatibility with the well-understood shared-  Simpler hardware compared to
   memory mechanism used in centralized            scalable shared memory
   multiprocessor systems
2  Simplifies compiler design                      Communication is explicit, which
                                                   makes it easier to understand
3  No need to learn a messaging protocol           Improved modularity
4  Low communication overhead                      No need for the expensive and complex
                                                   synchronization mechanisms used in
                                                   shared memory
5  Caching to improve latency
Two types of Shared Memory architectures
[Figure: centralized shared-memory architecture - several processors, each with a cache, share one main memory and I/O system over a bus; distributed shared-memory architecture - processor+cache nodes, each with local memory and I/O, connected by an interconnection network]
Symmetric versus Distributed Memory MP
SMP: uses shared memory for inter-process communication
Advantages:
Close coupling due to shared memory
Sharing of data is faster between processors
Disadvantages:
Scaling: memory is a bottleneck
High and unpredictable memory latency
Distributed: uses message passing for inter-process communication
Advantages:
Low memory latency
Scaling: scales better than SMP
Disadvantages:
Control and management are more complex due to distributed memory
Performance Metrics
Communication bandwidth
Communication latency: ideally, latency is as low as possible. Communication latency
= sender overhead + time of flight + transmission time + receiver overhead
Communication Latency Hiding
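The latency sum can be written directly as a function. This is a sketch of my own; the parameter names and the example numbers (a 1 KB message over a 1 Gb/s link with 1 µs overheads and 1 µs time of flight) are assumptions for illustration:

```python
def comm_latency_s(sender_overhead_s, time_of_flight_s,
                   message_bits, bandwidth_bits_per_s,
                   receiver_overhead_s):
    """Total one-way communication latency in seconds."""
    transmission_s = message_bits / bandwidth_bits_per_s
    return (sender_overhead_s + time_of_flight_s +
            transmission_s + receiver_overhead_s)

t = comm_latency_s(1e-6, 1e-6, 8 * 1024, 1e9, 1e-6)
print(t)
```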
Cache Coherence Problem
Time  Event                 Cache contents  Cache contents  Memory contents
                            for CPU A       for CPU B       for location X
0                                                           1
1     CPU A reads X         1                               1
2     CPU B reads X         1               1               1
3     CPU A stores 0 in X   0               1               0
Cache Coherence
A Memory System is coherent if
1. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. This defines a coherent view of memory.
2. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
Features supported by Coherent Multiprocessors
Migration: shared data in a cache can be moved to another cache directly (without going through main memory). This is referred to as migration. It is used in a transparent fashion. It reduces both latency (from going to another cache every time the data item is accessed) and precious memory bandwidth.
Replication: caches also provide replication for shared data items that are being simultaneously read, since each cache makes a copy of the data item locally. Replication reduces both the latency of access and contention for a read-shared data item.
Migration and Replication
[Figure: centralized shared-memory architecture (SMP) - processors with caches on a shared bus to main memory and I/O, with a cache controller (cc) per cache implementing cache coherence (CC)]
Cache Coherent (CC) Protocols
CC protocols implement cache coherence.
Two types:
Snooping (replicated): there are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept.
Directory based (logically centralized): there is only one copy of the sharing status of a block of physical memory. All the processors use this one copy. This copy could be in any of the participating processors.
Snooping Protocol
[Figure: SMP - processors with caches on a shared bus; each cache controller (cc) snoops the bus]
Two ways:
1. Write invalidation
2. Write broadcast
Reading Assignment
Section 7.4 (Reliability, Availability, & Dependability) in your textbook
Pages 554 and 555
Section 7.11 I/O Design: attempt all 5 problems in this section.
Invalidation versus Write Distribute
Multiple writes to the same data item with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation in a write-invalidate protocol.
With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.
An invalidation protocol works on cache blocks, while an update protocol must work on individual words.
The delay between writing a word in one processor and reading the written value in another processor is usually less in a write-update scheme, since written data are immediately updated in the reader's cache. By comparison, in an invalidation protocol, the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.
Use of Valid, Shared, and Dirty Bits
Valid bit: every time a block is loaded into a cache from memory, the tag for the block is saved in the cache and the valid bit is set to TRUE. A write update to the same block in a different processor may reset this valid bit due to write invalidate. Thus, when a cache block is accessed for READ or WRITE, the tag should match AND the value of the valid bit should be TRUE. If the tag matches but the valid bit is reset, then it is a cache miss.
Shared bit: when a memory block is loaded into a cache block for the first time, the shared bit is set to FALSE. When some other cache loads the same block, it is turned to TRUE. When this block is updated, write invalidate uses the value of the shared bit to decide whether to send a write-invalidate message or not. If the shared bit is set, then an invalidate message is sent; otherwise not.
Dirty bit: the dirty bit is set to FALSE when a block is loaded into cache memory. It is set to TRUE when the block is updated for the first time. When another processor wants to load this block, the block is migrated to that processor instead of being loaded from memory.
Summary of Snooping Mechanism

Request     Source     State of cache block  Function and explanation
Read hit    processor  shared or exclusive   Read data in cache
Read miss   processor  invalid               Place read miss on bus
Read miss   processor  shared                Address conflict miss: place read miss on bus
Read miss   processor  exclusive             Address conflict miss: write back block, then place read miss on bus
Write hit   processor  exclusive             Write data in cache
Write hit   processor  shared                Place write miss on bus
Write miss  processor  invalid               Place write miss on bus
Write miss  processor  shared                Address conflict miss: place write miss on bus
Write miss  processor  exclusive             Address conflict miss: write back block, then place write miss on bus
Read miss   bus        shared                No action; allow memory to service the read miss
Read miss   bus        exclusive             Attempt to share data: place cache block on bus and change state to shared
Write miss  bus        shared                Attempt to write shared block; invalidate the block
Write miss  bus        exclusive             Attempt to write a block that is exclusive elsewhere: write back the block and make its state invalid
State Transition
Cache state transitions based on requests from the CPU:
[State diagram: states Invalid, Shared (read only), Exclusive (read/write). CPU read miss: Invalid → Shared, place read miss on bus. CPU write: Invalid or Shared → Exclusive, place write miss on bus. CPU read hit in Shared or Exclusive and CPU write hit in Exclusive: no transition. CPU read miss while Exclusive: write back the cache block, place read miss on bus, go to Shared. CPU write miss while Exclusive: write back the block, place write miss on bus.]
State Transition
Cache state transitions based on requests from the bus:
[State diagram: a write miss on the bus for this block moves Shared → Invalid, and moves Exclusive → Invalid after writing back the block and aborting the memory access; a read miss on the bus for this block moves Exclusive → Shared, writing back the block and aborting the memory access.]
Some Terminologies
Polling: a process periodically checks whether there is a message it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization.
Interrupt: a process is notified when a message arrives via a built-in interrupt mechanism. Interrupts increase processor utilization in comparison to polling.
Synchronous: a process sends a message and waits for the response to come before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication.
Asynchronous: a process sends a message and continues to carry out other tasks while the requested message is processed. This is referred to as asynchronous communication.
Communication Infrastructure
Multiprocessor systems with shared memory can have two types of communication infrastructure:
Shared bus
Interconnect
Directory Based Protocol
[Figure: nodes 1 through n, each holding a directory, connected by an interconnect]
State of the Block
Shared: one or more processors have the block cached, and the value in memory is up to date.
Uncached: no processor has a copy of the block.
Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. That processor is called the owner of the block.
Local, Remote, Home Node
Local node: the node where the request originates.
Home node: the node where the memory location and the directory entry of an address reside.
Remote node: a node that has a copy of the cache block.
Uncached State Operation
Read miss: the requesting processor is sent the requested data from memory, and the requestor is made the only sharing node. The state of the block is made shared.
Write miss: the requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Exclusive State Operation
Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
Data write back: the owner processor is replacing the block and therefore must write it back. This write back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
Write miss: the block has a new owner. A message is sent to the old owner causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains exclusive.
Cache State Transition
[State diagram: block states Uncached, Shared (read only), Exclusive (read/write). Read miss on an Uncached block: data value reply; Sharers = {P}; → Shared. Read miss on a Shared block: data value reply; Sharers += {P}. Write miss on an Uncached or Shared block: data value reply (with invalidates to Sharers when Shared); Sharers = {P}; → Exclusive. Data write back from Exclusive: Sharers = {}; → Uncached. Read miss on an Exclusive block: fetch; data value reply; Sharers += {P}; → Shared. Write miss on an Exclusive block: fetch/invalidate; data value reply; Sharers = {P}.]
True Sharing Miss & False Sharing Miss
Time  Processor P1  Processor P2
1     Write X1
2                   Read X2
3     Write X1
4                   Write X2
5                   Read X2
Synchronization
We need a set of hardware primitives with the ability to atomically read and modify a memory location.

Example 1: Atomic Exchange
Interchanges a value in a register for a value in memory.
You could build a lock with this: if the memory location contains 1, then the lock is held.
[Figure: register A exchanging the values 0 and 1 with a main-memory location]
Other Operations
Test-and-Set
Fetch-and-Increment
Load Linked & Set Conditional
Load Linked: load linked is a primitive operation that loads the content of a specified location into cache.
Set Conditional: the set conditional operation is related to the load linked operation by operating on the same memory location. This operation sets the location to a new value only if the location still contains the same value read by load linked; otherwise it fails.
Operation of LL & SC
try:  LL     R2, 0(R1)   ; load linked
      DADDUI R3, R2, #1  ; increment
      SC     R3, 0(R1)   ; store conditional
      BEQZ   R3, try     ; branch if store fails

The address of the memory location loaded by LL is kept in a register.
If there is an interrupt or an invalidate, this register is cleared.
SC checks whether this register is zero: (a) if it is, SC fails; (b) otherwise, it stores the content of R3 at that memory address.
Multiprocessor System
Uses cache coherence
Bus-based or interconnect-based systems
The coherence system arbitrates writes (invalidation / write distribute)
Thus, it serializes writes!
Spin Locks
Locks that a processor continuously tries to acquire, spinning around a loop until it succeeds.
Used when a lock is expected to be held for a short amount of time.
Spin Lock implementation
lockit: LD     R2, 0(R1)  ; load the lock
        BNEZ   R2, lockit ; not available, spin
        DADDUI R2, R0, #1 ; load locked value
        EXCH   R2, 0(R1)  ; swap
        BNEZ   R2, lockit ; branch if lock wasn't 0
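The same spin-and-swap idea can be sketched in Python rather than assembly. This is a toy model of my own (the `SpinLock` class and the worker counts are assumptions): `threading.Lock.acquire(blocking=False)` plays the role of the atomic exchange, atomically testing and setting the flag in one step.

```python
import threading

class SpinLock:
    """Toy spin lock modeled on the EXCH-based loop above."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Spin until the atomic test-and-set succeeds, like BNEZ/EXCH.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1    # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 40000: the lock serializes all increments
```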
Cache Coherence Steps

Step 1: P0 has the lock; P1 and P2 spin, testing if lock = 0. Lock state: Shared. Bus/Directory activity: none.
Step 2: P0 sets lock to 0; P1 and P2 receive invalidates. Lock state: Exclusive (P0). Activity: write invalidate of the lock variable from P0.
Step 3: P1 and P2 both get cache misses. Lock state: Shared. Activity: bus/directory services P2's cache miss; write back from P0.
Step 4: P1 waits for the bus; P2 reads lock = 0. Lock state: Shared. Activity: cache miss for P2 satisfied.
Step 5: P1 reads lock = 0; P2 executes the swap and gets a cache miss. Lock state: Shared. Activity: cache miss for P1 satisfied.
Step 6: P1 executes the swap and gets a cache miss; P2 completes the swap, which returns 0 and sets lock = 1. Lock state: Exclusive (P2). Activity: bus/directory services P2's cache miss; generates an invalidate.
Step 7: P1's swap completes, returning 1 and setting lock = 1; P2 enters its critical section. Lock state: Exclusive (P1). Activity: bus/directory services P1's cache miss; generates a write back.
Step 8: P1 spins, testing if lock = 0. Activity: none.
Multi-threading
Multi-threading: why and how?
Fine-grain:
Almost round-robin thread switching
High overhead (compared to coarse-grain)
Better throughput than coarse-grain
Coarse-grain:
Threads switched only on long stalls (e.g., an L2 cache miss)
Low overhead
Individual threads get better performance
Reading Assignment
Memory Consistency Notes (Answers)
Problem on page 596
An example that uses LL (Load Linked) and SC (Set Conditional).
IETF - Internet Engineering Task Force
Loosely self-organized groups (working groups)
Two types of documents: I-Ds and RFCs
An I-D's life is shorter than an RFC's life
RFCs include proposed standards and standards
RFC examples: RFC 1213, RFC 2543
I-D -> Proposed Standard -> Standard
SIP Evolution
SIP I-D: v1 '97, v2 '98
SIP RFC 2543: '99
SIP RFC 3261: '02, obsoletes RFC 2543
SIP working groups: SIPPING, SIMPLE, PINT, SPIRITS
SIPit: SIP interoperability test
Predictor for SPEC92 Benchmarks
[Chart: instructions between mispredictions (0-300) for the SPEC92 benchmarks compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, and su2cor, comparing a predicted scheme against a profile-based scheme]