Memory Consistency Models
Calcolatori Elettronici e Sistemi Operativi
Sources of out-of-order memory accesses
• Compiler optimizations
• Store buffers
  • FIFOs for uncommitted writes
• Invalidate queues (for cache coherency)
• Data prefetch
• Banked cache architectures
• Networked interconnect
• Non-uniform memory access (NUMA) architectures:
  different accesses to memory have different latencies
• ...
Compiler optimizations
• The language semantics do not consider:
  1. Side effects of memory accesses
  2. Multi-threading
  3. Asynchronous execution
• The compiler can:
  • Reorder instructions
  • Eliminate operations
• Some compiler optimizations can be controlled by the
  volatile qualifier
C code:
  void waitval (int *ptr)
  {
    while (*ptr == 0)
      continue;
  }
ARM assembly code:
  waitval:
    ldr   r3, [r0]
    cmp   r3, #0
    movne pc, lr
  loop:
    b     loop
The compiler does not need to consider that someone else can change *ptr:
*ptr is loaded only once and, if it is 0, the function loops forever
C code:
  int add3 (int x)
  {
    int i;
    for (i=0; i<3; i++)
      x += x;
    return x;
  }
ARM assembly code:
  add3:
    mov r0, r0, asl #3
    mov pc, lr
This function always returns 8·x: the compiler can replace the loop with a single shift
C code:
  int add_vals (int *vec)
  {
    int y = vec[1];
    y += vec[0];
    return y;
  }
ARM assembly code:
  add_vals:
    ldr r3, [r0]
    ldr r0, [r0, #4]
    add r0, r0, r3
    mov pc, lr
The result does not depend on the access order: the compiler can change the order of the loads
Volatile
• Semantics
  • Each read from a volatile variable requires an actual load
    and may return a different value
    • Compiler optimizations cannot merge reads from the same address
  • Each write to a volatile variable requires an actual store
    • Compiler optimizations cannot cancel stores
• Required to access the I/O address space
Note:
these are the C/C++ semantics;
the Java semantics differ (they also imply atomicity)
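As a minimal sketch of the rule above, the waitval() loop can be rewritten with a volatile-qualified pointer target, so that every iteration performs a real load. The function name waitval_volatile and its return value are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of waitval(): the pointed-to int is volatile,
 * so the compiler must emit an actual load on every loop iteration
 * and cannot hoist the read out of the loop. */
int waitval_volatile(volatile int *ptr)
{
    int seen;

    assert(ptr != NULL);
    while ((seen = *ptr) == 0)
        continue;           /* *ptr is re-read each time around */
    return seen;            /* first non-zero value observed */
}
```

With this qualification the loop can terminate when an interrupt handler, a device, or another processor changes the location. It still gives no ordering guarantee with respect to other memory accesses.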
Examples
  int *ptr;                              /* pointer to int */
  volatile int *ptr_to_vol;              /* pointer to volatile int */
  int *volatile vol_ptr;                 /* volatile pointer to int */
  volatile int *volatile vol_ptr_to_vol; /* volatile pointer to volatile int */
• Beware of the semantics:
  • a = *ptr_to_vol;
    • is a volatile access
  • a = *vol_ptr;
    • is not a volatile access to the pointed-to int
      (only the read of the pointer itself is volatile)
Volatile
• Inconsistent qualification causes errors
• Volatile does not enforce ordering with non-volatile
  accesses
• Volatile does not enforce the order in which accesses are
  actually performed
• Volatile does not mean atomic
Volatile
  volatile int A;
  volatile int B;
  A=1;   /* these two lines won't be */
  B=1;   /* reordered by compiler */

  int A;
  volatile int B;
  A=1;   /* these two lines can be */
  B=1;   /* reordered by compiler */

  volatile int A;
  volatile int B;
  A=1;   /* these two lines won't be */
  B=1;   /* reordered by compiler but */
         /* accesses can be reordered */
         /* by HW */

  volatile int X;
  X=1;   /* this assignment can be interrupted or preempted */
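The point that volatile does not mean atomic can be illustrated with a small sketch that places a volatile counter next to a C11 atomic counter. The use of <stdatomic.h> and the names vol_counter, atm_counter, and bump_both are assumptions made for this illustration; the slides do not mention C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>

volatile int vol_counter;   /* every access is performed, but ++ is load + add + store */
atomic_int   atm_counter;   /* ++ is one indivisible read-modify-write */

/* Increment both counters 'times' times. */
void bump_both(int times)
{
    assert(times >= 0);
    for (int i = 0; i < times; i++) {
        vol_counter++;                       /* three separate steps: can be interleaved */
        atomic_fetch_add(&atm_counter, 1);   /* cannot be torn by another thread */
    }
}
```

When a single thread runs bump_both() the two counters always agree. If several threads called it concurrently, only atm_counter would be guaranteed to reach the total number of increments.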
Memory barrier
• Implementation on GCC:
    asm volatile ("" : : : "memory");
• This inline assembly code:
  1. contains no instructions
  2. may read or write all of RAM
• Hence:
  compiler reordering of memory accesses is not allowed
  around the barrier in either direction
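A minimal sketch of how this barrier might be used, assuming GCC or Clang. The macro name compiler_barrier and the variables payload and ready are hypothetical and serve only as illustration.

```c
#include <assert.h>

/* GCC/Clang-specific compiler barrier: no instruction is emitted, but the
 * "memory" clobber tells the compiler that all of memory may be read or written. */
#define compiler_barrier() __asm__ __volatile__("" : : : "memory")

int payload;   /* hypothetical shared data */
int ready;     /* hypothetical flag */

void publish(int value)
{
    payload = value;
    compiler_barrier();   /* the compiler may not move either store across this point */
    ready = 1;
}

int read_payload(void)
{
    assert(ready == 1);   /* only meaningful after publish() */
    compiler_barrier();   /* payload is re-read from memory after the check */
    return payload;
}
```

This constrains the compiler only. The processor can still perform the two stores out of order, so on a multiprocessor a hardware barrier is needed as well.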
Store Buffer
• Records the store in a buffer until it is actually performed
• Hides memory latency
  • Cache latency
  • Cache miss on write
• The processor can execute other instructions in the meantime
• Data dependency (RAW), two options:
  • Wait until the write is actually performed in memory or in cache
  • Read the data from the store buffer (store forwarding)
• Data dependency (WAW), two options:
  • Add a new entry in the store buffer
  • Replace the previous write in the store buffer
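The behavior listed above can be sketched as a toy, single-threaded model of one processor with a store buffer. It picks store forwarding for RAW and in-place replacement for WAW; the sizes, type names, and function names are assumptions for illustration and do not describe any real microarchitecture.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_WORDS 8     /* size of the toy memory (assumption) */
#define SB_SLOTS  4     /* store buffer capacity (assumption) */

struct sb_entry { size_t addr; int value; };

struct cpu {
    int memory[MEM_WORDS];          /* stands in for cache/memory */
    struct sb_entry sb[SB_SLOTS];   /* pending stores, oldest first */
    size_t sb_len;
};

/* Perform all pending stores, oldest first, and empty the buffer. */
void sb_drain(struct cpu *c)
{
    for (size_t i = 0; i < c->sb_len; i++)
        c->memory[c->sb[i].addr] = c->sb[i].value;
    c->sb_len = 0;
}

/* Store: a WAW hit replaces the pending entry; otherwise a new entry
 * is appended (draining first if the buffer is full). */
void cpu_store(struct cpu *c, size_t addr, int value)
{
    assert(addr < MEM_WORDS);
    for (size_t i = 0; i < c->sb_len; i++) {
        if (c->sb[i].addr == addr) {
            c->sb[i].value = value;
            return;
        }
    }
    if (c->sb_len == SB_SLOTS)
        sb_drain(c);
    c->sb[c->sb_len].addr = addr;
    c->sb[c->sb_len].value = value;
    c->sb_len++;
}

/* Load: a RAW hit is served from the store buffer (store forwarding);
 * otherwise the value comes from memory. */
int cpu_load(const struct cpu *c, size_t addr)
{
    assert(addr < MEM_WORDS);
    for (size_t i = 0; i < c->sb_len; i++)
        if (c->sb[i].addr == addr)
            return c->sb[i].value;
    return c->memory[addr];
}
```

Until sb_drain() runs, the storing processor already reads its new value while memory, and therefore every other processor, still holds the old one.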
Example
• Processor P1 executes
  • 1) store A
  • 2) store B
• A and B are shared with P2:
  • A is in P2 cache
  • B is in both caches
[Figure: P1 and P2, each with a store buffer and a cache, connected through the interconnect; P1's cache holds B, P2's cache holds B and A]
Example
• Execution:
  • 1: store A: cache miss
    • write the updated value in the store buffer
    • send a read request (data will come from P2 cache)
    • several clock cycles needed
    • P1 can proceed (the new value is in the store buffer)
    • P2 does not see the write
  • 2: store B: cache hit
    • data is written in cache
    • a coherence message is sent to P2
    • P2 sees the write
  • 3: A is loaded in P1 cache
  • 4: A is updated in P1 cache
    • a coherence message is sent to P2
    • P2 sees the write
• ⇒ P2 sees the store on B first, then the store on A
Consequence
initially: A=0 and B=0
P1:
  A = 1
  B = 1
P2:
  while (B==0) continue;
  assert (A==1); /* this can fail! */
If P2 sees the stores performed by P1 in reverse order, the assertion fails
Note:
A and B are volatile
Cache coherency
• Cache coherency can require cache line invalidation
  • A processor sends an invalidate message to another one
  • The target processor must invalidate the cache line
• Invalidate Queue
  • Stores invalidate requests while the cache is busy
  • Invalidates the line when the cache is ready
Data prefetch
• The processor can read data before the actual load
  instruction
  • Hides memory latency
  • Preloads data in cache
• Speculative execution
  • Executes instructions that follow a branch before the branch is resolved
Banked cache architectures
• Caches are split in several banks
• While accesses to busy banks must wait, accesses to idle
  banks can proceed
[Figure: processor, store buffer, two cache banks, and the interconnect]
Definitions
• Program order
  • The order of operations as specified by software
• Execution order
  • The order of operations as executed by a processor
• Perceived order
  • The order of operations as seen by processors and memories
• Memory consistency model
  • Rules that specify the allowed behavior of programs in terms
    of memory accesses
  • Rules: order restrictions
Definitions
• Performed
  • Write
    • a write by processor i is performed with respect to processor k when:
      • a read issued by k to the same address returns the value stored by i
  • Read
    • a read by processor i is performed with respect to processor k when:
      • a write issued by k to the same address cannot affect the value read by i
• Globally Performed
  • globally performed: performed with respect to all processors
  • Write
    • A write is globally performed when its modification has been
      propagated to all processors
  • Read
    • A read is globally performed when the value it returns is bound and the
      write that wrote this value is globally performed
Memory consistency models
• Rules on access ordering can regard:
  • Location (address of access)
  • Direction
    • read, write, read-write
  • Value
  • Causality
    • behavior of an access depends on the behavior of another one
  • Category
    • shared / private
    • synchronizing / not synchronizing
Memory consistency models
• Uniform consistency models
  • Rules do not concern the category of accesses
• Hybrid consistency models
  • The category of accesses matters
Uniform consistency models
• Local Consistency (LC)
  • Each process sees its own accesses in program order
  • There is no restriction on the order of the accesses seen by other
    processors
  • Different processes may see different orders
  • The weakest consistency model:
    • it only guarantees sequential behavior on uniprocessor systems
    • Not usable to program in parallel environments
Uniform consistency models
• Sequential consistency (SC)
  • There is a global total order of all memory accesses (of all
    processors)
    • all processors agree on such global order
    • the global order can change at each run
  • Each processor sees its own accesses in program order
  • Prevents many architectural optimizations
  • Easy to use
  • Model implied by a cacheless system, with a single memory device,
    with processors unable to perform out-of-order execution
SC: Consequence
initially: A=0 and B=0
P1:
  1a. A = 1
  2a. B = 1
P2:
  1b. while (B==0) continue;
  2b. assert (A==1);
The assertion cannot fail
History (time flows left to right; in W(A)1, W is the access type, A the variable/address, 1 the data stored/read):
  P1: W(A)1 [1a]  W(B)1 [2a]
  P2: R(B)0 [1b]  R(B)0 [1b]  R(B)0 [1b]  R(B)1 [1b]  R(A)1 [2b]
Note:
A and B are volatile
SC: Consequence
Why the assertion 2b cannot fail:
• Each processor sees its own accesses in program order
  • 1a < 2a and 1b < 2b
• All processors agree on a global order
  • the last 1b reads B==1, hence 2a < 1b
• Access 1a is before access 2b
It is easy to enforce order between accesses from different processors
Sequential consistency
• Cache-based system, no constraint on the interconnect
• Sufficient conditions
  • All processors issue their accesses in program order
  • A processor does not issue an access until its previous accesses have been
    globally performed
    • Requires waiting for acknowledgements from other processors
• Prevents many architectural optimizations
  • No out-of-order execution
  • A write hit in the cache must wait for the answers
• Easy to use
Comparison
• The union of all the perceived orders can be valid or not
  for a given consistency model
• Example:
  • P1 executes:
    • I1: store A
    • I2: store B
  • P2 executes:
    • I3: load A
    • I4: load B
  • 1) P1 sees I1, I2; P2 sees I1, I3, I4, I2
    • valid execution for Sequential consistency
      • total order implied: I1, I3, I4, I2
  • 2) P1 sees I1, I2; P2 sees I3, I2, I4, I1
    • invalid execution for Sequential consistency
      • there is no single total order
    • valid execution for Local consistency
      • P1 and P2 see their own accesses in order
Comparison
• Consistency model A is stronger than consistency model B if:
  • each execution valid on A is also valid on B
  • also: B is weaker than A
• If there exist
  • some execution E1 valid on A and not valid on B
  • some execution E2 valid on B and not valid on A
• then A and B are incomparable
Uniform consistency models
• Causal consistency (Causal)
  • All processors agree on the order of causally related events
    • causally unrelated events can be observed in different orders
  • Example
    • X is initially 0
    • event1: P1 writes 1 to X
    • event2: P2 reads X and obtains 1
    • event3: P1 writes 2 to X
    • hence:
      • event1 happened before event2
      • event2 happened before event3
      • all processors agree on such an ordering
  • Example
    • X is initially 0
    • event1: P2 reads X and obtains 0
    • event2: P1 writes 1 to X
    • event3: P2 reads X and obtains 1
    • hence:
      • event1 happened before event2
      • event2 happened before event3
      • all processors agree on such an ordering
Causal consistency: example 1
initially: X=0
P1:
  1a: X = 1
  2a: X = 3
P2:
  1b: A = X
  2b: X = 2
P3:
  1c: B = X
  2c: D = X
  3c: F = X
P4:
  1d: C = X
  2d: E = X
  3d: G = X
result:
A=1 ; B=1 ; C=1 ; D=3 ; E=2 ; F=2 ; G=3
Causal consistency: example 1
initially: X=0
  P1: W(X)1  W(X)3
  P2: R(X)1  W(X)2
  P3: R(X)1  R(X)3  R(X)2
  P4: R(X)1  R(X)2  R(X)3
Causal consistency: example 1
• for P3: 2a < 2b
• for P4: 2b < 2a
• ⇒ a single global order is not possible (P3 and P4 disagree on the order of 2a and 2b)
  ⇒ execution is not Sequentially consistent
• ⇒ no contradictions on causal dependencies
  ⇒ execution is Causally consistent
• Note: 2a and 2b are not causally related
Causal consistency: example 2
initially: X=0 , Y=0
P1:
  1a: X = 1
  2a: X = 2
P2:
  1b: A = X
  2b: Y = 3
P3:
  1c: B = Y
  2c: C = X
result:
A=2 ; B=3 ; C=1
Causal consistency: example 2
initially: X=0 , Y=0
  P1: W(X)1  W(X)2
  P2: R(X)2  W(Y)3
  P3: R(Y)3  R(X)1
Causal consistency: example 2
• for P2: 2a < 2b (A=2)
• for P3: 2b < 2a (B=3 and C=1)
• P2 and P3 disagree on the order between 2a and 2b
• 2a and 2b are causally related (constraint due to A=2)
• ⇒ execution is not Causally consistent
Causal consistency: example 3
initially: X=0 , Y=0
P1:
  1a: X = 1
  2a: X = 2
P2:
  1b: A = X
  2b: Y = 3
P3:
  1c: B = Y
  2c: C = X
result:
A=2 ; B=3 ; C=2
The execution is Causally consistent
Uniform consistency models
• PRAM (pipelined RAM) consistency (PRAM)
  • Writes performed by a single process are seen by all other
    processes in the order in which they were issued
    • the perceived order of all the writes seen can be different for each
      process
• Cache consistency (CC)
  • All writes to the same memory location are performed in
    some sequential order
    • all processes see the same order of writes for each location (but the
      order of all writes can differ)
PRAM consistency: example
initially: X=0
P1:
  1a: X = 1
P2:
  1b: A = X
  2b: X = 2
P3:
  1c: B = X
  2c: D = X
P4:
  1d: C = X
  2d: E = X
result:
A=1 ; B=1 ; C=2 ; D=2 ; E=1
PRAM consistency: example
initially: X=0
  P1: W(X)1
  P2: R(X)1  W(X)2
  P3: R(X)1  R(X)2
  P4: R(X)2  R(X)1
PRAM consistency: example
• all processors see the same order for the writes of P1 (1a) and of P2 (2b)
  (trivial: each processor performs a single write)
• ⇒ execution is PRAM consistent
• a single global order is not possible
• ⇒ execution is not Sequentially consistent
• P3 and P4 do not agree on the causal relation between 1a and 2b
  (A=1 makes 1a causally precede 2b, but P4 sees 2b first)
• ⇒ execution is not Causally consistent
Cache consistency: example
initially: X=0 ; Y=0
P1:
  1a: X = 1
  2a: A = Y
P2:
  1b: Y = 1
  2b: B = X
result:
A=0 ; B=0
Cache consistency: example
initially: X=0 ; Y=0
  P1: W(X)1  R(Y)0
  P2: W(Y)1  R(X)0
Cache consistency: example
• for P1: 1a < 1b (A=0)
• for P2: 1b < 1a (B=0)
• all processors see the same order for the writes on X (1a) and on Y (1b)
  (trivial: each location is written once)
• ⇒ execution is Cache consistent
• a single global order is not possible
• ⇒ execution is not Sequentially consistent
Uniform consistency models
• Processor consistency (PC)
  • PRAM consistent and Cache consistent
  • The Tie-Breaker (Peterson's) algorithm executes correctly under
    Processor consistency
  • The Bakery algorithm needs Sequential consistency
  • Processor consistent machines are easier to build than sequentially
    consistent systems
Processor consistency: example
initially: X=0 ; Y=0
P1:
  1a: X = 1
  2a: c = 1
  3a: A = Y
P2:
  1b: Y = 1
  2b: c = 2
  3b: B = X
result:
A=0 ; B=0
Processor consistency: example
initially: X=0 ; Y=0
  P1: W(X)1  W(c)1  R(Y)0
  P2: W(Y)1  W(c)2  R(X)0
Processor consistency: example
• A=0 ⇒ for P1, 3a < 1b
• B=0 ⇒ for P2, 3b < 1a
• for P1:
  • 1a < 2a < 3a < 1b < 2b
• for P2:
  • 1b < 2b < 3b < 1a < 2a
• processors see different orders for the writes on c (P1: 2a < 2b; P2: 2b < 2a)
• ⇒ execution is not Processor consistent
Uniform consistency models
• Slow consistency (not to be confused with Sequential consistency, also abbreviated SC)
  • All processors agree on the order of observed writes to
    each location by a single processor
  • Writes by a process must be immediately visible to itself
  • Models a system where writes propagate slowly to memory and other
    processors
Uniform consistency models
[Figure: hierarchy of the uniform consistency models: Sequential, Causal, Processor, PRAM, Cache, Slow, and Local Consistency]
SC vs PC: example
initially: A=0 and B=0
P1:
  1a. A = 1;
  2a. X = B;
P2:
  1b. B = 1;
  2b. Y = A;
Note:
A and B are volatile
On sequentially consistent systems, X==0 and Y==0 is not possible
On processor consistent systems, X==0 and Y==0 is possible
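The claim for sequential consistency can be checked by brute force. The sketch below enumerates every interleaving of the two programs that respects program order (that is, every sequentially consistent execution) and records which final (X, Y) pairs are reachable. The function names and the bit encoding of the outcomes are assumptions for illustration. The processor consistent case, where a read may overtake the earlier write of the same processor, is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Explore all interleavings. pc1/pc2 are the next statement of P1/P2;
 * bit (2*X + Y) of *outcomes is set when the final pair (X, Y) is reachable. */
static void step(int pc1, int pc2, int A, int B, int X, int Y, unsigned *outcomes)
{
    if (pc1 == 2 && pc2 == 2) {                          /* both programs finished */
        *outcomes |= 1u << (2 * X + Y);
        return;
    }
    if (pc1 == 0) step(1, pc2, 1, B, X, Y, outcomes);    /* 1a. A = 1 */
    if (pc1 == 1) step(2, pc2, A, B, B, Y, outcomes);    /* 2a. X = B */
    if (pc2 == 0) step(pc1, 1, A, 1, X, Y, outcomes);    /* 1b. B = 1 */
    if (pc2 == 1) step(pc1, 2, A, B, X, A, outcomes);    /* 2b. Y = A */
}

/* Set of (X, Y) results reachable under sequential consistency. */
unsigned sc_outcomes(void)
{
    unsigned outcomes = 0;
    step(0, 0, 0, 0, 0, 0, &outcomes);
    return outcomes;
}

/* True when sequential consistency allows the final values (X, Y). */
bool sc_allows(int X, int Y)
{
    assert((X == 0 || X == 1) && (Y == 0 || Y == 1));
    return (sc_outcomes() >> (2 * X + Y)) & 1u;
}
```

Every interleaving executes at least one of the two writes before either read, which is why the pair (0, 0) never appears.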
SC vs PC: example
initially: A=0 , B=0 , C=0 , D=0 , E=0
P1:
  1a. A = 1;
  2a. B = D;
  3a. C = 1;
P2:
  1b. D = 1;
  2b. E = A;
  3b. while (C==0) continue;
  4b. assert(B==1 || E==1);
The assertion 4b cannot fail on sequentially consistent systems,
but can fail on processor consistent systems
Note:
A, B, C, D, E are volatile
Consistency model and synchronization
• For 2 processes,
  many synchronization patterns work in the same way
  on processor consistent systems and on
  sequentially consistent systems
• It is possible to construct a situation in which processor
  ordering fails, but such code is unlikely to be useful
Signaling
initially: A=0 and B=0
P1:
  1a. A = 1;
  2a. B = 1;
P2:
  1b. while (B==0) continue;
  2b. assert (A==1);
The assertion cannot fail on sequentially consistent and on processor consistent systems
History:
  P1: W(A)1 [1a]  W(B)1 [2a]
  P2: R(B)0 [1b]  R(B)0 [1b]  R(B)0 [1b]  R(B)1 [1b]  R(A)1 [2b]
Note:
A and B are volatile
Barrier
initially: A=0 , B=0 , C=0 , D=0
P1:
  1a. A = 1;
  2a. B = 1;
  3a. while (D==0) continue;
  4a. assert (A==1 && C==1);
P2:
  1b. C=1;
  2b. D=1;
  3b. while (B==0) continue;
  4b. assert (A==1 && C==1);
The assertions cannot fail on sequentially consistent and on processor consistent systems
Note:
A, B, C, D are volatile
Consistency model and synchronization
• For 3 or more processes,
  there are simple synchronization patterns that work on
  sequentially consistent systems but not on processor
  consistent systems
• However,
  small changes are enough to obtain correct
  synchronization even on processor consistent systems
Signaling
initially: A=0 , B=0 , C=0
P1:
  1a. A = 1;
  2a. B = 1;
P2:
  1b. while (B==0) continue;
  2b. C=1;
P3:
  1c. while (C==0) continue;
  2c. assert (A==1);
Note:
A, B, C are volatile
The assertion 2c cannot fail on sequentially consistent systems,
but can fail on processor consistent systems
WA1 and WC1 are performed on different variables by different cores:
on PC systems no order is enforced
Signaling exploiting cache coherency
initially: A=0 and B=0
P1:
  1a. A = 1;
  2a. B = 1;
P2:
  1b. while (B==0) continue;
  2b. B=2;
P3:
  1c. while (B!=2) continue;
  2c. assert (A==1);
Note:
A and B are volatile
The assertion 2c cannot fail on processor consistent systems
WB1 and WB2 are performed by different cores on the same variable:
cache coherency enforces access ordering
Signaling exploiting cache coherency
initially: A=0 , B=0 , C=0
P1:
  1a. A = 1;
  2a. B = 1;
P2:
  1b. while (B==0) continue;
  2b. B=1;
  3b. C=1;
P3:
  1c. while (C==0) continue;
  2c. assert (A==1);
Note:
A, B, C are volatile
The assertion 2c cannot fail on processor consistent systems
WB1 (2a) and WB1 (2b) are performed by different cores on the same variable:
cache coherency enforces access ordering
WB1 (2b) and WC1 (3b) are performed by the same processor:
order is enforced by PRAM consistency
Hybrid consistency models
• Weak Consistency (WC)
• Release Consistency (RC)
• Entry Consistency (EC)
• Others
  • Scope Consistency
  • Location Consistency
  • Dag Consistency
Hybrid consistency models
• Weak Consistency (WC)
  • 2 types of accesses
    • not synchronizing (read, write, read-write)
    • synchronizing
  • Accesses to synchronization variables are sequentially consistent
  • No access to a synchronization variable is issued by a processor
    before all previous data accesses have been performed
  • No access is issued by a processor before a previous access to a
    synchronization variable has been performed
  • Ordinary reads and writes obey Local consistency
  • A synchronization access works as a fence
Weak consistency
  P1: W(X)1 [1a]  sync_W(Y)1 [2a]
  P2: sync_R(Y)1 [1b]  R(X)1 [2b]
• 1a < 2a cannot be reordered, since 2a is a synchronizing access
• 1b < 2b cannot be reordered, since 1b is a synchronizing access
• Y=1 ⇒ 2a < 1b
• Hence:
  • in 2b, X must be 1
Data Race
• Conflicting accesses
  • accesses to the same address from different processors,
    where at least one is a write
• Access order
  • order can be enforced by the consistency model (SC) or by
    using a synchronization access
• Data race:
  • 2 conflicting accesses with no ordering imposed
SC-DRF
• Sequential consistency for Data-Race-Free programs
• A program executing on a weakly consistent system
  appears sequentially consistent if:
  • there are no data races (i.e., no competing accesses)
  • synchronization is visible to the memory system
Hybrid consistency models
• Release Consistency (RC)
  • 2 kinds of synchronization accesses
    • acquire
      • only delays future accesses
      • often associated with a read: load_acquire
    • release
      • only waits for previous accesses
      • often associated with a write: store_release
  • Synchronization accesses are Processor consistent
  • acquire and release act as semi-permeable barriers
Memory Acquire and Release
  1: access1
  2: access2
  3: acquire
  4: access3
  5: access4
  6: release
  7: access5
  8: access6
• access1 and access2 can be reordered before and after “Acquire”, but not after “Release”
• access3 and access4 can be reordered only between “Acquire” and “Release”
• access5 and access6 can be reordered before and after “Release”, but not before “Acquire”
Release consistency
  P1: W(X)1 [1a]  W_rel(Y)1 [2a]
  P2: R_acq(Y)1 [1b]  R(X)1 [2b]
• 1a < 2a cannot be reordered, since 2a is a release access
• 1b < 2b cannot be reordered, since 1b is an acquire access
• Y=1 ⇒ 2a < 1b
• Hence:
  • in 2b, X must be 1
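The history above can be sketched with C11 atomics, where memory_order_release and memory_order_acquire play the roles of W_rel and R_acq. The use of <stdatomic.h> and the function names are assumptions for illustration. The reader is written as a single non-blocking attempt rather than a loop, so the two functions can also be exercised from one thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int X;            /* ordinary shared data, initially 0 */
static atomic_int Y;     /* synchronization variable, initially 0 */

/* P1: 1a. W(X)1   2a. W_rel(Y)1 */
void p1_publish(void)
{
    X = 1;
    atomic_store_explicit(&Y, 1, memory_order_release);   /* 1a cannot move after 2a */
}

/* P2: 1b. R_acq(Y)   2b. R(X). Returns false if Y was still 0. */
bool p2_try_consume(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&Y, memory_order_acquire) == 0)
        return false;
    *out = X;            /* 2b cannot move before 1b: X must be 1 here */
    return true;
}
```

In a real program p1_publish() and p2_try_consume() would run on different threads, with the consumer retrying until it returns true.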
Hybrid consistency models
• Entry Consistency (EC)
  • similar to RC
  • differences:
    • each shared variable is associated with a synchronizing variable
      • the association can change dynamically under program control
    • a synchronizing variable is a lock or a barrier
    • acquire accesses can be exclusive or non-exclusive
Synchronizing accesses
• A synchronizing access combines:
  1. Reordering constraint
  2. Memory access
• Kinds of synchronizing accesses
  • Full fences
    • Weak consistency
  • Release and Acquire
    • Release consistency
Memory barriers
• Synchronizing accesses without a memory access
  • mechanism to control out-of-order execution
• Instructions that prevent memory access reordering
  • read barriers: prevent reordering of reads
    • e.g., wait until the invalidate queue is empty
  • write barriers: prevent reordering of writes
    • e.g., wait until the store buffer is empty
  • full barriers: act on all accesses
Memory barriers
  P1: W(X)1 [1a]  barrier1 [2a]  W(Y)1 [3a]
  P2: R(Y)1 [1b]  R(X)? [2b]
• For P2:
  • 1b < 2b program order
  • 1a and 3a are executed in order,
    but 1b and 2b can be executed out of order
    • for P2 this is the same as: “1a and 3a can be executed out of order”
• Y=1 ⇒ 3a < 1b
• Hence:
  • in 2b, X can be either 0 or 1
Memory barriers
  P1: W(X)1 [1a]  barrier1 [2a]  W(Y)1 [3a]
  P2: R(Y)1 [1b]  barrier2 [2b]  R(X)1 [3b]
• For P2:
  • 1b < 2b < 3b program order, enforced by barrier2
  • 1a < 2a < 3a enforced by barrier1
• Y=1 ⇒ 3a < 1b
• Hence:
  • 1a < 2a < 3a < 1b < 2b < 3b
  • in 3b, X must be 1
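A minimal sketch of the two-barrier history in portable C11, assuming that atomic_thread_fence() is an acceptable stand-in for barrier1 and barrier2. The slides do not say which barrier instruction is meant; release and acquire fences are one possible mapping. Function names are illustrative, and the reader makes a single attempt instead of spinning.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int X, Y;   /* both initially 0 */

/* P1: 1a. W(X)1   2a. barrier1   3a. W(Y)1 */
void p1_write(void)
{
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);              /* barrier1 */
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
}

/* P2: 1b. R(Y)   2b. barrier2   3b. R(X). Returns -1 if Y was still 0. */
int p2_read(void)
{
    if (atomic_load_explicit(&Y, memory_order_relaxed) == 0)
        return -1;
    atomic_thread_fence(memory_order_acquire);              /* barrier2 */
    int x = atomic_load_explicit(&X, memory_order_relaxed);
    assert(x == 1);       /* guaranteed by the pair of fences */
    return x;
}
```

Removing either fence brings back the first history, where the read of X may return 0 or 1.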
CPU's memory consistency models
• Processors implement out-of-order execution
  • Store buffer, cache coherency, ...
• CPU specifications provide rules about possible reordering
  • Different memory areas can have different rules
• ISAs provide instructions to control reordering
  • Barriers
Memory consistency model – Alpha
• There is a partial order: BEFORE (or <=)
  • global relation (memory order)
• Processors can perform accesses out of order
  • accesses: Instruction fetch (IF), Read (R), Write (W)
  • when addresses overlap:
    • IF-IF: maintain order
    • IF-W: maintain order
    • R-R: maintain order
    • R-W: maintain order
    • W-W: maintain order
• I-cache and pipeline are not coherent
• Three kinds of barriers:
  • MB: force no reordering between reads and writes
  • WMB: force no reordering between writes
  • IMB: force no reordering for reads, writes and I-fetches
Memory consistency model – Alpha
• Order is not enforced for data dependency
initially: global_ptr = NULL
P1:
  1a: ptr = malloc(...);
  2a: ptr->key = val;
  3a: ptr->data = data;
  4a: wmb
  5a: global_ptr = ptr;
P2:
  1b: while (global_ptr==NULL) continue;
  2b: mb
  3b: myval = global_ptr->key;
  4b: mydata = global_ptr->data;
There is a data dependency from 1b to 3b, but the addresses do not overlap:
a barrier is required for P2
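The publication pattern of the Alpha example can be sketched in portable C11, under the assumption that a release store stands in for "wmb + store" on P1 and an acquire load stands in for "load + mb" on P2. The struct layout and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct item { int key; int data; };

static _Atomic(struct item *) global_ptr;   /* initially NULL */

/* P1 (1a-5a): fill the item, then publish it. Returns 0 on success.
 * The item is never freed in this sketch. */
int publish_item(int key, int data)
{
    struct item *ptr = malloc(sizeof *ptr);
    if (ptr == NULL)
        return -1;
    ptr->key = key;
    ptr->data = data;
    /* release: the two field writes cannot be seen after the pointer */
    atomic_store_explicit(&global_ptr, ptr, memory_order_release);
    return 0;
}

/* P2 (1b): one polling step; non-zero once the pointer is visible. */
int item_published(void)
{
    return atomic_load_explicit(&global_ptr, memory_order_acquire) != NULL;
}

/* P2 (2b-4b): the acquire load orders the field reads after the pointer read. */
void read_item(int *myval, int *mydata)
{
    const struct item *p = atomic_load_explicit(&global_ptr, memory_order_acquire);
    assert(p != NULL);    /* caller has already seen a non-NULL global_ptr */
    *myval = p->key;
    *mydata = p->data;
}
```

On most architectures the data dependency alone would already order the field reads after the pointer read; the acquire load keeps the sketch correct on machines that, like Alpha, do not enforce that.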
Memory consistency model – ARMv7
• No global memory order
• Accesses to a single address are seen in the same order
  by all processors (Cache coherency)
• Instruction fetches, data reads, data writes can be
  performed out of order
• Data dependent loads are not reordered
• I-cache and pipeline are not coherent
Memory consistency model – ARMv7
• Three kinds of barriers:
  • DMB: Data Memory Barrier
    • all specified memory accesses before the barrier must be completed
      before any (specified) memory access after the barrier is started
  • DSB: Data Synchronization Barrier
    • all specified memory accesses before the barrier must be completed
      before any instruction after the barrier is started
  • ISB: Instruction Synchronization Barrier
    • flushes the pipeline
Note: DMB, DSB, and ISB instructions are added in ARMv7
Previous versions use CP15 to implement barrier operations
In ARMv6, barrier operations are always defined
In ARMv4 and ARMv5, barrier operations may not exist
Memory consistency model – ARMv7
Memory types
• Normal
  • 3 levels of shareability
    • Non-shareable
      • for Normal memory that is used by only a single processor
    • Inner Shareable
      • for Normal memory that is shared between several processors
    • Outer Shareable
      • for Normal memory that is shared between processors and devices
  • Cacheability
    • Non-cacheable
    • Write-Through Cacheable
    • Write-Back Write-Allocate Cacheable
    • Write-Back no Write-Allocate Cacheable
Memory consistency model – ARMv7
Memory types
• Device
  • Accesses are strongly ordered
    • All memory accesses occur in program order
  • Shareability
    • Shareable
      • for memory-mapped peripherals that are shared by several processors
    • Non-shareable
      • for memory-mapped peripherals that are used only by a single processor
  • Cacheability: Non-cacheable
  • a write to Device memory is permitted to complete before it reaches the target
• Strongly-ordered
  • Accesses are strongly ordered
  • Shareability: all Strongly-ordered regions are assumed to be Shareable
  • Cacheability: Non-cacheable
  • a write to Strongly-ordered memory can complete only when it reaches the target
Memory consistency model – ARMv7
• DMB (or DSB) sy
  • Barrier for all memory accesses that refer to domain “Outer Shareable”
    (full system barrier)
• DMB (or DSB) st
  • Barrier for writes that refer to domain “Outer Shareable”
• DMB (or DSB) sh
  • Barrier for all memory accesses that refer to domain “Inner Shareable”
• DMB (or DSB) shst
  • Barrier for writes that refer to domain “Inner Shareable”
• DMB (or DSB) un
  • Barrier for all memory accesses that refer to domain “Non-Shareable”
• DMB (or DSB) unst
  • Barrier for writes that refer to domain “Non-Shareable”
Memory consistency model – MIPS32
• Three kinds of barriers:
  • Completion Barriers
    • all specified memory accesses before the barrier must be completed (globally performed)
      before the barrier
    • memory accesses after the barrier are started after the barrier
    • SYNC (or SYNC 0): acts on R and W (required in all implementations)
  • Ordering Barriers (optional)
    • all specified memory accesses before the barrier must be completed before the barrier
    • SYNC_WMB (or SYNC 4): acts on W
    • SYNC_MB (or SYNC 16): acts on R and W
    • SYNC_ACQUIRE (or SYNC 17): acts on R (before) and R and W (after)
    • SYNC_RELEASE (or SYNC 18): acts on R and W (before) and W (after)
    • SYNC_RMB (or SYNC 19): acts on R
  • Instruction cache barrier
    • Synchronize Caches to Make Instruction Writes Effective
    • SYNCI
      • an I-cache line is updated
      • to be used after a code change
Memory consistency model – IA-32
• Memory areas can be:
  • UC: uncacheable
    • strong ordering is enforced
    • useful for memory-mapped devices
  • WC: write-combining
    • cached in special buffers, coherence not enforced
    • useful for framebuffers (the order of writes is not relevant)
  • WB: cacheable, with write-back policy
    • coherence enforced
  • WT: cacheable, with write-through policy
    • coherence enforced
    • useful for devices that access memory (DMA-capable devices) without
      implementing cache coherency protocols
Memory consistency model – IA-32
• For WB and WT:
  • there is a global memory ordering
  • order is maintained for:
    • R-R, R-W, W-W
  • order is not maintained for:
    • W-R
      • the read obtains data from the forwarding path
  • some streaming store instructions allow W-W reordering
    • MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD
  • string operations allow W-W reordering
Memory consistency model – IA-32
• For WB memory areas:
  • Individual processors use the same ordering principles as in a
    single-processor system.
  • Writes by a single processor are observed in the same order
    by all processors.
  • Writes from an individual processor are NOT ordered with
    respect to the writes from other processors.
  • Memory ordering obeys causality (memory ordering respects
    transitive visibility).
  • Any two stores are seen in a consistent order by processors
    other than those performing the stores.
  • Locked instructions have a total order.
Memory consistency model – IA-32
• Three kinds of barriers:
  • MFENCE
    • Serializes load and store operations
    • guarantees that all loads and stores specified before the fence are globally
      observable prior to any loads or stores being carried out after the fence
  • LFENCE
    • Serializes load operations
    • guarantees ordering between two loads and prevents speculative loads
      from passing the load fence
  • SFENCE
    • Serializes store operations
    • guarantees that every store instruction that precedes the SFENCE in
      program order becomes globally visible before any store instruction that
      follows the SFENCE
Memory consistency models and OS
• The OS must provide primitives to enforce access ordering
  • processor vs processor accesses
    • not required on uni-processor systems
  • processor vs device accesses
    • required even on uni-processor systems
• Multi-architecture issue
  • Portable code must assume the weakest model of all supported architectures
  • Linux
    • weakest model: Alpha consistency model
    • does not guarantee ordering between data dependent accesses
Linux memory barriers
• Compiler barrier
  • prevents compiler reordering of accesses
  • the processor can still perform out-of-order accesses
  • barrier(): compiler directive, no instruction
• Processor vs processor barriers
  • smp_mb(): full memory barrier
  • smp_rmb(): memory barrier for reads
  • smp_wmb(): memory barrier for writes
  • smp_read_barrier_depends(): memory barrier for data dependency
• Processor vs anything barriers
  • mb(): full memory barrier
  • rmb(): memory barrier for reads
  • wmb(): memory barrier for writes
  • read_barrier_depends(): memory barrier for data dependency
Linux memory barriers – examples
Uni-processor systems (barrier = compiler barrier only):
  Linux primitive             Alpha     ARMv7     MIPS32    IA-32
  smp_mb                      barrier   barrier   barrier   barrier
  smp_rmb                     barrier   barrier   barrier   barrier
  smp_wmb                     barrier   barrier   barrier   barrier
  smp_read_barrier_depends    nothing   nothing   nothing   nothing
  mb                          mb        dsb       sync      mfence
  rmb                         mb        dsb       sync      lfence
  wmb                         wmb       dsb st    sync      sfence
  read_barrier_depends        mb        nothing   nothing   nothing
Multi-processor systems (barrier = compiler barrier only):
  Linux primitive             Alpha     ARMv7       MIPS32    IA-32
  smp_mb                      mb        dmb ish     sync      mfence
  smp_rmb                     mb        dmb ish     sync      barrier
  smp_wmb                     wmb       dmb ishst   sync      barrier
  smp_read_barrier_depends    mb        nothing     nothing   nothing
  mb                          mb        dsb         sync      mfence
  rmb                         mb        dsb         sync      lfence
  wmb                         wmb       dsb st      sync      sfence
  read_barrier_depends        mb        nothing     nothing   nothing