Memory Consistency Models · 2019-03-05 · Memory Consistency Models Calcolatori Elettronici e...

Memory Consistency Models

Calcolatori Elettronici e Sistemi Operativi

Sources of out-of-order memory accesses

• Compiler optimizations
• Store buffers
  - FIFOs for uncommitted writes
• Invalidate queues (for cache coherency)
• Data prefetch
• Banked cache architectures
• Networked interconnect
• Non-uniform memory access (NUMA) architectures: different accesses to memory have different latencies
• ...

Compiler optimizations

• The language semantics do not consider:
  1. Side effects of memory accesses
  2. Multi-threading
  3. Asynchronous execution
• The compiler can:
  - Reorder instructions
  - Eliminate operations
• Some compiler optimizations can be controlled with the volatile qualifier

C code:

void waitval (int *ptr)
{
    while (*ptr == 0)
        continue;
}

ARM assembly code:

waitval:
    ldr   r3, [r0]
    cmp   r3, #0
    movne pc, lr
loop:
    b     loop

The compiler does not need to consider that someone else can change *ptr: it loads *ptr once and, if the value is 0, loops forever without reloading it.
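A minimal sketch of how the volatile qualifier changes the example above. The name `waitval_bounded` and the `max_polls` parameter are hypothetical: the bound is added only so that the function always returns and can be exercised in a single thread.

```c
/*
 * Poll *ptr until it becomes non-zero or the polling budget runs out.
 * Because ptr points to a volatile int, the compiler must emit one load
 * per loop iteration and cannot hoist the read out of the loop.
 * max_polls is a hypothetical bound, not part of the original waitval.
 * Returns 1 if a non-zero value was observed, 0 otherwise.
 */
int waitval_bounded(volatile int *ptr, unsigned long max_polls)
{
    while (max_polls-- > 0) {
        if (*ptr != 0)
            return 1;
    }
    return 0;
}
```

With a plain `int *ptr` the compiler would be free to read `*ptr` once, as in the assembly listing; with `volatile int *ptr` every iteration performs a real load.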

C code:

int add3 (int x)
{
    int i;
    for (i=0; i<3; i++)
        x += x;
    return x;
}

ARM assembly code:

add3:
    mov r0, r0, asl #3
    mov pc, lr

This function always returns 8·x, so the compiler can replace the loop with a single shift.


C code:

int add_vals (int *vec)
{
    int y = vec[1];
    y += vec[0];
    return y;
}

ARM assembly code:

add_vals:
    ldr r3, [r0]
    ldr r0, [r0, #4]
    add r0, r0, r3
    mov pc, lr

The result does not depend on the access order, so the compiler can change the order of the loads (here vec[0] is loaded before vec[1]).


Volatile

• Semantics
  - Each read from a volatile variable requires an actual load and may return a different value
    - Compiler optimizations cannot merge reads from the same address
  - Each write to a volatile variable requires an actual store
    - Compiler optimizations cannot cancel stores
• Required to access the I/O address space

Note: these are the C/C++ semantics; the Java semantics differ (they also imply atomicity)

Examples

int *ptr;                               /* pointer to int */
volatile int *ptr_to_vol;               /* pointer to volatile int */
int *volatile vol_ptr;                  /* volatile pointer to int */
volatile int *volatile vol_ptr_to_vol;  /* volatile pointer to volatile int */

• Beware the semantics:
  - a = *ptr_to_vol; is a volatile access
  - a = *vol_ptr; is not a volatile access to the int (only the pointer itself is volatile)

Volatile

• Inconsistent qualification causes errors
• Volatile does not enforce ordering with non-volatile accesses
• Volatile does not enforce the order in which accesses are actually performed by the hardware
• Volatile does not mean atomic

Volatile

volatile int A;
volatile int B;

A = 1;  /* these two lines won't be */
B = 1;  /* reordered by the compiler */

int A;
volatile int B;

A = 1;  /* these two lines can be */
B = 1;  /* reordered by the compiler */

volatile int A;
volatile int B;

A = 1;  /* these two lines won't be reordered by the compiler, */
B = 1;  /* but the accesses can be reordered by the HW */

volatile int X;

X = 1;  /* this assignment can be interrupted or preempted */

Memory barrier

• Implementation on GCC:

asm volatile ("" : : : "memory");

• This inline assembly code:
  1. contains no instructions
  2. may read or write all of RAM
• Hence: the compiler is not allowed to reorder memory accesses around the barrier in either direction
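As an illustration, the barrier can be wrapped in a macro and placed between two stores whose order matters to the compiler. This is only a sketch for GCC-compatible compilers; the names `compiler_barrier` and `publish` are hypothetical.

```c
/* GCC/Clang-specific: an empty asm statement that clobbers "memory". */
#define compiler_barrier() __asm__ __volatile__("" : : : "memory")

/*
 * Write the payload, then raise the flag. The barrier stops the compiler
 * from moving either store across it. It emits no instruction, so the
 * hardware may still perform the two stores out of order.
 */
void publish(int *payload, int *ready, int value)
{
    *payload = value;
    compiler_barrier();
    *ready = 1;
}
```

This constrains only the compiler; ordering as seen by other processors still needs a hardware barrier.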

Store Buffer

• Records a store in a buffer until it is actually performed
• Hides memory latency
  - Cache latency
  - Cache miss on write
• The processor can execute other instructions in the meantime
• Data dependency (RAW), two options:
  - Wait until the write is actually performed in memory or in cache
  - Read the data from the store buffer (store forwarding)
• Data dependency (WAW), two options:
  - Add a new entry in the store buffer
  - Replace the previous write in the store buffer

Example

• Processor P1 executes:
  1) store A
  2) store B
• A and B are shared with P2:
  - A is in P2's cache
  - B is in both caches

[Figure: P1 and P2, each with a store buffer and a cache, connected by an interconnect. P1's cache holds B; P2's cache holds B and A.]

Example

• Execution:
  1. store A: cache miss
     - the updated value is written in the store buffer
     - a read request is sent (the data will come from P2's cache)
     - several clock cycles are needed
     - P1 can proceed (the new value is in the store buffer)
     - P2 does not see the write
  2. store B: cache hit
     - the data is written in the cache
     - a coherence message is sent to P2
     - P2 sees the write
  3. A is loaded in P1's cache
  4. A is updated in P1's cache
     - a coherence message is sent to P2
     - P2 sees the write
• ⇒ P2 sees the store on B first, then the store on A

Consequence

initially: A=0 and B=0

P1            P2
A = 1         while (B==0) continue;
B = 1         assert (A==1); /* this can fail! */

If P2 sees the stores performed by P1 in reverse order, the assertion fails

Note: A and B are volatile

Cache coherency

• Cache coherency can require cache line invalidation
  - A processor sends an invalidate message to another one
  - The target processor must invalidate the cache line
• Invalidate queue
  - Stores invalidate requests while the cache is busy
  - Invalidates the line when the cache is ready

Data prefetch

• The processor can read data before the actual load instruction
  - Hides memory latency
  - Preloads data in the cache
• Speculative execution
  - Executes instructions that follow a branch before the branch is resolved

Banked cache architectures

• Caches are split into several banks
• While accesses to busy banks must wait, accesses to idle banks can proceed

[Figure: a processor with a store buffer and two cache banks, connected to the interconnect.]

Definitions

• Program order
  - The order of operations as specified by the software
• Execution order
  - The order of operations as executed by a processor
• Perceived order
  - The order of operations as seen by processors and memories
• Memory consistency model
  - Rules that specify the allowed behavior of programs in terms of memory accesses
  - Rules: order restrictions

Definitions

• Performed
  - Write: a write by processor i is performed with respect to processor k when a read issued by k to the same address returns the value stored by i
  - Read: a read by processor i is performed with respect to processor k when a write issued by k to the same address cannot affect the value read by i
• Globally performed: performed with respect to all processors
  - Write: a write is globally performed when its modification has been propagated to all processors
  - Read: a read is globally performed when the value it returns is bound and the write that wrote this value is globally performed

Memory consistency models

• Rules on access ordering can regard:
  - Location (address of the access)
  - Direction
    - read, write, read-write
  - Value
  - Causality
    - the behavior of an access depends on the behavior of another one
  - Category
    - shared / private
    - synchronizing / not synchronizing

Memory consistency models

• Uniform consistency models
  - Rules do not concern the category of accesses
• Hybrid consistency models
  - The category of accesses matters

Uniform consistency models

• Local Consistency (LC)
  - Each process sees its own accesses in program order
  - There is no restriction on the order of the accesses seen by other processors
  - Different processes may see different orders
  - The weakest consistency model: it only guarantees sequential behavior on uniprocessor systems
  - Not usable to program in parallel environments
• Sequential consistency (SC)
  - There is a global total order of all memory accesses (of all processors)
    - all processors agree on this global order
    - the global order can change at each run
  - Each processor sees its own accesses in program order
  - Prevents many architectural optimizations
  - Easy to use
  - Model implied by a cacheless system with a single memory device and processors unable to perform out-of-order execution

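SC: Consequence

P1                P2
1a. A = 1         1b. while (B==0) continue;
2a. B = 1         2b. assert (A==1);

The assertion cannot fail

Note on the history notation: each row is the history of one processor along the time axis. In W(A)1, W is the access type, A is the variable/address, and 1 is the data stored/read; the label (1a) identifies the instruction.

P1: W(A)1 (1a)
P2: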

SC: Consequence

P1                P2
1a. A = 1         1b. while (B==0) continue;
2a. B = 1         2b. assert (A==1);

P1: W(A)1 (1a)   W(B)1 (2a)
P2: R(B)0 (1b)   R(B)0 (1b)   R(B)0 (1b)   R(B)1 (1b)   R(A)1 (2b)

SC: Consequence

P1                P2
1a. A = 1         1b. while (B==0) continue;
2a. B = 1         2b. assert (A==1);

• Each processor sees its own accesses in program order: 1a < 2a and 1b < 2b
• All processors agree on a global order, and P2 leaves the loop only after reading B==1: 2a < 1b
• Hence access 1a is before access 2b

It is easy to enforce order between accesses from different processors


Sequential consistency

• Cache-based system, no constraint on the interconnect
• Sufficient conditions
  - All processors issue their accesses in program order
  - A processor does not issue an access until its previous accesses have been globally performed
    - Requires waiting for acknowledgements from the other processors
• Prevents many architectural optimizations
  - No out-of-order execution
  - A write hit in the cache must wait for answers
• Easy to use

Comparison

• The union of all the perceived orders can be valid or not for a given consistency model
• Example:
  - P1 executes:
      I1: store A
      I2: store B
  - P2 executes:
      I3: load A
      I4: load B
  1) P1 sees I1, I2; P2 sees I1, I3, I4, I2
     - valid execution for Sequential consistency
     - total order implied: I1, I3, I4, I2
  2) P1 sees I1, I2; P2 sees I3, I2, I4, I1
     - invalid execution for Sequential consistency: there is no unique total order
     - valid execution for Local consistency: P1 and P2 see their own accesses in order
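The check used in the example above can be sketched in code: two perceived orders fit into one total order only if every pair of events seen by both observers appears in the same relative order. The function names and the encoding of instructions I1..I4 as the integers 1..4 are assumptions made for this illustration.

```c
#include <stddef.h>

/* Position of event id in a perceived order, or -1 if it was not observed. */
static int position_of(const int *order, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (order[i] == id)
            return (int)i;
    return -1;
}

/*
 * Returns 1 when every pair of events that appears in both perceived
 * orders has the same relative order in a and in b, 0 otherwise.
 */
int orders_agree(const int *a, size_t na, const int *b, size_t nb)
{
    for (size_t i = 0; i < na; i++) {
        for (size_t j = i + 1; j < na; j++) {
            int pi = position_of(b, nb, a[i]);
            int pj = position_of(b, nb, a[j]);
            if (pi >= 0 && pj >= 0 && pi > pj)
                return 0;   /* a sees a[i] first, b sees a[j] first */
        }
    }
    return 1;
}
```

For case 1, `orders_agree` on {1, 2} and {1, 3, 4, 2} returns 1; for case 2, on {1, 2} and {3, 2, 4, 1} it returns 0, because P1 sees I1 before I2 while P2 sees I2 before I1.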

Comparison

• Consistency model A is stronger than consistency model B if:
  - each execution valid on A is also valid on B
  - also: B is weaker than A
• If there exist
  - some execution E1 valid on A and not valid on B
  - some execution E2 valid on B and not valid on A
  then A and B are incomparable

• Causal consistency (Causal)
  - All processors agree on the order of causally related events
  - Causally unrelated events can be observed in different orders
• Example
  - X is initially 0
  - event 1: P1 writes 1 to X
  - event 2: P2 reads X and obtains 1
  - hence:
    - event 1 happened before event 2
    - event 2 happened before event 3
    - all processors agree on such an ordering

Causal consistency: example 1

initially: X=0

P1            P2            P3            P4
1a: X = 1     1b: A = X     1c: B = X     1d: C = X
2a: X = 3     2b: X = 2     2c: D = X     2d: E = X
                            3c: F = X     3d: G = X

result: A=1 ; B=1 ; C=1 ; D=3 ; E=2 ; F=2 ; G=3

P1: W(X)1  W(X)3
P2: R(X)1  W(X)2
P3: R(X)1  R(X)3  R(X)2
P4: R(X)1  R(X)2  R(X)3

• for P3: 2a < 2b
• for P4: 2b < 2a
• ⇒ a single global order is not possible (P3 and P4 disagree on the order of 2a and 2b) ⇒ the execution is not Sequentially consistent
• ⇒ no contradictions on causal dependencies ⇒ the execution is Causally consistent
• Note: 2a and 2b are not causally related


Causal consistency: example 2

initially: X=0 , Y=0

P1            P2            P3
1a: X = 1     1b: A = X     1c: B = Y
2a: X = 2     2b: Y = 3     2c: C = X

result: A=2 ; B=3 ; C=1

P1: W(X)1  W(X)2
P2: R(X)2  W(Y)3
P3: R(Y)3  R(X)1

• for P2: 2a < 2b (A=2)
• for P3: 2b < 2a (B=3 and C=1)
• P2 and P3 disagree on the order between 2a and 2b
• 2a and 2b are causally related (constraint due to A=2)
• ⇒ the execution is not Causally consistent



P1            P2            P3
1a: X = 1     1b: A = X     1c: B = Y
2a: X = 2     2b: Y = 3     2c: C = X

result: A=2 ; B=3 ; C=2

The execution is Causally consistent

• PRAM (pipelined RAM) consistency (PRAM)
  - Writes performed by a single process are seen by all other processes in the order in which they were issued
  - The perceived order of all the writes can be different for each process
• Cache consistency (CC)
  - All writes to the same memory location are performed in some sequential order
  - All processes see the same order of writes for each location (but the order of all writes can differ)

PRAM consistency: example

initially: X=0

P1            P2            P3            P4
1a: X = 1     1b: A = X     1c: B = X     1d: C = X
              2b: X = 2     2c: D = X     2d: E = X

result: A=1 ; B=1 ; C=2 ; D=2 ; E=1

P1: W(X)1
P2: R(X)1  W(X)2
P3: R(X)1  R(X)2
P4: R(X)2  R(X)1

• All processors see the same order for the writes of P1 (1a) and of P2 (2b); this is trivial, since each of them performs a single write
  ⇒ the execution is PRAM consistent
• A single global order is not possible
  ⇒ the execution is not Sequentially consistent
• P3 and P4 do not agree on the causal relation between 1a and 2b
  ⇒ the execution is not Causally consistent

Cache consistency: example

initially: X=0 ; Y=0

P1            P2
1a: X = 1     1b: Y = 1
2a: A = Y     2b: B = X

result: A=0 ; B=0

P1: W(X)1  R(Y)0
P2: W(Y)1  R(X)0

• for P1: 1a < 1b (A=0)
• for P2: 1b < 1a (B=0)
• All processors see the same order for the writes on X (1a) and on Y (1b); this is trivial, since each location is written once
  ⇒ the execution is Cache consistent
• A single global order is not possible
  ⇒ the execution is not Sequentially consistent

• Processor consistency (PC)
  - PRAM consistent and Cache consistent
  - The Tie-Breaker (Peterson's) algorithm executes correctly under Processor consistency
  - The Bakery algorithm needs Sequential consistency
  - Processor consistent machines are easier to build than sequentially consistent systems

Processor consistency: example

P1            P2
1a: X = 1     1b: Y = 1
2a: c = 1     2b: c = 2
3a: A = Y     3b: B = X

result: A=0 ; B=0

P1: W(X)1  W(c)1  R(Y)0
P2: W(Y)1  W(c)2  R(X)0

• A=0 ⇒ for P1, 3a < 1b
• B=0 ⇒ for P2, 3b < 1a
• for P1: 1a < 2a < 3a < 1b < 2b
• for P2: 1b < 2b < 3b < 1a < 2a
• The processors see different orders for the writes on c
  ⇒ the execution is not Processor consistent

• Slow consistency
  - All processors agree on the order of observed writes to each location by a single processor
  - Writes by a process must be immediately visible to itself
  - Models a system where writes propagate slowly to memory and to other processors

[Figure: hierarchy of the uniform consistency models, drawn from strongest to weakest: Sequential Consistency; Causal Consistency and Processor Consistency; PRAM Consistency and Cache Consistency; Slow Consistency; Local Consistency.]

SC vs PC: example

P1               P2
1a. A = 1;       1b. B = 1;
2a. X = B;       2b. Y = A;

On sequentially consistent systems, X==0 and Y==0 is not possible

On processor consistent systems, X==0 and Y==0 is possible
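The claim about sequential consistency can be checked by brute force: under SC every execution is some interleaving of the two programs, so it is enough to run all of them. The sketch below does that, assuming A and B are initially 0; the function name and the schedule encoding are hypothetical.

```c
/*
 * Enumerate every sequentially consistent interleaving of
 *   P1: 1a. A = 1;  2a. X = B;
 *   P2: 1b. B = 1;  2b. Y = A;
 * Bit k of sched selects who runs at step k (0 = P1, 1 = P2).
 * Returns 1 if some interleaving ends with X == 0 && Y == 0, else 0.
 * Assumes A and B are initially 0.
 */
int sc_allows_both_zero(void)
{
    for (unsigned sched = 0; sched < 16u; sched++) {
        int A = 0, B = 0, X = -1, Y = -1;
        int pc1 = 0, pc2 = 0;   /* next instruction of P1 and of P2 */

        for (int step = 0; step < 4; step++) {
            if ((sched >> step) & 1u) {
                if (pc2 == 0) B = 1;        /* 1b */
                else if (pc2 == 1) Y = A;   /* 2b */
                pc2++;
            } else {
                if (pc1 == 0) A = 1;        /* 1a */
                else if (pc1 == 1) X = B;   /* 2a */
                pc1++;
            }
        }
        /* Keep only schedules that ran exactly two steps per processor. */
        if (pc1 == 2 && pc2 == 2 && X == 0 && Y == 0)
            return 1;
    }
    return 0;
}
```

The function returns 0: X==0 forces 2a before 1b and Y==0 forces 2b before 1a, which together with program order would form a cycle. A processor consistent machine is not bound to a single interleaving, which is why it can produce that outcome.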

SC vs PC: example

initially: A=0 , B=0 , C=0 , D=0 , E=0

P1               P2
1a. A = 1;       1b. D = 1;
2a. B = D;       2b. E = A;
3a. C = 1;       3b. while (C==0) continue;
                 4b. assert(B==1 || E==1);

The assertion 4b cannot fail on sequentially consistent systems, but can fail on processor consistent systems

Note: A, B, C, D, E are volatile

Consistency model and synchronization

• For 2 processes, many synchronization patterns work in the same way on processor consistent systems as on sequentially consistent systems
• It is possible to construct a situation in which processor ordering fails, but such code is unlikely to be useful in practice

Signaling

P1               P2
1a. A = 1;       1b. while (B==0) continue;
2a. B = 1;       2b. assert (A==1);

The assertion cannot fail on sequentially consistent and on processor consistent systems

P1: W(A)1 (1a)   W(B)1 (2a)
P2: R(B)0 (1b)   R(B)0 (1b)   R(B)0 (1b)   R(B)1 (1b)   R(A)1 (2b)


Barrier

initially: A=0 , B=0 , C=0 , D=0

P1                                  P2
1a. A = 1;                          1b. C = 1;
2a. B = 1;                          2b. D = 1;
3a. while (D==0) continue;          3b. while (B==0) continue;
4a. assert (A==1 && C==1);          4b. assert (A==1 && C==1);

The assertions cannot fail on sequentially consistent and on processor consistent systems

Note: A, B, C, D are volatile

Consistency model and synchronization

• For 3 or more processes, there are simple synchronization patterns that work on sequentially consistent systems but not on processor consistent systems
• However, small changes are enough to obtain a correct synchronization even on processor consistent systems

Signaling

initially: A=0 , B=0 , C=0

P1             P2                            P3
1a. A = 1;     1b. while (B==0) continue;    1c. while (C==0) continue;
2a. B = 1;     2b. C = 1;                    2c. assert (A==1);

The assertion 2c cannot fail on sequentially consistent systems, but can fail on processor consistent systems

WA1 and WC1 are performed on different variables by different cores: on PC systems no order is enforced

Note: A, B, C are volatile

Signaling exploiting cache coherency

P1             P2                            P3
1a. A = 1;     1b. while (B==0) continue;    1c. while (B!=2) continue;
2a. B = 1;     2b. B = 2;                    2c. assert (A==1);

The assertion 2c cannot fail on processor consistent systems

WB1 and WB2 are performed by different cores on the same variable: cache coherency enforces access ordering

Signaling exploiting cache coherency

initially: A=0 , B=0 , C=0

P1             P2                            P3
1a. A = 1;     1b. while (B==0) continue;    1c. while (C==0) continue;
2a. B = 1;     2b. B = 1;                    2c. assert (A==1);
               3b. C = 1;

The assertion 2c cannot fail on processor consistent systems

WB1 (2a) and WB1 (2b) are performed by different cores on the same variable: cache coherency enforces access ordering

WB1 and WC1 are performed by the same processor: order is enforced by PRAM consistency

Note: A, B, C are volatile

Hybrid consistency models

• Weak Consistency (WC)
• Release Consistency (RC)
• Entry Consistency (EC)
• Others
  - Scope Consistency
  - Location Consistency
  - Dag Consistency

• Weak Consistency (WC)
  - 2 types of accesses
    - not synchronizing (read, write, read-write)
    - synchronizing
  - Accesses to synchronization variables are sequentially consistent
  - No access to a synchronization variable is issued by a processor before all previous data accesses have been performed
  - No access is issued by a processor before a previous access to a synchronization variable has been performed
  - Standard reads and writes obey Local consistency
  - A synchronization access works as a fence

Weak consistency

P1: W(X)1 (1a)        sync_W(Y)1 (2a)
P2: sync_R(Y)1 (1b)   R(X)1 (2b)

• 1a and 2a cannot be reordered, since 2a is a synchronizing access
• 1b and 2b cannot be reordered, since 1b is a synchronizing access
• Y=1 ⇒ 2a < 1b
• Hence: in 2b, X must be 1

Data Race

• Conflicting accesses
  - accesses to the same address from different processors, where at least one is a write
• Access order
  - the order can be enforced by the consistency model (SC) or by using a synchronization access
• Data race
  - 2 conflicting accesses with no ordering imposed

SC-DRF

• A program executing on a weakly consistent system appears sequentially consistent if:
  - there are no data races (i.e., no competing accesses)
  - synchronization is visible to the memory system
• Sequential consistency for Data-Race-Free programs

• Release Consistency (RC)
  - 2 kinds of synchronization accesses
    - acquire
      - only delays future accesses
      - often associated with a read: load_acquire
    - release
      - only waits for previous accesses
      - often associated with a write: store_release
  - Synchronization accesses are Processor consistent
  - acquire and release act as semi-permeable barriers

Program order:

1: access1
2: access2
3: acquire
4: access3
5: access4
6: release
7: access5
8: access6

• access1 and access2 can be reordered before and after "Acquire", but not after "Release"
• access3 and access4 can be reordered only between "Acquire" and "Release"
• access5 and access6 can be reordered before and after "Release", but not before "Acquire"
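A lock is the typical use of this acquire/release pair: taking the lock is the acquire, freeing it is the release, and the protected accesses play the role of access3 and access4. The following is a minimal sketch with C11 atomics; the type `demo_spinlock` and the function names are hypothetical, and C11 memory orders are used here only as a stand-in for the model described above.

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
} demo_spinlock;

static void demo_lock(demo_spinlock *l)
{
    /* acquire: later accesses cannot be moved before this point */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;   /* busy-wait until the flag was previously clear */
}

static void demo_unlock(demo_spinlock *l)
{
    /* release: earlier accesses cannot be moved after this point */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Adds delta to *counter inside the critical section, returns the new value. */
int locked_add(demo_spinlock *l, int *counter, int delta)
{
    demo_lock(l);
    *counter += delta;          /* stays between acquire and release */
    int result = *counter;
    demo_unlock(l);
    return result;
}
```

A lock is initialized with `demo_spinlock lock = { ATOMIC_FLAG_INIT };`. Accesses outside the critical section may still drift into it, which matches the semi-permeable behavior described above.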

Memory acquire and release: Release consistency

P1: W(X)1 (1a)        W_rel(Y)1 (2a)
P2: R_acq(Y)1 (1b)    R(X)1 (2b)

• 1a and 2a cannot be reordered, since 2a is a release access
• 1b and 2b cannot be reordered, since 1b is an acquire access
• Y=1 ⇒ 2a < 1b
• Hence: in 2b, X must be 1
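The history above maps naturally onto C11 atomics, taken here as one possible realization of W_rel and R_acq. In this sketch the function names are hypothetical, and the reader returns -1 instead of spinning so that it can be called from a single thread.

```c
#include <stdatomic.h>

static int X;           /* ordinary data */
static atomic_int Y;    /* synchronization variable, initially 0 */

/* P1: 1a. W(X)1   2a. W_rel(Y)1 */
void p1_writer(void)
{
    X = 1;
    atomic_store_explicit(&Y, 1, memory_order_release);
}

/*
 * P2: 1b. R_acq(Y)   2b. R(X)
 * Returns -1 if Y has not been set yet, otherwise the value read from X.
 */
int p2_reader(void)
{
    if (atomic_load_explicit(&Y, memory_order_acquire) != 1)
        return -1;
    return X;   /* ordered after the acquire, so it observes X == 1 */
}
```

Once `p2_reader` observes Y==1, the release/acquire pair guarantees that the read of X returns 1.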


• Entry Consistency (EC)
  - similar to RC
  - differences:
    - each shared variable is associated with a synchronizing variable
    - the association can change dynamically under program control
    - a synchronizing variable is a lock or a barrier
    - acquire accesses can be exclusive or non-exclusive

Synchronizing accesses

• Kinds of synchronizing accesses
  - Full fences: Weak consistency
  - Release and Acquire: Release consistency
• A synchronizing access combines:
  1. A reordering constraint
  2. A memory access

Memory barriers

• Synchronizing accesses without the access
  - a mechanism to control out-of-order execution
• Instructions that prevent memory access reordering
  - read barriers: prevent reordering of reads
    - e.g., wait until the invalidate queue is empty
  - write barriers: prevent reordering of writes
    - e.g., wait until the store buffer is empty
  - full barriers: act on all accesses

Memory barriers

P1: W(X)1 (1a)   barrier1 (2a)   W(Y)1 (3a)
P2: R(Y)1 (1b)   R(X)? (2b)

• For P2: 1b < 2b in program order
• 1a and 3a are executed in order, but 1b and 2b can be executed out of order
  - for P2 this is the same as "1a and 3a can be executed out of order"
• Y=1 ⇒ 3a < 1b
• Hence: in 2b, X can be either 0 or 1

Memory barriers

P1: W(X)1 (1a)   barrier1 (2a)   W(Y)1 (3a)
P2: R(Y)1 (1b)   barrier2 (2b)   R(X)1 (3b)

• 1a < 2a < 3a: enforced by barrier1
• 1b < 2b < 3b: program order for P2, enforced by barrier2
• Y=1 ⇒ 3a < 1b
• Hence: 1a < 2a < 3a < 1b < 2b < 3b, and in 3b X must be 1
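The two-barrier pattern can be written with C11 fences, used here as a portable stand-in for the hardware barriers discussed above. The names are hypothetical; `p2_try_consume` returns 0 instead of spinning when the flag is not yet set.

```c
#include <stdatomic.h>

static int data_x;          /* X in the history above */
static atomic_int flag_y;   /* Y in the history above, initially 0 */

void p1_publish(int value)
{
    data_x = value;                                           /* 1a: W(X)  */
    atomic_thread_fence(memory_order_release);                /* 2a: barrier1 */
    atomic_store_explicit(&flag_y, 1, memory_order_relaxed);  /* 3a: W(Y)1 */
}

/* Returns 1 and stores X in *out if Y was set, 0 otherwise. */
int p2_try_consume(int *out)
{
    if (atomic_load_explicit(&flag_y, memory_order_relaxed) == 0) /* 1b */
        return 0;
    atomic_thread_fence(memory_order_acquire);                /* 2b: barrier2 */
    *out = data_x;                                            /* 3b: R(X)  */
    return 1;
}
```

Dropping either fence brings back the previous case, in which the read of X may return the old value.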

CPU's memory consistency models

• Processors implement out-of-order execution
  - Store buffer, cache coherency, ...
• CPU specifications provide rules about the possible reorderings
  - Different memory areas can have different rules
• ISAs provide instructions to control reordering
  - Barriers

Memory consistency model – Alpha

• There is a partial order: BEFORE (or <=)
  - global relation (memory order)
• Processors can perform accesses out of order
  - accesses: Instruction-fetch (IF), Read (R), Write (W)
  - when addresses overlap:
    - IF-IF: maintain order
    - IF-W: maintain order
    - R-R: maintain order
    - R-W: maintain order
    - W-W: maintain order
• I-cache and pipeline are not coherent
• Three kinds of barriers:
  - MB: forces no reordering between reads and writes
  - WMB: forces no reordering between writes
  - IMB: forces no reordering for reads, writes and I-fetches
• Order is not enforced for data dependencies

initially: global_ptr = NULL

P1:
1a: ptr = malloc(...);
2a: ptr->key = val;
3a: ptr->data = data;
4a: wmb
5a: global_ptr = ptr;

P2:
1b: while (global_ptr==NULL) continue;
2b: mb
3b: myval = global_ptr->key;
4b: mydata = global_ptr->data;

There is a data dependency between 1b and 3b, but the addresses do not overlap: a barrier is required for P2
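The same publication pattern can be sketched in portable C11, with a release store in place of wmb and an acquire load in place of mb. This illustrates the ordering requirement and is not the Alpha instruction sequence; `struct item`, the static `slot` (which stands in for the malloc'ed object), and the function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct item {
    int key;
    int data;
};

static struct item slot;                    /* stands in for malloc(...) */
static _Atomic(struct item *) global_ptr;   /* initially NULL */

void p1_publish_item(int key, int data)
{
    struct item *ptr = &slot;               /* 1a */
    ptr->key = key;                         /* 2a */
    ptr->data = data;                       /* 3a */
    /* 4a + 5a: the release store orders the field writes before the pointer */
    atomic_store_explicit(&global_ptr, ptr, memory_order_release);
}

/* Returns 1 and fills *key and *data if an item was published, else 0. */
int p2_lookup(int *key, int *data)
{
    /* 1b + 2b: the acquire load orders the pointer read before the field reads */
    struct item *p = atomic_load_explicit(&global_ptr, memory_order_acquire);
    if (p == NULL)
        return 0;
    *key = p->key;                          /* 3b */
    *data = p->data;                        /* 4b */
    return 1;
}
```

On architectures that keep data-dependent loads in order the acquire can be cheap, while on Alpha it has to become a real barrier, as noted above.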


Memory consistency model – ARMv7

• No global memory order
• Accesses to a single address are seen in the same order by all processors (cache coherency)
• Instruction fetches, data reads, and data writes can be performed out of order
• Data-dependent loads are not reordered
• I-cache and pipeline are not coherent
• Three kinds of barriers:
  - DMB: Data Memory Barrier
    - all specified memory accesses before the barrier must be completed before any specified memory access after the barrier is started
  - DSB: Data Synchronization Barrier
    - all specified memory accesses before the barrier must be completed before any instruction after the barrier is started
  - ISB: Instruction Synchronization Barrier
    - flushes the pipeline

Note: the DMB, DSB, and ISB instructions were added in ARMv7. Previous versions use CP15 to implement barrier operations. In ARMv6, barrier operations are always defined. In ARMv4 and ARMv5, barrier operations may not exist.

Memory types

• Normal
  - 3 levels of shareability
    - Non-shareable: for Normal memory that is used by only a single processor
    - Inner Shareable: for Normal memory that is shared between several processors
    - Outer Shareable: for Normal memory that is shared between processors and devices
  - Cacheability
    - Non-cacheable
    - Write-Through Cacheable
    - Write-Back Write-Allocate Cacheable
    - Write-Back no Write-Allocate Cacheable
• Device
  - Accesses are strongly ordered: all memory accesses occur in program order
  - Shareability
    - Shareable: for memory-mapped peripherals that are shared by several processors
    - Non-shareable: for memory-mapped peripherals that are used only by a single processor
  - Cacheability: Non-cacheable
  - A write to Device memory is permitted to complete before it reaches the target
• Strongly-ordered
  - Accesses are strongly ordered
  - Shareability: all Strongly-ordered regions are assumed to be Shareable
  - Cacheability: Non-cacheable
  - A write to Strongly-ordered memory can complete only when it reaches the target

• Barrier options:
  - DMB (or DSB) sy: barrier for all memory accesses that refer to the "Outer Shareable" domain (full system barrier)
  - DMB (or DSB) st: barrier for writes that refer to the "Outer Shareable" domain
  - DMB (or DSB) sh: barrier for all memory accesses that refer to the "Inner Shareable" domain
  - DMB (or DSB) shst: barrier for writes that refer to the "Inner Shareable" domain
  - DMB (or DSB) un: barrier for all memory accesses that refer to the "Non-Shareable" domain
  - DMB (or DSB) unst: barrier for writes that refer to the "Non-Shareable" domain

Memory consistency model – MIPS32

• Completion barriers
  - all specified memory accesses before the barrier must be completed (globally performed) before the barrier
  - memory accesses after the barrier are started after the barrier
  - SYNC (or SYNC 0): acts on R and W (required in all implementations)
• Ordering barriers (optional)
  - all specified memory accesses before the barrier must be completed before the barrier
  - SYNC_WMB (or SYNC 4): acts on W
  - SYNC_MB (or SYNC 16): acts on R and W
  - SYNC_ACQUIRE (or SYNC 17): acts on R (before) and on R and W (after)
  - SYNC_RELEASE (or SYNC 18): acts on R and W (before) and on W (after)
  - SYNC_RMB (or SYNC 19): acts on R
• Instruction cache barrier
  - Synchronize Caches to Make Instruction Writes Effective: SYNCI
  - an I-cache line is updated
  - to be used after a code change

Memory consistency model – IA-32

• Memory areas can be:
  - UC: uncacheable
    - strong ordering is enforced
    - useful for memory-mapped devices
  - WC: write-combining
    - cached in special buffers, coherence not enforced
    - useful for framebuffers (the order of writes is not relevant)
  - WB: cacheable, with write-back policy
    - coherence enforced
  - WT: cacheable, with write-through policy
    - coherence enforced
    - useful for devices that access memory (DMA-capable devices) without implementing cache coherency protocols

• For WB and WT:
  - there is a global memory ordering
  - order is maintained for: R-R, R-W, W-W
  - order is not maintained for: W-R
    - the read obtains data from the forwarding path
  - some streaming store instructions allow W-W reordering: MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD
  - string operations allow W-W reordering

• For WB memory areas:
  - Individual processors use the same ordering principles as in a single-processor system.
  - Writes by a single processor are observed in the same order by all processors.
  - Writes from an individual processor are NOT ordered with respect to the writes from other processors.
  - Memory ordering obeys causality (memory ordering respects transitive visibility).
  - Any two stores are seen in a consistent order by processors other than those performing the stores.
  - Locked instructions have a total order.

• Barriers:
  - MFENCE: serializes load and store operations
    - guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence
  - LFENCE: serializes load operations
    - guarantees ordering between two loads and prevents speculative loads from passing the load fence
  - SFENCE: serializes store operations
    - guarantees that every store instruction that precedes the SFENCE in program order becomes globally visible before any store instruction that follows the SFENCE

Memory consistency models and OS

• The OS must provide primitives to enforce access ordering
  - processor vs processor accesses
    - not required on uni-processor systems
  - processor vs device accesses
    - required even on uni-processor systems
• Multi-architecture issue
  - Portable code must assume the weakest model of all the supported architectures
• Linux
  - weakest model: the Alpha consistency model
  - it does not guarantee ordering between data-dependent accesses

Linux memory barriers

• Compiler barrier
  - prevents compiler reordering of accesses
  - the processor can still perform out-of-order accesses
  - barrier(): compiler directive, no instruction
• Processor vs processor barriers
  - smp_mb(): full memory barrier
  - smp_rmb(): memory barrier for reads
  - smp_wmb(): memory barrier for writes
  - smp_read_barrier_depends(): memory barrier for data dependencies
• Processor vs anything barriers
  - mb(): full memory barrier
  - rmb(): memory barrier for reads
  - wmb(): memory barrier for writes
  - read_barrier_depends(): memory barrier for data dependencies

Linux memory barriers – examples

Uni-processor systems:

                            Alpha      ARMv7      MIPS32     IA-32
smp_mb                      barrier    barrier    barrier    barrier
smp_rmb                     barrier    barrier    barrier    barrier
smp_wmb                     barrier    barrier    barrier    barrier
smp_read_barrier_depends    nothing    nothing    nothing    nothing
mb                          mb         dsb        sync       mfence
rmb                         mb         dsb        sync       lfence
wmb                         wmb        dsb st     sync       sfence
read_barrier_depends        mb         nothing    nothing    nothing

Multi-processor systems:

                            Alpha      ARMv7      MIPS32     IA-32
smp_mb                      mb         dmb ish    sync       mfence
smp_rmb                     mb         dmb ish    sync       barrier
smp_wmb                     wmb        dmb ishst  sync       barrier
smp_read_barrier_depends    mb         nothing    nothing    nothing
mb                          mb         dsb        sync       mfence
rmb                         mb         dsb        sync       lfence
wmb                         wmb        dsb st     sync       sfence
read_barrier_depends        mb         nothing    nothing    nothing
