Ganesh Gopalakrishnan Associate Professor Computer Science University of Utah

Ganesh Gopalakrishnan

Associate Professor

Computer Science

University of Utah

www.cs.utah.edu/~ganesh

* Verification of Coherence Protocols against Shared Memory Consistency Models using Test Model-Checking

* Overview of the Utah Verifier Group

Intel MPG talk of 11/12/99, Santa Clara:

Past Utah Verifier group Members

• Ratan Nalumasu, PhD ‘98 (HP)

– new partial-order reduction algorithm and model-checker PV

– approach to write high-level specs for coherency protocols and obtain split transaction protocols automatically

– test model-checking approach

• Abdel Mokkedem, Postdoc (Compaq)

– help in above, plus modeling & verifying the PCI 2.1 protocol

• Rajnish Ghughal, MS ‘99 (Intel, Oregon)

– test model-checking for weak memory models

Present group Members

• Ravi Hosabettu, PhD student

– approach to pipelined processor modeling and verification using layered abstraction map

– recently finished verification of high-level design model of CPU with reorder buffer, branches, speculation, exceptions (PVS proof - 35 days)

• Michael Jones, PhD student

– verifying the PCI 2.1 protocol using an abstraction map to PCI_abstract followed by a special-purpose SML model-checker for PCI_abstract

• Annette Bunker, PhD student

– background research

• New group members: Ritwik Bhattacharya, Jason Yang, Ali Sezgin, Prosenjit Chatterjee

Verification of Coherence Protocols Against Shared Memory Consistency Models

Using Test Model-Checking

FM and shared-memory system design

• Processor-speed growth faster than memory speed-growth

• Mismatch exacerbated by shared memory multiprocessors

• Complex protocols employed to hide memory latencies

• Need for formal verification techniques that can be employed during design

• Handle strong (e.g. seq consistency) and weak (e.g. TSO) memory models


Related Work

• Graf (CAV’94)

– for more than SC (hence unsound for SC)

– properties depend on design

• Alur, McMillan, Peled (LICS’96)

– undecidable if data can be compared

• Nalumasu, Ghughal, Mokkedem, Gopalakrishnan (CAV’98)

• Henzinger, Qadeer, Rajamani (CAV’99)

– needs invariants

– invariants depend on design

– assumes address-symmetry

• Collier (‘80s)

– not available at design-time

Memory Models

• Describes memory system’s behavior in response to memory operations

[Figure: memory operations (read or write) from various processes entering the Memory System]

Uniprocessor Memory Model: the von Neumann model

• Memory operations (reads and writes) execute in the order in which they appear in program

[Figure: a single processor P attached to Memory]

Sequential Consistency: A multiprocessor memory model

• Memory operations complete in program order

• A Write becomes instantly visible to all processors

[Figure: processors P1, P2, . . ., Pn attached to a single Memory]

Weaker Memory Models

• Sequential Consistency : intuitive and strong memory model, but..

– Does not allow many architectural optimizations

• Weaker memory models :

– Memory operations can occur out of order

– Allows for more architectural optimizations to enable significant performance gain

• Many real processors allow weaker memory models, e.g. Sun Ultra 4, Alpha, PowerPC, Intel etc.

An Example Weaker Memory Model: SPARC Total Store Order (TSO)

• The presence of local caches + write buffers + out of order memory accesses

• Performance vs. programming complexity

[Figure: processors P1, P2, . . ., Pn attached to Memory]

Memory Model Verification Problem

[Figure: several CPUs attached directly to Mem = several CPU+Cache nodes attached to Mem over a snooping bus]

Why are informal methods insufficient?

• Danger of using incorrect optimizations

– uniprocessor opt may not be legal for multiprocessors

• Danger of incorrect implementations of legal optimizations

• Concurrency - informal methods inadequate

• Memory system semantics are complex and non-intuitive

– more so for weaker memory models

An optimization : fine for uni-processor...

P1:
  F1 := 1
  R1 := F2

Writes have higher latencies than reads

A Simple Optimization : Let Read of F2 bypass write of F1

Works fine for uni-processor machines

… but not so for multiprocessors

P1:
  F1 := 1
  R1 := F2
  if (R1 == 0) critical section

P2:
  F2 := 1
  R2 := F1
  if (R2 == 0) critical section

Many optimizations in uni-processor designs not applicable for multiprocessors

If Read bypasses Write then both P1 and P2 in critical section !!
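The two-process example can be checked by brute force. The following is a minimal sketch, not part of the original talk: it enumerates every interleaving of P1 and P2 on a single atomic memory, first with program order kept and then with each read moved ahead of its write. The helper names (`interleavings`, `outcomes`) are assumptions made for illustration, and swapping the two operations is only a crude stand-in for a read bypassing a buffered write.

```python
from itertools import combinations

# Each operation is (kind, location, register); registers are local to a process.
P1 = [("write", "F1", None), ("read", "F2", "R1")]
P2 = [("write", "F2", None), ("read", "F1", "R2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def outcomes(p1, p2):
    """Run every interleaving on one atomic memory; collect final (R1, R2) pairs."""
    results = set()
    for schedule in interleavings(p1, p2):
        memory = {"F1": 0, "F2": 0}
        regs = {}
        for kind, loc, reg in schedule:
            if kind == "write":
                memory[loc] = 1          # both programs only ever write 1
            else:
                regs[reg] = memory[loc]
        results.add((regs["R1"], regs["R2"]))
    return results

# Program order kept: the read follows the write in each process.
sc = outcomes(P1, P2)
# "Read bypasses write", modeled by swapping the two operations of each process.
bypass = outcomes(P1[::-1], P2[::-1])

print("program order kept:  ", sorted(sc))
print("read bypasses write: ", sorted(bypass))
```

The outcome R1 == 0 and R2 == 0 is the one that lets both processes enter the critical section; it appears only in the second set.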

Our main example: A Symmetric Multiprocessor (SMP) bus

[Figure: three CPU$ (CPU with cache) nodes and Memory on a coherent snooping bus]

Problem studied: how can the CPU designer

- specify desired orderings of reads and writes

- verify the implementation for adherence (in appearance)

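The ‘Utah Runway Bus Model’ (URM)

[Figure: URM block diagram with Client0 and Client1 (labels b, a), each holding cache lines, plus Host and Broadcast, connected by the Runout, Runin, Noncoh, and Coh_chans channels]

How test model-checking works

[Figure: the URM diagram again, with Client0 and Client1 (labels b, a), Host, and Broadcast]

- Drive memory system model using test automata

- See if error-state(s) reached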

Deriving Test-automata

• Assume that memory-systems do not decode ‘data’ and use addresses only in = and != tests

• Establish Limited Address Theorems for the chosen memory model (PO in our case)

– for an interesting class of programs, examining all two-address programs is sufficient

• List all possible violations over 1- and 2-addresses

• Abstract these violations into test-automata

• Test automata

– are sound

– completeness results under investigation

– found effective in practice


An Illustrative Example

P1: A := 1; A := 2; A := 3; .... A := k

P2: X1 := A; X2 := A; X3 := A; .... Xk := A

Violation: there exists some i, j s.t. j < i /\ X(i) < X(j)

[Figure: test automata for P1 and P2. P1 has transitions wr(0), wr(1), wr(1); P2 has transitions rd(0), rd(0), rd(1), rd(1) and an Error state. Slide annotations: "Suppose the observed executions are:" and "Then a_i are:"]

- Achieves the effect of k = infinity

- Considers all interleavings
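The automaton idea on this slide can be sketched as a small reachability search. This is a minimal sketch under stated assumptions, not the URM model from the talk: the two toy memory systems (`serial_memory` and the deliberately buggy `stale_replica_memory`), the state names, and the function `error_reachable` are all hypothetical. The writer automaton writes 0 any number of times and then switches to writing 1; the reader automaton flags an error if it reads 1 and later reads 0.

```python
from collections import deque

def serial_memory():
    """One shared cell. Returns (initial state, write function, read function)."""
    def write(state, value):
        return [value]
    def read(state):
        return [(state, state)]            # (value returned, next memory state)
    return 0, write, read

def stale_replica_memory():
    """Deliberately buggy: writes reach only the primary copy, reads may use either."""
    def write(state, value):
        primary, replica = state
        return [(value, replica)]
    def read(state):
        primary, replica = state
        return [(primary, state), (replica, state)]
    return (0, 0), write, read

def error_reachable(memory):
    """Breadth-first search over (writer state, reader state, memory state)."""
    init, write, read = memory
    start = ("zeros", "wait", init)
    seen, frontier = {start}, deque([start])
    while frontier:
        w, r, m = frontier.popleft()
        if r == "error":
            return True
        succs = []
        if w == "zeros":                   # keep writing 0 ...
            succs += [("zeros", r, m2) for m2 in write(m, 0)]
        # ... or (non-deterministic switch) write 1 from now on
        succs += [("ones", r, m2) for m2 in write(m, 1)]
        for value, m2 in read(m):
            if value == 1:
                r2 = "seen1"
            else:                          # reading 0 after a 1 is the violation
                r2 = "error" if r == "seen1" else "wait"
            succs.append((w, r2, m2))
        for s in succs:
            if s not in seen:
                seen.add(s)
                frontier.append(s)
    return False

print("serial memory reaches error:", error_reachable(serial_memory()))
print("stale replica reaches error:", error_reachable(stale_replica_memory()))
```

Because the writer may stay in its first state for any number of steps, the finite search covers write sequences of every length, which is the sense in which the automaton achieves the effect of k = infinity.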


All one-address PO violations (1-3 of 5)

(1) P_i: ... rd(a, v) …
    P_j: ...
    v is not the initial value T of a, and a is not written anywhere

(2) P_i: ... rd(a,v1) … rd(a,v2) ...
    P_j: … wr(a,v2) … wr(a,v1) ...
    P_i and P_j could be the same process

(3) P_i: ... rd(a,v) … rd(a,T) ...
    P_j: … wr(a,v) …
    P_i and P_j could be the same process


...All one-address PO violations (4-5 of 5)

(4) P_i: ... rd(a,v) … wr(a,v) ...
    v is not the initial value T of a, and a is not written before being read

(5) P_i: ... wr(a,v) … rd(a,T) ...

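Verification of Program Ordering for all one-address programs

[Figure: URM diagram (Client0, Client1 with labels b, a, Host, Broadcast) driven by test automata with error states E1, E2; an edge labeled x means Write(A,x); other edges are labeled Read(A,-)]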


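Verification of Program Ordering for all two-address programs

[Figure: URM diagram (Client0, Client1 with labels b, a, Host, Broadcast) driven by test automata; an edge labeled x,y means Write(A,x), Write(B,y); other edges are labeled Read(A,-) and Read(B,-)]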


[Figure: URM diagram with Client0 and Client1 (labels b, a), Host, and Broadcast]

Can run demo of this model-checking on this laptop if there is interest (need to boot linux..)


How to Handle Weaker Memory Models?

• Identify new rules (if necessary)

• Create new tests and test model-checking automata

• Consider memory operations other than read and write

– fences, barriers etc.


Weaker memory models - relaxations

• Partial-PO Relaxation :

– Relaxes PO partially - WR is always relaxed

– May relax WA in various orders

– examples : SPARC V9 TSO, PSO, Intel Pentium Pro, Processor consistency etc.

• Complete-PO relaxation :

– Relaxes PO completely

– typically does not relax WA

– examples : SPARC V9 RMO, Alpha, PowerPC, Release Consistency


SPARC Total Store Order (TSO)

• Relaxes Write-Read (WR) sub-rule

• Also relaxes WA in a subtle way

[Figure: processors P1, P2, . . ., Pn attached to Memory]


TSO and PSO Specification (Ghughal, MS ‘99)

• TSO = (UPO,RO,WO,RW,WA-S,MB-WR)

• PSO = (UPO,RO,RW,WA-S,MB-WR,MB-WW)

• A series of “pure tests” are defined to test for individual ordering rules (e.g. RO) in isolation


Motivation for Pure Tests

P1: A := 1; A := 2; A := 3; .... A := k

P2: X1 := A; X2 := A; X3 := A; .... Xk := A

Violation: there exists some i, j s.t. j < i /\ X(i) < X(j)

[Figure: test automata for P1 and P2. P1 has transitions wr(0), wr(1), wr(1); P2 has transitions rd(0), rd(0), rd(1), rd(1) and an Error state]

A visit to Error-state tells that ONE OF RO, WO, RW, or WR is violated -- NOT which one


Steps for creating test-automata

• Identify violation in the setting of a simple example

• Argue that regardless of WO, this violates RO

• Generalize error to execution sequence (next slide)

• Build test automata (following that)

P1: A := 1; A := 2;

P2: X := A; Y := A; Z := A;

Initialize all variables to 0

Finally A==2; X==Z==1 or 2, Y==1 or 2, Y!=X


Pure Test for RO over the same operand (WO is NOT assumed!)

P1: A := 1; A := 2; .. A := k

P2: X[1] := A; X[2] := A; .. X[k] := A

Condition : for all p, q, r : p < q < r : X[p] = X[r] => X[p] = X[q] = X[r]

• New Test for RO

• Formally proved that this (+ all others) are pure tests

• Completeness still open.
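The condition of this pure test is easy to state as a check over the values P2 observed. A minimal sketch, assuming the observations are collected in a Python list `x` (0-based, so `x[0]` holds X[1]); the function name `ro_same_operand_holds` is hypothetical:

```python
from itertools import combinations

def ro_same_operand_holds(x):
    """Check: for all p < q < r, X[p] == X[r] implies X[p] == X[q] == X[r]."""
    # combinations() over an increasing range yields exactly the triples p < q < r.
    for p, q, r in combinations(range(len(x)), 3):
        if x[p] == x[r] and x[q] != x[p]:
            return False
    return True

print(ro_same_operand_holds([0, 1, 1, 2]))   # each value forms one unbroken run
print(ro_same_operand_holds([1, 2, 1]))      # 1 reappears after a 2 was read
```

Note that the check never compares values by magnitude, only by equality, which matches the statement that WO is not assumed.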


Test Automata for RO on Same Operand Obtained Assuming Data Independence

[Figure: automaton P1 (states s0, s1) with transitions A := 0, A := 0, and A := 1; automaton P2 (states s0, s1, s2) with read(A) transitions and the recording transitions X1 := read(A), X2 := read(A), X3 := read(A); one transition is marked "Non-deterministic switch"]

Safety Property : Finally, X1 = X3 = 1 => X1 = X2 = X3


Pure Test for RO - different operands - WO not assumed

P1: B := 1

P2: X := A; Y := B; C := Y;

P3: U := C; A := U;

• Initially all vars == 0

• Finally all vars == 1 => In P2, B must have been read before A


Pure Test for RO - different operands - WO not assumed

P1: B := 1; B := 2; .. B := k

P2: Y[0] := 0;
    X[1] := A; Y[1] := B; C := Y[1];
    X[2] := A; Y[2] := B; C := Y[2];
    …
    X[k] := A; Y[k] := B; C := Y[k];

P3: U[1] := C; A := U[1];
    U[2] := C; A := U[2];
    ...
    U[k] := C; A := U[k];

Condition : Exists i : 1 <= i <= k Forall j : 0<=i : X[i] != Y[j]

“X is getting ahead of all the Y’s so far” -- need to examine a history of values...

Turn into OR accumulator via data-independence!


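Test Automata for RO (diff opnds)

[Figure: automaton P1 (state s0) with transitions B := 0 and B := 1; automaton P2 (states s0, s1) with transitions "read(A); t := read(B); C := t;", "x := read(A); t := read(B); C := t;", and "read(A); t := read(B); C := t; y := y \/ t;"; automaton P3 (state s0) with transition "u := read(C); A := u;"]

Safety Property : (P2 in S1 /\ y==0) => x==0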


A Pure Test for (UPO, WO)

P1                 P2
A := 1;            B := 1;
B := 2;            A := 2;
U[1] := B;         V[1] := A;
...                ...
...                ...
A := 2k-1;         B := 2k-1;
B := 2k;           A := 2k;
U[k] := B          V[k] := A

Condition : forall i,j : U[i] is even or U[i] >= 2j or V[j] is even or V[j] >= 2i

will need 2 bits for test model-checking automata
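The condition can be evaluated directly on recorded values. A minimal sketch, assuming U and V are supplied as Python lists where `u[i-1]` holds U[i] and `v[j-1]` holds V[j] (the slide's indices are 1-based); the function name `upo_wo_holds` is hypothetical:

```python
def upo_wo_holds(u, v):
    """forall i, j: U[i] even or U[i] >= 2j or V[j] even or V[j] >= 2i."""
    for i in range(1, len(u) + 1):
        for j in range(1, len(v) + 1):
            ui, vj = u[i - 1], v[j - 1]
            ok = ui % 2 == 0 or ui >= 2 * j or vj % 2 == 0 or vj >= 2 * i
            if not ok:
                return False
    return True

# Each process reads back its own latest (even) write: no violation.
print(upo_wo_holds([2, 4], [2, 4]))
# P1 reads B == 1 after writing B := 2 while P2 reads A == 1 after writing A := 2.
print(upo_wo_holds([1], [1]))
```

In the second call each process sees the other's first write land after its own second write, which cannot be explained by any single order of the four writes that respects both program orders.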


Test Automata for UPO,WO (diff opnds)

[Figure: automaton P1 (states s0, s1) with transitions "A := 01; B := 00; read(B);", "A := 01; B := 00; u := read(B);", and "A := 11; B := 10; read(B);"; automaton P2 (states s0, s1) with transitions "B := 01; A := 00; read(A);", "B := 01; A := 00; v := read(A);", and "B := 11; A := 10; read(A);"]

(P1 and P2 in their S1) => u is even \/ u = 11 \/ v is even \/ v = 11


WA-Relaxation of TSO

• Execution valid under TSO but not under SC.

• WA Relaxation - captured by new rule WA-S

P1: A := 1; C := 1; U := C; X := B;

P2: B := 1; D := 1; V := D; Y := A;

Initially A = B = C = D = U = V = X = Y = 0;

Finally, A = B = C = D = U = V = 1; X = Y = 0;
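The claim that this execution is valid under TSO but not under SC can be checked by exhaustive exploration. The following is a minimal sketch under a loudly stated assumption: TSO is modeled operationally as one FIFO store buffer per processor, with a load first consulting its own processor's buffer. That operational model is a common textbook rendering and is not the rule-based specification used in the talk. U, V, X, Y are treated as processor-local registers, and `final_registers` is a hypothetical name.

```python
from collections import deque

# Programs from the slide: "st" stores 1 to a location, "ld" loads into a register.
P1 = (("st", "A"), ("st", "C"), ("ld", "C", "U"), ("ld", "B", "X"))
P2 = (("st", "B"), ("st", "D"), ("ld", "D", "V"), ("ld", "A", "Y"))
PROGS = (P1, P2)

def final_registers(use_store_buffers):
    """Explore all schedules; return the set of final (U, V, X, Y) values."""
    # State: (program counters, store buffers, memory, registers), all hashable.
    start = ((0, 0), ((), ()), (), ())
    seen, frontier, finals = {start}, deque([start]), set()
    while frontier:
        pcs, bufs, mem, regs = frontier.popleft()
        memory, registers = dict(mem), dict(regs)
        succs = []
        for p in (0, 1):
            if bufs[p]:
                # Drain the oldest buffered store of processor p to memory.
                m2 = dict(memory)
                m2[bufs[p][0]] = 1
                b2 = list(bufs)
                b2[p] = bufs[p][1:]
                succs.append((pcs, tuple(b2), tuple(sorted(m2.items())), regs))
            if pcs[p] == len(PROGS[p]):
                continue
            op = PROGS[p][pcs[p]]
            pc2 = list(pcs)
            pc2[p] += 1
            if op[0] == "st" and use_store_buffers:
                b2 = list(bufs)
                b2[p] = bufs[p] + (op[1],)
                succs.append((tuple(pc2), tuple(b2), mem, regs))
            elif op[0] == "st":
                m2 = dict(memory)
                m2[op[1]] = 1
                succs.append((tuple(pc2), bufs, tuple(sorted(m2.items())), regs))
            else:
                # A load sees its own buffered store first, else shared memory.
                value = 1 if op[1] in bufs[p] else memory.get(op[1], 0)
                r2 = dict(registers)
                r2[op[2]] = value
                succs.append((tuple(pc2), bufs, mem, tuple(sorted(r2.items()))))
        if not succs:                      # both programs done, buffers empty
            finals.add(tuple(registers[r] for r in "UVXY"))
        for s in succs:
            if s not in seen:
                seen.add(s)
                frontier.append(s)
    return finals

target = (1, 1, 0, 0)                      # U = V = 1, X = Y = 0
print("reachable with store buffers:", target in final_registers(True))
print("reachable without them (SC): ", target in final_registers(False))
```

Each processor sees its own write to C or D early (U = V = 1) while the other processor has not yet seen the write to A or B, which is the sense in which a write is atomic only with respect to other processors.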


Rule of WA-S

• WA :

– a write becomes visible to all processors “instantly”

– atomic set of events - all write events

• WA-S :

– a write becomes visible to all other processors “instantly”

– atomic set of events - all write events in stores of other processors


Memory Barriers - membar

• A special type of memory operation which enforces additional PO constraints as required

• could select a particular sub-rule of PO

• example : R1 := A; membar LoadStore; B := R2;

• also known as fences etc.


Rule of MB (MemBar)

• Define one event corresponding to each membar instruction

Pi
L : membar storestore

• Enforce orderings between all relevant operations before and after membar

• Consists of 4 sub-rules : MB-RR, MB-RW, MB-WW, MB-WR


What about Rule of MB?

• only orders some reads and writes with respect to each other

• Hence, could use test for sub-rules of PO to check for various sub-rules of MB

– e.g. (CMP, RO) could be used for (CMP,MB-RR)

• will need a MB-RR instruction between every two reads in Tests, but only 1 in test model-checking automata


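Test Automata for (CMP, MB-RR)

[Figure: automaton P1 (states s0, s1) with transitions A := 0 and A := 1; automaton P2 (states s0, s1, s2) with read(A) transitions and the recording transitions "X1 := read(A) ; MB-RR", "X2 := read(A) ; MB-RR", "X3 := read(A) ; MB-RR"; one transition is marked "Non-deterministic switch"]

Finally, X1 = X3 => X1 = X2 = X3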


New Tests and Test model-checking automata

• Also, developed new tests for

– CMP, UPO, RO - checks for read ordering between two different operands

– CMP, UPO, WO - checks for write ordering

– CMP, UPO, CON - checks for coherency

• Developed corresponding test automata

• Provided formal proofs for each test and the test model-checking automata abstraction


How to handle models such as Alpha weaker memory model?

• Relaxes Program Order completely

• Orderings guaranteed by explicit membar when needed

• Write atomicity is relaxed in a manner similar to TSO

• Specification as (UPO, ROO, WA-S, MB, MB-WW)

• Tests developed for the same


Memory Systems Verified

• Verified three memory systems using VIS for SC

• Also did last example in Promela and SPIN / PV

• Serial Memory : a simple memory system

• Lazy Caching : A Simple bus-based protocol involving queues

• Runway-PA8000 Memory system : A fairly complex commercial multiprocessor memory system from Hewlett Packard (the URM)


Experimental Results (VIS)

WA              States   Bdds   Time
Serial Memory   212k     10k    34 sec
Lazy Caching    1.9M     513k   59 min
PA8000          985k     1.7M   40 hrs

PO              States   Bdds   Time
Serial Memory   7k       7k     9 sec
Lazy Caching    7.8M     306k   36 min
PA8000          953k     1.6M   27 hrs


SC verification of the HP/Runway model: Promela, with SPIN and PV (#states)

        Spin        PV
PO-1    56K         2794
PO-2    > 5M/DNF    11M
SC-1    499K        7880
SC-2a   > 5M/DNF    5.9M
SC-2b   > 4M/DNF    574K


Experimental Results for TSO operational model (in VIS)

TA             States   Bdds   Time
CMP, RO, WO    3k       4k     < 1 s
CMP, PO        6.5M     50k    2:38 s
CMP, WR        6.5k     50k    1:25 s
CMP, RW        6.5k     50k    3:02 s
CMP, RO        10k      2k     1:25 s

Green is Pass ; Red is Fail (as expected for TSO)


Memory Ordering Rules defined

• Seq Consistency (completeness under assumptions still being refined - Nalumasu’98)

• Total Store Ordering

• Partial Store Ordering

• IBM 370

• Alpha

– Cross-checked our definitions for agreement against “Litmus Tests” defined in the Alpha Architecture Manual

Test automata available for these (Ghughal MS’99)


Conclusions

• Test model-checking :

– A new methodology for memory model verification

– could be effectively integrated in typical design cycle

• Weaker memory model verification :

– Developed test model-checking methodology for weaker memory models

– new rules to specify weaker memory models

– new tests and test automata


Possible Future Work

• Explore instruction - data consistency : e.g. self-modifying code in shared data space?

• Explore other memory operations than read, write and barriers : e.g. load-locked, store-conditional, TLB-flush, Cache flush, Cache Sync,...

• Explore issues related with explicit instruction fetching in shared data space

• Use to study various speculative memory access schemes

• Work towards completeness


Possible Future Work (contd…)

• Explore multi-threaded executions

• Explore speculation schemes for memory accesses

• How to build reliable test-automaton coding techniques in the framework of a model-checker

• Use specialized reachability analysis procedures

• Exploit symmetry - “semantic ones” too

• See how it actually fits in design-flow

• Explore possibilities to derive test-benches

BACKUP SLIDES


Formal Methods for Shared Memory System Design

[Figure: taxonomy with the labels Verification, Provably-correct Synthesis, Theorem-proving, Model-checking, Protocol, Low-level concerns (e.g. deadlocks, progress,...), Higher-level concerns (e.g. shared memory consistency models), Finite-state Reachability]


Example of problems due to “unexpected msgs”

[Figure: Cache Ctrlr and Directory Ctrlr exchanging Req and Ack; "Another Req" arrives and the response is marked "? ? ?"]

Usually don’t know what to say… ...saying nothing causes deadlock!


Overview of Synthesis Method

[Figure: Cache Ctrlr with states I, E and Dir Ctrlr with states F, E, shown twice; the second pair exchanges Req and (N)ack]


Model-checking Efficiency

Protocol   N   states / time (low level)   states / time (high level)

Mig 2 23,164 / 2.8 54 / 0.14 235 / 0.48 965 / 0.5

Inv 2 193,389 / 19.23 546 / 0.64 18,686 / 18.4


A Generic Example

[Figure: processes P, Q, R with the communication actions Q!a, R!b, P?x, Q!c, R?y]


Async Implementation of Example (i)

[Figure: processes P, Q, R with the communication actions Q!a, R!b, P?x, R?y, Q!c, implemented with R!!b and Q!!a]

1 msg buffer location for Ack/Nack


Async Implementation of Example (ii)

[Figure: processes P, Q, R with the communication actions Q!a, R!b, P?x, R?y, Q!c, implemented with R!!b, Q!!a, Q!!c, P!!ack]

Progress Buffer

Proctype broadcast

• Read <trans, from, val, addr> from runout0 and runout1

• Send to runin0, runin1, and runin2

These are coherent transactions being acquired by all clients and the host (bus controller)
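The behavior of the broadcast process can be sketched outside Promela. A minimal sketch, assuming the channels are modeled as Python deques; the names `Transaction`, `sender` (standing in for the slide's `from`, which is a Python keyword), `broadcast_step`, and the transaction type "rd" are hypothetical. Where the Promela process would choose non-deterministically between the runout channels, this sketch simply takes the first non-empty one.

```python
from collections import deque, namedtuple

# One bus transaction: <trans, from, val, addr> in the slide's notation.
Transaction = namedtuple("Transaction", "trans sender val addr")

def broadcast_step(runout, runin):
    """Move one pending transaction from a runout queue to every runin queue."""
    for queue in runout:
        if queue:
            t = queue.popleft()
            for sink in runin:             # all clients and the host snoop it
                sink.append(t)
            return True
    return False                           # nothing pending on any runout queue

runout = [deque(), deque([Transaction("rd", 1, 0, "a")])]   # runout0, runout1
runin = [deque(), deque(), deque()]                          # runin0..runin2
moved = broadcast_step(runout, runin)
print(moved, [len(q) for q in runin])
```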

Proctype host

• Wait for a transaction to appear in runin, and its coherency responses to appear in coh_chans

• Decide whether to

– merely ingest the data being put-out (cache to cache copy happening), or

– to supply the data (in which case determine return ‘mode’ - Shared_return or Private_return)

Proctype client

• One client decides to behave like “P”

• The other behaves like “Q” (test automata for POS)

• P and Q first check for all “read moves”

– done while line is readable (shared, private-clean, or dirty)

• Then check for write possibilities ...

..write possibilities: Proctype client

• Write possibilities checked only if similar request not already made

• Also line must be writeable (private-clean or dirty)

• Line invalid => request via rsp or rp

• Clients snoop transactions (including own) via channel runin

• Every client also sends coherency response to host

..Proctype client

• Either host or another client supplies data through one of the non_coh inputs (if client, it supplies via c2cw - if host, it supplies via hdr)

• Correct sharing status also indicated when data is sent out.
