Correctness Checking Concepts and Tools for High Performance Computing · Ganesh Gopalakrishnan, Wei-Fan Chiang, and Alexey Solovyev

Post on 15-Aug-2020



URL: http://www.cs.utah.edu/fv

Supported by NSF awards SI2 (ACI-1148127), EAGER (CCF-1241849), Failure Resistant Systems (CCF-1255776), SRC Task 2426.001, NSF Medium (CCF-7298529), EAGER (CCF-1346756), and the SUPER Institute (for resilience research); special thanks to Microsoft for funding (2006-2010) on getting established in this area

Correctness Checking Concepts and Tools for High Performance Computing

Ganesh Gopalakrishnan, Wei-Fan Chiang, and Alexey Solovyev
School of Computing, University of Utah, Salt Lake City, UT 84112


Correctness Checking Concepts and Tools for High Performance Computing … or, Bugs: Black Ice on the Road to Exascale

!3

Relevant Personal History

• PhD from Stony Brook, 1981 (when Mead/Conway: VLSI, Hennessy: MIPS, Patterson: Sparc)

• Joined Utah in 1986

• Taught OS as my second class

• Wrote to Tanenbaum

• Got Minix on a 5.25-inch floppy

• Class did kernel hacking on dual-5.25-inch-floppy IBM PCs

• …

• Worked on various aspects of concurrency:

• Self-timed circuit design

• Pipelined processor verification

• Cache coherence protocols

• Shared memory consistency models

• Feel privileged to work on formal methods for concurrency in service of HPC!

!4

We have been fortunate to have built some tools in support of HPC FV

• Let us do some demos … so that you have some context for what I’ll be saying later

!5

DEMO: Dynamic-execution-based debugging of MPI programs

Tool name: ISP

!6

DEMO: Symbolic-execution-based debugging of sequential programs and GPU CUDA programs

Tool name: GKLEE

!7

Brief History of Why We Are Where We Are

• CISC machines (70s)

• Pipelining —> clock frequency growth + compilers

• Hennessy and Patterson outdid the industry using “Mead and Conway” VLSI design

• Pipelining —> better ILP use

• Moore’s law: afforded pipelining tricks

• Dennard’s law: allowed voltage scaling

• POWER DENSITY stayed the same

• Ridiculous frequencies, diminishing ILP returns, Moore alive, Dennard dying already…

• Tejas project write-off — NY Times

• Dick Lyon, Charles Leiserson, Guy Blelloch, … were right ALL ALONG!

!8


Smart phones describe the shape of things to come in HPC

(from Adve, http://www.cs.berkeley.edu/~bodik/ASPLOS13/Symposium/sarita-adve-12-asplos-pc-symposium.pdf)

!13

Today’s main HPC Mantra

• “Maximize the volume of computational results obtained per Watt”

!14

But what about correctness?

!15

[Slide of images: industrial flares (Wikipedia); Nvidia; NASA; Uintah (SCI Group, Utah); Marsden Lab, UCSD]

Today’s main HPC Mantra

• “Maximize the volume of computational results obtained per Watt”!

• Subject to Moore’s and Dennard’s laws

!16

(Courtesy Bob Colwell)


So, how prepared are we to debug Heterogeneous Concurrent Systems?

!18

What is the young many-core world already facing?

• Multiple heterogeneous cores

• Multiple concurrency models

• Data races

• Dead Dennard —> dark silicon

• Bit flips

• Floating-point uncertainties

• An OFTEN clueless (about concurrency) programming community — will provide examples

• WE JUST DON’T KNOW HOW TO CALIBRATE THE RISKS

!19


Power-6 Studies

!28

Getting Resilience Ground Truths (Power-6)

!29

Power-7 Studies

!30

A “feel” of HPC Correctness

• Constant pressure: the “most science per dollar”

• Many dimensions of correctness

• HPC explores unknown aspects of the sciences

• Algorithmic approximations are often made

• Growing heterogeneity in HPC platforms

• Floating-point representation is inexact

• “Bit flips”

• Correctness training is lacking

• Busy enough doing science

• Finding and keeping “Pi men” is difficult

• Always makes sense to switch to the latest HW

• Often the poorest documented

!31

[Image: RIKEN K machine. Diagram (after Lazowska): HPC in service of the Sciences]


!32 (Our twist)

[Diagram (after Lazowska): FM in service of HPC, HPC in service of the Sciences. Image: RIKEN K machine]

A heterogeneity-induced bug (Berzins, Meng, Humphrey, XSEDE’12)

!33

P = 0.421874999999999944488848768742172978818416595458984375
C = 0.0026041666666666665221063770019327421323396265506744384765625

Compute: floor(P / C)

Xeon: P / C = 161.9999…, so floor(P / C) = 161

Xeon Phi: P / C = 162, so floor(P / C) = 162

Expecting 161 msgs; sent 162 msgs



Authors’ fix: used double precision for P / C. Question: is there a more deft solution? More important question: what exactly went wrong? (The XSEDE’12 authors moved along…)
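The slide’s numbers are enough to replay the boundary case. Below is a minimal sketch in Python (assuming the decimal literals above are the exact stored double values, as their lengths suggest): exact rational arithmetic shows the true quotient of the stored values sits just below 162, so a platform or precision that rounds the division up and one that keeps the quotient below 162 legitimately disagree about the floor.

```python
import math
from fractions import Fraction

# The two values from the slide, written as exact decimal expansions
# of IEEE-754 doubles.
P = 0.421874999999999944488848768742172978818416595458984375
C = 0.0026041666666666665221063770019327421323396265506744384765625

# Exact rational quotient of the stored values: it lies just below 162.
exact = Fraction(P) / Fraction(C)
print(math.floor(exact))  # 161 (the mathematically exact floor)

# A correctly-rounded double division lands within an ulp of 162, so
# floor(P / C) can flip between 161 and 162 depending on how a given
# platform / precision rounds the quotient.
print(P / C)
```

This hints at why the slide asks for a deft solution: the quotient is within rounding distance of an integer, so any floor-of-a-ratio formulation is fragile; carrying the message count in integer arithmetic would sidestep the boundary entirely.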

Resilience

• ~7B transistors per GPU (and many billions for CPUs), and a ton of memory

• 10^18 transistors throbbing at GHz for weeks

• Some bit changes MUST be unplanned ones

• In HPC, results combine more (than, say, in the “cloud”)

• “Bit flip” is a catch-all term for:

• High speed-variability of devices coupled with DVFS jitter

• Local hot spots that develop, aging chip electronics

• Particle strikes

• Energy is the main currency

• Some energy-saving “games” must be played (this invites bit flips)

• Dynamic slack detection, followed by lowering voltage + frequency

• One PNNL study (Kevin Baker): 36 kW -> 18 kW

!36

Our Position (1)

• Despite “bit flips” and such, it is amply clear that sequential and concurrency bugs still ought to be our principal focus

• They occur quite predictably (unlike bit flips)

• They are something we can control (and eliminate in many cases)

!37

Our Position (2)

• Unless we can debug in the small, there is NO WAY we can debug in the large

!38

Our Position (3)

• There are SO MANY instances where experts are getting it wrong — and spreading the wrong ideas

!39

Example-1

• IBM documentation: “If you debug your MPI program under zero Eager Limit (buffering for MPI sends), then adding additional buffering does not cause new deadlocks”

• It can

!40
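The claim can be made concrete with a small model. Below is a hedged sketch: a hypothetical 3-rank program (not taken from IBM’s documentation) run through a toy matching simulator rather than real MPI. Under a zero eager limit every send is a rendezvous, which forces the wildcard receive to match deterministically and the program always completes; once sends can buffer, the wildcard may match the “wrong” sender, and a deadlock appears.

```python
ANY = -1  # models MPI_ANY_SOURCE

# Hypothetical 3-rank program: per-rank lists of ("send", dest) / ("recv", src).
PROCS = [
    [("recv", 2), ("send", 1)],    # rank 0
    [("recv", ANY), ("recv", 0)],  # rank 1
    [("send", 1), ("send", 0)],    # rank 2
]

def can_deadlock(buffered):
    """DFS over every interleaving; True iff some execution deadlocks.
    (Each (sender, dest) pair is unique here, so a frozenset suffices
    to represent the in-flight messages.)"""
    init = (tuple(0 for _ in PROCS), frozenset())
    seen, stack = {init}, [init]
    while stack:
        pcs, mail = stack.pop()
        moves = []
        for p, prog in enumerate(PROCS):
            if pcs[p] == len(prog):
                continue
            op, arg = prog[pcs[p]]
            bump = lambda *ranks: tuple(
                pc + (i in ranks) for i, pc in enumerate(pcs))
            if op == "send" and buffered:
                # Eager send: deposit the message and return immediately.
                moves.append((bump(p), mail | {(p, arg)}))
            elif op == "send":
                # Zero eager limit: rendezvous with a matching posted recv.
                q = arg
                if pcs[q] < len(PROCS[q]):
                    qop, qarg = PROCS[q][pcs[q]]
                    if qop == "recv" and qarg in (ANY, p):
                        moves.append((bump(p, q), mail))
            else:
                # Receive from the in-flight (buffered) messages.
                for src, dst in mail:
                    if dst == p and arg in (ANY, src):
                        moves.append((bump(p), mail - {(src, dst)}))
        if not moves and any(pcs[p] < len(PROCS[p]) for p in range(len(PROCS))):
            return True  # stuck before completion: deadlock
        for st in moves:
            if st not in seen:
                seen.add(st)
                stack.append(st)
    return False

print(can_deadlock(buffered=False))  # zero eager limit: no deadlock here
print(can_deadlock(buffered=True))   # buffering enables one
```

With rendezvous sends, rank 1’s wildcard receive can only ever see rank 2’s message (rank 0 is still blocked in its own receive); with buffering, rank 0’s send can arrive first, the wildcard grabs it, and the subsequent “recv from 0” waits forever.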


Example-2

• A reduction kernel given as an early-chapter example in a recent CUDA book is broken

• Reason: it assumes that CUDA atomic-add has “fence” semantics

• An erratum has been issued on the book’s website

!42

Example-3

• A work-stealing queue in “GPU Gems” is incorrect

• Reason: it assumes “store-store” ordering between two sequentially issued stores (a fence was needed in between)

!43

Feature of GPU programming

• Programmers face concurrency corner-cases quite frequently

• As opposed to (e.g.) OS where low-level concurrency is usually hidden within the kernel

!44

Example-4

• If your code ran correctly in FORTRAN, it will also run correctly in C

!45

Example-4 invalidated

!46

Example-5: Simple questions can’t be answered by today’s tools

Does this program deadlock? (Yes.)

!47


Match

!49

Example-6: Does Warp-Synchronous Programming Help Avoid a __syncthreads?

__global__ void kernel(int *x, int *y) {
    int index = threadIdx.x;
    y[index] = x[index] + y[index];
    if (index != 63 && index != 31)
        y[index+1] = 1111;
}

Initially: x[i] == y[i] == i. Warp size = 32.

Expected answer: 0, 1111, 1111, …, 1111, 64, 1111, …

!50

The hardware schedules these instructions in “warps” (SIMD groups).

However, this “warp view” often appears to be lost.

E.g., when compiling with optimizations. The new answer: 0, 2, 4, 6, 8, …

But if you read the CUDA documentation carefully, you notice you had to use a C volatile (volatile x[], y[] …) to restore the “correct” answers!

But the ability to “rescue the correct answer” is no longer a guarantee (since CUDA 5.0).

So you really trust your compilers?

• Talk to Prof. John Regehr of Utah

• Csmith: differential testing of compilers

• The single most impressive compiler-testing work (IMHO) in recent times

• Has found goof-ups at -O0 for short programs

• Many bugs around C volatiles

• Learned that NOTHING is known about how compilers (ought to) treat floating-point

!55

Without swift action, the “din” of the blind leading the blind will sow more confusion

!56

Some threads offer advice ranging from “use volatiles” (was in early CUDA documentation; gone since 5.0); others advocate the use of __syncthreads (barriers), or querying device registers to learn the warp size:

https://devtalk.nvidia.com/default/topic/512376/ https://devtalk.nvidia.com/default/topic/499715/ https://devtalk.nvidia.com/default/topic/382928/

And there are several threads that simply discuss this issue:

https://devtalk.nvidia.com/default/topic/632471 https://devtalk.nvidia.com/default/topic/377816/

There isn’t a comprehensive picture of the dos and don’ts, and WHY!

Discussions on “warp-synchronous” code

!57

https://devtalk.nvidia.com/default/topic/499715/are-threads-of-a-warp-really-sync-/?offset=2

Example-8: Do GPUs obey coherence? (Coherence = per-location sequential consistency)

• Ask me after the talk……. :)

• We are stress testing real GPUs

• and finding things out!

• (work is inspired by Bill Collier who called it “X-raying real machines” in his famous RAPA book)

!58

Our (humble) suggestion

• There is NO WAY the complexity of anything can be conquered without mathematics

• The complexity of debugging needs the “mathematics of debugging” — the true mathematics of Software Engineering

• i.e., formal methods

• We must develop the “right kind” of formal methods:

• Coexist with the grubby

• Take on problems in context

• Win practitioner friends early — and KEEP THEM

!59

What is hard about HPC Concurrency?

• The scale of concurrency and the number of interacting APIs

• MPI-2, MPI-3, OpenMP, CUDA, OpenCL, OpenACC, PThreads, use of non-blocking data structures, dynamic scheduling

• Each API thinks it “owns” the machine

• The exposure of the everyday programmer to low-level concurrency is a worrisome reality!

• Memory consistency models matter

• They govern visibility across threads / fences

• Yet they are very poorly specified / understood

• Compiler optimizations — not even basic studies exist

!60

Is there a role for Formal Methods?

• Yes indeed!

• For instance, why is it that microprocessors don’t do “Pentium FDIV” any more?

• Processor ALUs have only become even more complex

• Answer: formal methods get serious use in the industry

• Intel: Symbolic Trajectory Evaluation

• Others: similar methods

• Processors get FV to varying degrees for other subsystems

• E.g., cache coherence (at a protocol level)

!61

Is there a role for Formal Methods?

• Yes indeed!

• There is a fascinating array of correctness challenges

• Very little involvement from the mainstream CS side

• Lack of exposure, limited interactions across departments

• We need “cool show-pieces” to draw students to HPC research…

!62

An example “cool project”: the Utah Pi “cluster”, built by PhD students at Utah

“Mo” Mohammed Saeed Al Mahfoudh and Simone Atzeni

(Under $500; runs MPI, Habanero Java, …)

!63

Anyone wanting to do software testing for concurrency must slay two exponentials

!64


A FM grab-bag for anyone wanting to debug concurrent programs

• Slay the input-space exponential using:

• Symbolic execution

• Slay the schedule-space exponential by:

• Not jiggling schedules that are happens-before equivalent

!66
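To put numbers on the schedule-space exponential, and on why happens-before equivalence slays it: two threads that each execute n purely local steps admit C(2n, n) interleavings, yet if none of the steps conflict they are all happens-before equivalent, so one representative schedule suffices. A small illustration (the figures are computed here, not taken from the talk):

```python
from math import comb

# Number of interleavings of two threads with n independent local steps each.
for n in (5, 10, 20):
    print(n, comb(2 * n, n))
# n = 20 already gives 137,846,528,820 schedules; since the steps do not
# conflict, every one of them is happens-before equivalent, and a dynamic
# verifier like ISP need only explore a single representative per class.
```

Communicating or conflicting steps split this single class into more, but the number of classes, not the number of raw interleavings, is what a happens-before-aware tool must explore.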

Not Exploring HB-Equivalent Schedules

!67

A FM grab-bag for anyone wanting to debug concurrent programs

• Concepts in the fuel tank must include:

• Lamport’s “happens-before”

• Define concurrency coverage using it

• Design active-testing methods that systematically explore the schedule space

• Memory consistency models

• Data races and how to detect them

• Symbolic execution

• Helps achieve input-space coverage

!68

Overview of our (active) projects

• HPC concurrency

• Dynamic verification methods for MPI: CACM, Dec 2011

• GPU data-race checking: PPoPP’12, SC’12, SC’14

• Floating-point

• Finding inputs that cause the highest relative error (“sour spot search”): PPoPP’14

• Detecting and root-causing non-determinism

• Pruner project at LLNL: combined static / dynamic analysis for OpenMP race checking

• System resilience

• We have developed an LLVM-level fault injector called KULFI

• Using coalesced stack trace graphs to highlight behavioral differences

• Our main focus continues to be correctness tools for HPC concurrency

!69

Biggest Gain due to Formal Methods: Conceptual Cohesion

• Example: helps understand that concurrency and sequential abstractions tessellate

• Helps understand that sequential == deterministic

• Helps understand data races as breaking the sequential contract

!70

Concurrency and Sequential Abstractions Tessellate !

!71

Fine-grained concurrency of transistor-level circuits

Sequential view of Boolean functions (gates)

Concurrent state machines using gates and flops

Sequential program abstractions (e.g., ISA)

Shared-memory or message-passing based parallelism

Solving A x = B

Why Fixate on Data Races?

• Race-freedom is the key assumption that enables sequential thinking

• Sequential almost always means deterministic

• In an out-of-order CPU, nothing is sequential

• Yet we think of assembly programs as “sequential”

• Only because they yield deterministic results

• Create hazards (say, in a time-sensitive way)

• Then we lose this sequential / deterministic abstraction

• Parallel programming almost always strives to produce sequential, i.e. deterministic, outcomes

!72

Races and Race-Free Generalized

!73

[The same abstraction ladder as above.] Critical races give gates that spike (broken Boolean abstraction).

Races and Race-Free Generalized

!74

[The same abstraction ladder as above.] Races between clocks and data break the sequential abstraction.

Races and Race-Free Generalized

!75

[The same abstraction ladder as above.] Data races break sequential consistency (unsynchronized interleavings matter).

Results on UT Lonestar Benchmarks

!76

Results on UIUC Parboil Benchmarks

!77

Uintah: A Scalable Computational Framework for Multi-physics Problems

• Under continuous development over the past decade

• Scalability to 700K CPU cores is possible now

• ~1M LOC or more

• Modular extensibility to accommodate GPUs and Xeon Phis

• Partitions concerns:

• The app developer writes sequential apps

• The infrastructure developer tunes / improves performance

!78

Uintah Organization

!79

[Diagram: application packages (ICE, MPM, ARCHES) sit atop a runtime system (SimulationController, Load Balancer, Scheduler) that executes an abstract directed acyclic task graph (tasks t1 … t13).]

Case Study: Data Warehouse Error

Collect coalesced call-paths leading to DW::put(); diff across two scheduler versions to isolate the bug

[Figure: the two coalesced stack trace graphs being diffed; the node text is not recoverable from this transcript.]

Conceptual view of Uintah equipped with a monitoring network (future work)

!81

[Diagram labels: Static Analysis of DWH and Scheduler; Automaton Learning from Traces; Tailor Learning for Hybrid Concurrency Events; Build Cross-Layer Monitoring Hierarchies; Derive System Control Invariants to Document + Debug via CSTG; Hierarchical Active Testing and Monitoring using Standardized Interfaces. Scheduler-queue and CSTG node details are not recoverable from this transcript.]

Concluding Remarks

• Slaying bugs in HPC is essential for exascale

• We need a mix of empirical and formal methods

• Formal helps with concurrency coverage

• Formal helps write clear, unambiguous, and validated specs

• and helps educate sure-footedly

!82

thanks!

• www.cs.utah.edu/fv

• Thanks to my former students who have taught me everything I know about FV and its relevance in the industry

!83

The rest of the talk

• Some results in GPU Data Race Checking

• Demo of Symbolic Execution and GKLEE !

• Data Race Detection in GPU Programs!

• Computational Frameworks!

• Uintah !

• How Coalesced Stack Trace Graphs help debug !

• Other projects : Floating-Point Correctness and System Resilience !

• Concluding Remarks

!84


The key to data race checking

• For the most part, CUDA code is synchronized via barriers (__syncthreads)

• Thus, explore a “canonical” interleaving, hoping to detect the “first race” if there is any race

!86
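As a sketch of this idea (hypothetical code, not GKLEE itself): log each thread’s shared-memory accesses along one canonical schedule of a barrier interval, then flag any pair of threads that touch the same location with at least one write. Run on the access pattern of the Example-6 kernel (y[i] = x[i] + y[i], plus a guarded write to y[i+1]), it reports exactly the “classic” races between threads i and i+1.

```python
def kernel_accesses(tid):
    """Shared-memory accesses of thread tid in the Example-6 kernel."""
    acc = [("r", ("x", tid)), ("r", ("y", tid)), ("w", ("y", tid))]
    if tid not in (31, 63):  # the kernel's guard at warp boundaries
        acc.append(("w", ("y", tid + 1)))
    return acc

def races_on_canonical_schedule(nthreads):
    """Pairs of threads whose accesses conflict within one barrier interval."""
    log = {t: kernel_accesses(t) for t in range(nthreads)}
    races = set()
    for t1 in range(nthreads):
        for t2 in range(t1 + 1, nthreads):
            for k1, loc1 in log[t1]:
                for k2, loc2 in log[t2]:
                    if loc1 == loc2 and "w" in (k1, k2):
                        races.add((t1, t2, loc1))
    return races

races = races_on_canonical_schedule(64)
# Thread i's write of y[i+1] conflicts with thread i+1's read and write of
# y[i+1], for every i except the guarded lanes 31 and 63: 62 racing pairs.
print(len(races))  # 62
```

One access log per thread from a single canonical run is enough here: since the accesses themselves do not depend on the schedule, any conflicting pair found this way is a race under some interleaving.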

Interleaving exploration

!87

For example: if the green dots are local thread actions, then all schedules that arrive at the “cut line” are equivalent!

Finding Representative Interleavings

!88


GKLEE Examines Canonical Schedule

!90

Instead of considering all schedules and all potential races…

Consider JUST THIS SINGLE CANONICAL SCHEDULE!

Folk theorem (proved in our paper): “We will find A RACE if there is ANY race”!

An Example with Two Data Races

!92

The “classic race”: threads i and i+1 race.

The “porting race” (not explained in any CUDA book as a race): the evaluation order between divergent warps is unspecified.

GKLEE’s steps

!95

!96

Symbolic Execution

Compute conflicts and solve for races.

GKLEE of PPoPP 2012

!98

[GKLEE tool flow: C++ CUDA programs with symbolic variable declarations are compiled by LLVM-GCC into LLVM byte-code instructions, which feed a symbolic analyzer and scheduler with error monitors. Outputs: deadlocks, data races, concrete test inputs, bank conflicts, warp divergences, non-coalesced accesses; plus test cases that provide high coverage and can be run on HW.]

The advantages of a symbolic-execution based GPU Race Checker: Produces concrete witnesses!

!99

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
    const int threadPos = ((threadIdx.x & (~63)) >> 0)
                        | ((threadIdx.x & 15) << 2)
                        | ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
        unsigned data4 = d_Data[pos];
        ...
        addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
    }
    __syncthreads();
    ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

“GKLEE: Is there a race?”

The advantages of a symbolic-execution based GPU Race Checker: Produces concrete witnesses!

!100

GKLEE: “Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.”

GKLEE of SC’12 introduced the idea of parametric flows; the GKLEEp tool was introduced (mostly for race checking)

!101

Idea behind parametric flows: capitalize on thread symmetry. Divide behavior into flow equivalence classes!

!102


!103

Keep two symbolic threads per flow-group and race-check per flow.

Where Race-Checking Happens under Parameterized Flows!

!104

Race checking happens both intra-flow and inter-flow.

Favorable Results

!107

Yet, Unfavorable Results often…

!108

When parametric flow division happens inside a loop, we can get an exponential number of flows.

Symbolic Execution with Static Analysis!(SC’14 accepted paper)

!109

How SESA works

• A static analysis pass marks how variables affected by each flow in a barrier interval may affect the generation of addresses in the next barrier interval

!110

[Diagram: successive barrier intervals.] There are two classes of flows: (1) flows that modify a global or shared variable that flows into control predicates or array-indexing positions, and (2) flows that don’t do so. Within the next barrier interval, “OR” the flows of the second kind (the green ones in the figure) into one flow.
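A toy rendering of that marking step (hypothetical flow and variable names; the real SESA analysis works on LLVM bitcode): a flow stays a separate parametric flow only if some variable it writes reaches a control predicate or an array-index position in the next barrier interval; the rest get OR-ed into one merged flow.

```python
# Hypothetical per-flow write sets for one barrier interval.
flows = {
    "flow1": {"s"},    # writes a shared var later used as an array index
    "flow2": {"tmp"},  # purely local effect
    "flow3": {"k"},    # writes a var later used in a branch predicate
}
# Variables the *next* barrier interval uses for indexing / branching.
next_interval = {"index_vars": {"s"}, "predicate_vars": {"k"}}

def classify(flows, nxt):
    """Split flows into those that must stay separate and those to merge."""
    tainted = nxt["index_vars"] | nxt["predicate_vars"]
    keep = {f for f, writes in flows.items() if writes & tainted}
    merge = set(flows) - keep
    return keep, merge

keep, merge = classify(flows, next_interval)
print(sorted(keep))   # flows that stay separate parametric flows
print(sorted(merge))  # flows OR-ed into a single merged flow
```

In this sketch flow1 and flow3 survive as separate flows while flow2 is merged, which is exactly the pruning that keeps the per-loop flow count from exploding.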

SESA Results

• We have been able to run SESA on:

• The Lonestar benchmarks (UT)

• The Parboil benchmarks (UIUC)

• It scales well and finds issues:

• Races

• Out-of-bounds accesses

• The tool is being integrated into Eclipse

!111