Correctness Checking Concepts and Tools for High Performance Computing · Ganesh Gopalakrishnan, Wei-Fan Chiang, and Alexey Solovyev

Post on 15-Aug-2020



URL: http://www.cs.utah.edu/fv

Supported by NSF awards SI2 (ACI-1148127), EAGER (CCF-1241849), Failure Resistant Systems (CCF-1255776), SRC Task 2426.001, NSF Medium (CCF-7298529), EAGER (CCF-1346756), and the SUPER Institute (for resilience research); special thanks to Microsoft for funding (2006-2010) on getting established in this area

Correctness Checking Concepts and Tools for High Performance Computing

Ganesh Gopalakrishnan, Wei-Fan Chiang, and Alexey Solovyev
School of Computing, University of Utah, Salt Lake City, UT 84112


Correctness Checking Concepts and Tools for High Performance Computing … or, Bugs: Black Ice on the Road to Exascale

!3

Relevant Personal History

• PhD from Stony Brook, 1981 (when Mead/Conway: VLSI, Hennessy: MIPS, Patterson: Sparc)

• Joined Utah in 1986

• Taught OS as my second class

• Wrote to Tanenbaum

• Got Minix on a 5.25-inch floppy

• Class did kernel hacking on dual-5.25-inch-floppy IBM PCs

• …

• Worked on various aspects of concurrency:

• Self-timed circuit design

• Pipelined processor verification

• Cache coherence protocols

• Shared memory consistency models

• Feel privileged to work on formal methods for concurrency in service of HPC!

!4

We have been fortunate to have built some tools in support of HPC FV

• Let us do some demos … so that you have some context for what I’ll be saying later

!5

DEMO: Dynamic-execution-based debugging of MPI programs

Tool name: ISP

!6

DEMO: Symbolic-execution-based debugging of sequential programs and GPU CUDA programs

Tool name: GKLEE

!7

Brief History of Why We Are Where We Are

• CISC machines (70s)

• Pipelining —> clock frequency growth + compilers

• Hennessy and Patterson outdid the industry using “Mead and Conway” VLSI design

• Pipelining —> better ILP use

• Moore’s law: afforded pipelining tricks

• Dennard’s law: allowed voltage scaling

• POWER DENSITY stayed the same

• Ridiculous frequencies, diminishing ILP returns, Moore alive, Dennard dying already…

• Tejas project write-off — NY Times

• Dick Lyon, Charles Leiserson, Guy Blelloch, … were right ALL ALONG!

!8


Smart phones describe the shape of things to come in HPC

(from Adve, http://www.cs.berkeley.edu/~bodik/ASPLOS13/Symposium/sarita-adve-12-asplos-pc-symposium.pdf)

!13

Today’s main HPC Mantra

• “Maximize the volume of computational results obtained per Watt”

!14

But what about correctness?

!15

[Slide of images: industrial flares (Wikipedia); Nvidia; NASA; Uintah (SCI Group, Utah); Marsden Lab, UCSD]

Today’s main HPC Mantra

• “Maximize the volume of computational results obtained per Watt”!

• Subject to Moore’s and Dennard’s laws

!16

(Courtesy Bob Colwell)


So, how prepared are we to debug Heterogeneous Concurrent Systems?

!18

What is the young many-core world already facing?

• Multiple heterogeneous cores

• Multiple concurrency models

• Data races

• Dead Dennard —> dark silicon

• Bit flips

• Floating-point uncertainties

• An OFTEN clueless (about concurrency) programming community — will provide examples

• WE JUST DON’T KNOW HOW TO CALIBRATE THE RISKS

!19


Power-6 Studies

!28

Getting Resilience Ground Truths (Power-6)

!29

Power-7 Studies

!30

A “feel” of HPC Correctness

• Constant pressure: the “most science per dollar”

• Many dimensions of correctness

• HPC explores unknown aspects of the sciences

• Algorithmic approximations are often made

• Growing heterogeneity in HPC platforms

• Floating-point representation is inexact

• “Bit flips”

• Correctness training is lacking

• Busy enough doing science

• Finding and keeping “Pi men” is difficult

• Always makes sense to switch to the latest HW

• Often the poorest documented

!31

[Image: RIKEN K machine. Diagram (after Lazowska): HPC in service of the Sciences]


!32 (Our twist)

[Diagram (after Lazowska): FM in service of HPC, HPC in service of the Sciences. Image: RIKEN K machine]

A heterogeneity-induced bug (Berzins, Meng, Humphrey, XSEDE’12)

!33

P = 0.421874999999999944488848768742172978818416595458984375
C = 0.0026041666666666665221063770019327421323396265506744384765625

Compute: floor(P / C)

Xeon: P / C = 161.9999…, so floor(P / C) = 161

Xeon Phi: P / C = 162, so floor(P / C) = 162

Expecting 161 msgs; sent 162 msgs



Authors’ fix: used double precision for P / C. Question: is there a more deft solution? More important question: what exactly went wrong? (The XSEDE’12 authors moved along…)
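The slide’s numbers are enough to replay the boundary case. Below is a minimal sketch in Python (assuming the decimal literals above are the exact stored double values, as their lengths suggest): exact rational arithmetic shows the true quotient of the stored values sits just below 162, so a platform or precision that rounds the division up and one that keeps the quotient below 162 legitimately disagree about the floor.

```python
import math
from fractions import Fraction

# The two values from the slide, written as exact decimal expansions
# of IEEE-754 doubles.
P = 0.421874999999999944488848768742172978818416595458984375
C = 0.0026041666666666665221063770019327421323396265506744384765625

# Exact rational quotient of the stored values: it lies just below 162.
exact = Fraction(P) / Fraction(C)
print(math.floor(exact))  # 161 (the mathematically exact floor)

# A correctly-rounded double division lands within an ulp of 162, so
# floor(P / C) can flip between 161 and 162 depending on how a given
# platform / precision rounds the quotient.
print(P / C)
```

This hints at why the slide asks for a deft solution: the quotient is within rounding distance of an integer, so any floor-of-a-ratio formulation is fragile; carrying the message count in integer arithmetic would sidestep the boundary entirely.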

Resilience

• ~7B transistors per GPU (and many billions for CPUs), and a ton of memory

• 10^18 transistors throbbing at GHz for weeks

• Some bit changes MUST be unplanned ones

• In HPC, results combine more (than, say, in the “cloud”)

• “Bit flip” is a catch-all term for:

• High speed-variability of devices coupled with DVFS jitter

• Local hot spots that develop, aging chip electronics

• Particle strikes

• Energy is the main currency

• Some energy-saving “games” must be played (this invites bit flips)

• Dynamic slack detection, followed by lowering voltage + frequency

• One PNNL study (Kevin Baker): 36 kW -> 18 kW

!36

Our Position (1)

• Despite “bit flips” and such, it is amply clear that sequential and concurrency bugs still ought to be our principal focus

• They occur quite predictably (unlike bit flips)

• They are something we can control (and eliminate in many cases)

!37

Our Position (2)

• Unless we can debug in the small, there is NO WAY we can debug in the large

!38

Our Position (3)

• There are SO MANY instances where experts are getting it wrong — and spreading the wrong ideas

!39

Example-1

• IBM documentation: “If you debug your MPI program under zero Eager Limit (buffering for MPI sends), then adding additional buffering does not cause new deadlocks”

• It can

!40
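The claim can be made concrete with a small model. Below is a hedged sketch: a hypothetical 3-rank program (not taken from IBM’s documentation) run through a toy matching simulator rather than real MPI. Under a zero eager limit every send is a rendezvous, which forces the wildcard receive to match deterministically and the program always completes; once sends can buffer, the wildcard may match the “wrong” sender, and a deadlock appears.

```python
ANY = -1  # models MPI_ANY_SOURCE

# Hypothetical 3-rank program: per-rank lists of ("send", dest) / ("recv", src).
PROCS = [
    [("recv", 2), ("send", 1)],    # rank 0
    [("recv", ANY), ("recv", 0)],  # rank 1
    [("send", 1), ("send", 0)],    # rank 2
]

def can_deadlock(buffered):
    """DFS over every interleaving; True iff some execution deadlocks.
    (Each (sender, dest) pair is unique here, so a frozenset suffices
    to represent the in-flight messages.)"""
    init = (tuple(0 for _ in PROCS), frozenset())
    seen, stack = {init}, [init]
    while stack:
        pcs, mail = stack.pop()
        moves = []
        for p, prog in enumerate(PROCS):
            if pcs[p] == len(prog):
                continue
            op, arg = prog[pcs[p]]
            bump = lambda *ranks: tuple(
                pc + (i in ranks) for i, pc in enumerate(pcs))
            if op == "send" and buffered:
                # Eager send: deposit the message and return immediately.
                moves.append((bump(p), mail | {(p, arg)}))
            elif op == "send":
                # Zero eager limit: rendezvous with a matching posted recv.
                q = arg
                if pcs[q] < len(PROCS[q]):
                    qop, qarg = PROCS[q][pcs[q]]
                    if qop == "recv" and qarg in (ANY, p):
                        moves.append((bump(p, q), mail))
            else:
                # Receive from the in-flight (buffered) messages.
                for src, dst in mail:
                    if dst == p and arg in (ANY, src):
                        moves.append((bump(p), mail - {(src, dst)}))
        if not moves and any(pcs[p] < len(PROCS[p]) for p in range(len(PROCS))):
            return True  # stuck before completion: deadlock
        for st in moves:
            if st not in seen:
                seen.add(st)
                stack.append(st)
    return False

print(can_deadlock(buffered=False))  # zero eager limit: no deadlock here
print(can_deadlock(buffered=True))   # buffering enables one
```

With rendezvous sends, rank 1’s wildcard receive can only ever see rank 2’s message (rank 0 is still blocked in its own receive); with buffering, rank 0’s send can arrive first, the wildcard grabs it, and the subsequent “recv from 0” waits forever.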


Example-2

• A reduction kernel given as an early-chapter example in a recent CUDA book is broken

• Reason: it assumes that CUDA atomic-add has “fence” semantics

• An erratum has been issued on the book’s website

!42

Example-3

• A work-stealing queue in “GPU Gems” is incorrect

• Reason: it assumes “store-store” ordering between two sequentially issued stores (a fence was needed in between)

!43

Feature of GPU programming

• Programmers face concurrency corner-cases quite frequently

• As opposed to (e.g.) OS where low-level concurrency is usually hidden within the kernel

!44

Example-4

• If your code ran correctly in FORTRAN, it will also run correctly in C

!45

Example-4 invalidated

!46

Example-5: Simple questions can’t be answered by today’s tools

Does this program deadlock? (Yes.)

!47


Match

!49

Example-6: Does Warp-Synchronous Programming Help Avoid a __syncthreads?

__global__ void kernel(int *x, int *y) {
    int index = threadIdx.x;
    y[index] = x[index] + y[index];
    if (index != 63 && index != 31)
        y[index+1] = 1111;
}

Initially: x[i] == y[i] == i. Warp size = 32.

Expected answer: 0, 1111, 1111, …, 1111, 64, 1111, …

!50

The hardware schedules these instructions in “warps” (SIMD groups).

However, this “warp view” often appears to be lost.

E.g., when compiling with optimizations. The new answer: 0, 2, 4, 6, 8, …

But if you read the CUDA documentation carefully, you notice you had to use a C volatile (volatile x[], y[] …) to restore the “correct” answers!

But the ability to “rescue the correct answer” is no longer a guarantee (since CUDA 5.0).

So you really trust your compilers?

• Talk to Prof. John Regehr of Utah

• Csmith: differential testing of compilers

• The single most impressive compiler-testing work (IMHO) in recent times

• Has found goof-ups at -O0 for short programs

• Many bugs around C volatiles

• Learned that NOTHING is known about how compilers (ought to) treat floating-point

!55

Without swift action, the “din” of the blind leading the blind will sow more confusion

!56

Some threads offer advice ranging from “use volatiles” (was in early CUDA documentation; gone since 5.0); others advocate the use of __syncthreads (barriers), or querying device registers to learn the warp size:

https://devtalk.nvidia.com/default/topic/512376/ https://devtalk.nvidia.com/default/topic/499715/ https://devtalk.nvidia.com/default/topic/382928/

And there are several threads that simply discuss this issue:

https://devtalk.nvidia.com/default/topic/632471 https://devtalk.nvidia.com/default/topic/377816/

There isn’t a comprehensive picture of the dos and don’ts, and WHY!

Discussions on “warp-synchronous” code

!57

https://devtalk.nvidia.com/default/topic/499715/are-threads-of-a-warp-really-sync-/?offset=2

Example-8: Do GPUs obey coherence? (Coherence = per-location sequential consistency)

• Ask me after the talk……. :)

• We are stress testing real GPUs

• and finding things out!

• (work is inspired by Bill Collier who called it “X-raying real machines” in his famous RAPA book)

!58

Our (humble) suggestion

• There is NO WAY the complexity of anything can be conquered without mathematics

• The complexity of debugging needs the “mathematics of debugging” — the true mathematics of Software Engineering

• i.e., formal methods

• We must develop the “right kind” of formal methods:

• Coexist with the grubby

• Take on problems in context

• Win practitioner friends early — and KEEP THEM

!59

What is hard about HPC Concurrency?

• The scale of concurrency and the number of interacting APIs

• MPI-2, MPI-3, OpenMP, CUDA, OpenCL, OpenACC, PThreads, use of non-blocking data structures, dynamic scheduling

• Each API thinks it “owns” the machine

• The exposure of the everyday programmer to low-level concurrency is a worrisome reality!

• Memory consistency models matter

• They govern visibility across threads / fences

• Yet they are very poorly specified / understood

• Compiler optimizations — not even basic studies exist

!60

Is there a role for Formal Methods?

• Yes indeed!

• For instance, why is it that microprocessors don’t do “Pentium FDIV” any more?

• Processor ALUs have only become even more complex

• Answer: formal methods get serious use in the industry

• Intel: Symbolic Trajectory Evaluation

• Others: similar methods

• Processors get FV to varying degrees for other subsystems

• E.g., cache coherence (at a protocol level)

!61

Is there a role for Formal Methods?

• Yes indeed!

• There is a fascinating array of correctness challenges

• Very little involvement from the mainstream CS side

• Lack of exposure, limited interactions across departments

• We need “cool show-pieces” to draw students to HPC research…

!62

An example “cool project”: the Utah Pi “cluster”, built by PhD students at Utah

“Mo” Mohammed Saeed Al Mahfoudh and Simone Atzeni

(Under $500; runs MPI, Habanero Java, …)

!63

Anyone wanting to do software testing for concurrency must slay two exponentials

!64


A FM grab-bag for anyone wanting to debug concurrent programs

• Slay the input-space exponential using:

• Symbolic execution

• Slay the schedule-space exponential by:

• Not jiggling schedules that are happens-before equivalent

!66
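To put numbers on the schedule-space exponential, and on why happens-before equivalence slays it: two threads that each execute n purely local steps admit C(2n, n) interleavings, yet if none of the steps conflict they are all happens-before equivalent, so one representative schedule suffices. A small illustration (the figures are computed here, not taken from the talk):

```python
from math import comb

# Number of interleavings of two threads with n independent local steps each.
for n in (5, 10, 20):
    print(n, comb(2 * n, n))
# n = 20 already gives 137,846,528,820 schedules; since the steps do not
# conflict, every one of them is happens-before equivalent, and a dynamic
# verifier like ISP need only explore a single representative per class.
```

Communicating or conflicting steps split this single class into more, but the number of classes, not the number of raw interleavings, is what a happens-before-aware tool must explore.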

Not Exploring HB-Equivalent Schedules

!67

A FM grab-bag for anyone wanting to debug concurrent programs

• Concepts in the fuel tank must include:

• Lamport’s “happens-before”

• Define concurrency coverage using it

• Design active-testing methods that systematically explore the schedule space

• Memory consistency models

• Data races and how to detect them

• Symbolic execution

• Helps achieve input-space coverage

!68

Overview of our (active) projects

• HPC concurrency

• Dynamic verification methods for MPI: CACM, Dec 2011

• GPU data-race checking: PPoPP’12, SC’12, SC’14

• Floating-point

• Finding inputs that cause the highest relative error (“sour spot search”): PPoPP’14

• Detecting and root-causing non-determinism

• Pruner project at LLNL: combined static / dynamic analysis for OpenMP race checking

• System resilience

• We have developed an LLVM-level fault injector called KULFI

• Using coalesced stack trace graphs to highlight behavioral differences

• Our main focus continues to be correctness tools for HPC concurrency

!69

Biggest Gain due to Formal Methods: Conceptual Cohesion

• Example: helps understand that concurrency and sequential abstractions tessellate

• Helps understand that sequential == deterministic

• Helps understand data races as breaking the sequential contract

!70

Concurrency and Sequential Abstractions Tessellate !

!71

Fine-grained concurrency of transistor-level circuits

Sequential view of Boolean functions (gates)

Concurrent state machines using gates and flops

Sequential program abstractions (e.g., ISA)

Shared-memory or message-passing based parallelism

Solving A x = B

Why Fixate on Data Races?

• Race-freedom is the key assumption that enables sequential thinking

• Sequential almost always means deterministic

• In an out-of-order CPU, nothing is sequential

• Yet we think of assembly programs as “sequential”

• Only because they yield deterministic results

• Create hazards (say, in a time-sensitive way)

• Then we lose this sequential / deterministic abstraction

• Parallel programming almost always strives to produce sequential, i.e. deterministic, outcomes

!72

Races and Race-Free Generalized

!73

[The same abstraction ladder as above.] Critical races give gates that spike (broken Boolean abstraction).

Races and Race-Free Generalized

!74

[The same abstraction ladder as above.] Races between clocks and data break the sequential abstraction.

Races and Race-Free Generalized

!75

[The same abstraction ladder as above.] Data races break sequential consistency (unsynchronized interleavings matter).

Results on UT Lonestar Benchmarks

!76

Results on UIUC Parboil Benchmarks

!77

Uintah: A Scalable Computational Framework for Multi-physics Problems

• Under continuous development over the past decade

• Scalability to 700K CPU cores is possible now

• ~1M LOC or more

• Modular extensibility to accommodate GPUs and Xeon Phis

• Partitions concerns:

• The app developer writes sequential apps

• The infrastructure developer tunes / improves performance

!78

Uintah Organization

!79

[Diagram: application packages (ICE, MPM, ARCHES) sit atop a runtime system (SimulationController, Load Balancer, Scheduler) that executes an abstract directed acyclic task graph (tasks t1 … t13).]

Case Study: Data Warehouse Error

Collect coalesced call-paths leading to DW::put(); diff across two scheduler versions to isolate the bug

[Figure: the two coalesced stack trace graphs being diffed; the node text is not recoverable from this transcript.]

Conceptual view of Uintah equipped with a monitoring network (future work)

!81

[Diagram labels: Static Analysis of DWH and Scheduler; Automaton Learning from Traces; Tailor Learning for Hybrid Concurrency Events; Build Cross-Layer Monitoring Hierarchies; Derive System Control Invariants to Document + Debug via CSTG; Hierarchical Active Testing and Monitoring using Standardized Interfaces. Scheduler-queue and CSTG node details are not recoverable from this transcript.]

Concluding Remarks

• Slaying bugs in HPC is essential for exascale

• We need a mix of empirical and formal methods

• Formal helps with concurrency coverage

• Formal helps write clear, unambiguous, and validated specs

• and helps educate sure-footedly

!82

thanks!

• www.cs.utah.edu/fv

• Thanks to my former students who have taught me everything I know about FV and its relevance in the industry

!83

The rest of the talk

• Some results in GPU Data Race Checking

• Demo of Symbolic Execution and GKLEE !

• Data Race Detection in GPU Programs!

• Computational Frameworks!

• Uintah !

• How Coalesced Stack Trace Graphs help debug !

• Other projects : Floating-Point Correctness and System Resilience !

• Concluding Remarks

!84


The key to data race checking

• For the most part, CUDA code is synchronized via barriers (__syncthreads)

• Thus, explore a “canonical” interleaving, hoping to detect the “first race” if there is any race

!86
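As a sketch of this idea (hypothetical code, not GKLEE itself): log each thread’s shared-memory accesses along one canonical schedule of a barrier interval, then flag any pair of threads that touch the same location with at least one write. Run on the access pattern of the Example-6 kernel (y[i] = x[i] + y[i], plus a guarded write to y[i+1]), it reports exactly the “classic” races between threads i and i+1.

```python
def kernel_accesses(tid):
    """Shared-memory accesses of thread tid in the Example-6 kernel."""
    acc = [("r", ("x", tid)), ("r", ("y", tid)), ("w", ("y", tid))]
    if tid not in (31, 63):  # the kernel's guard at warp boundaries
        acc.append(("w", ("y", tid + 1)))
    return acc

def races_on_canonical_schedule(nthreads):
    """Pairs of threads whose accesses conflict within one barrier interval."""
    log = {t: kernel_accesses(t) for t in range(nthreads)}
    races = set()
    for t1 in range(nthreads):
        for t2 in range(t1 + 1, nthreads):
            for k1, loc1 in log[t1]:
                for k2, loc2 in log[t2]:
                    if loc1 == loc2 and "w" in (k1, k2):
                        races.add((t1, t2, loc1))
    return races

races = races_on_canonical_schedule(64)
# Thread i's write of y[i+1] conflicts with thread i+1's read and write of
# y[i+1], for every i except the guarded lanes 31 and 63: 62 racing pairs.
print(len(races))  # 62
```

One access log per thread from a single canonical run is enough here: since the accesses themselves do not depend on the schedule, any conflicting pair found this way is a race under some interleaving.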

Interleaving exploration

!87

For example: if the green dots are local thread actions, then all schedules that arrive at the “cut line” are equivalent!

Finding Representative Interleavings

!88


GKLEE Examines Canonical Schedule

!90

Instead of considering all schedules and all potential races…

Consider JUST THIS SINGLE CANONICAL SCHEDULE!

Folk theorem (proved in our paper): “We will find A RACE if there is ANY race”!

An Example with Two Data Races

!92

The “classic race”: threads i and i+1 race.

The “porting race” (not explained in any CUDA book as a race): the evaluation order between divergent warps is unspecified.

GKLEE’s steps

!95

!96

Symbolic Execution

Compute conflicts and solve for races.

GKLEE of PPoPP 2012

!98

[GKLEE tool flow: C++ CUDA programs with symbolic variable declarations are compiled by LLVM-GCC into LLVM byte-code instructions, which feed a symbolic analyzer and scheduler with error monitors. Outputs: deadlocks, data races, concrete test inputs, bank conflicts, warp divergences, non-coalesced accesses; plus test cases that provide high coverage and can be run on HW.]

The advantages of a symbolic-execution based GPU Race Checker: Produces concrete witnesses!

!99

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
    const int threadPos = ((threadIdx.x & (~63)) >> 0)
                        | ((threadIdx.x & 15) << 2)
                        | ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
        unsigned data4 = d_Data[pos];
        ...
        addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
    }
    __syncthreads();
    ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

“GKLEE: Is there a race?”

The advantages of a symbolic-execution based GPU Race Checker: Produces concrete witnesses!

!100

GKLEE: “Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.”

GKLEE of SC’12 introduced the idea of parametric flows; the GKLEEp tool was introduced (mostly for race checking)

!101

Idea behind parametric flows: capitalize on thread symmetry. Divide behavior into flow equivalence classes!

!102


!103

Keep two symbolic threads per flow-group and race-check per flow.

Where Race-Checking Happens under Parameterized Flows!

!104

Race checking happens both intra-flow and inter-flow.

Favorable Results

!107

Yet, Unfavorable Results often…

!108

When parametric flow division happens inside a loop, we can get an exponential number of flows.

Symbolic Execution with Static Analysis!(SC’14 accepted paper)

!109

How SESA works

• A static analysis pass marks how variables affected by each flow in a barrier interval may affect the generation of addresses in the next barrier interval

!110

[Diagram: successive barrier intervals.] There are two classes of flows: (1) flows that modify a global or shared variable that flows into control predicates or array-indexing positions, and (2) flows that don’t do so. Within the next barrier interval, “OR” the flows of the second kind (the green ones in the figure) into one flow.
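A toy rendering of that marking step (hypothetical flow and variable names; the real SESA analysis works on LLVM bitcode): a flow stays a separate parametric flow only if some variable it writes reaches a control predicate or an array-index position in the next barrier interval; the rest get OR-ed into one merged flow.

```python
# Hypothetical per-flow write sets for one barrier interval.
flows = {
    "flow1": {"s"},    # writes a shared var later used as an array index
    "flow2": {"tmp"},  # purely local effect
    "flow3": {"k"},    # writes a var later used in a branch predicate
}
# Variables the *next* barrier interval uses for indexing / branching.
next_interval = {"index_vars": {"s"}, "predicate_vars": {"k"}}

def classify(flows, nxt):
    """Split flows into those that must stay separate and those to merge."""
    tainted = nxt["index_vars"] | nxt["predicate_vars"]
    keep = {f for f, writes in flows.items() if writes & tainted}
    merge = set(flows) - keep
    return keep, merge

keep, merge = classify(flows, next_interval)
print(sorted(keep))   # flows that stay separate parametric flows
print(sorted(merge))  # flows OR-ed into a single merged flow
```

In this sketch flow1 and flow3 survive as separate flows while flow2 is merged, which is exactly the pruning that keeps the per-loop flow count from exploding.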

SESA Results

• We have been able to run SESA on:

• The Lonestar benchmarks (UT)

• The Parboil benchmarks (UIUC)

• It scales well and finds issues:

• Races

• Out-of-bounds accesses

• The tool is being integrated into Eclipse

!111