You have exascale problems?
◦ Load Balancing?
◦ Failure?
◦ Power Management?
My system software will solve these problems
System Software: It Slices, Dices, and makes Julienne Fries!
Coordinated checkpointing to the traditional parallel file system won’t scale
Checkpoint commit time approaches node MTBF => application efficiency drops quickly
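The efficiency drop can be illustrated with Young's first-order approximation for periodic checkpointing. This model is not taken from the slides; it is a rough sketch that ignores restart time and is only meaningful while the commit time is much smaller than the MTBF. The function names and the 10-minute commit time are hypothetical.

```python
import math


def young_interval(commit_time_s, mtbf_s):
    """Young's first-order optimal checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * commit_time_s * mtbf_s)


def first_order_efficiency(commit_time_s, mtbf_s):
    """Fraction of time spent on useful work at the optimal interval.

    Waste per unit time is modeled as delta/tau + tau/(2M); at the
    optimal tau this equals sqrt(2 * delta / M). The result is clamped
    at zero because the approximation breaks down once the commit time
    is comparable to the MTBF.
    """
    waste = math.sqrt(2.0 * commit_time_s / mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    commit = 600.0  # hypothetical 10-minute checkpoint commit
    for mtbf_hours in (24.0, 4.0, 1.0, 0.5):
        mtbf = mtbf_hours * 3600.0
        print(mtbf_hours, round(young_interval(commit, mtbf)),
              round(first_order_efficiency(commit, mtbf), 3))
```

Holding the commit time fixed and shrinking the MTBF shows the trend the slide describes: the modeled efficiency falls steadily and reaches zero once the two are of the same order.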
Example: Fault Tolerance
Each MPI process runs twice; the application fails only if both processes in a rank fail
Handle full MPI semantics at scale
rMPI: Replicated (not Ronco™) MPI
Ferreira, et al. SC 2011.
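To see why running each rank twice helps so much, the expected number of node failures an application survives can be computed directly. This is a simplified sketch and not the analysis from the rMPI paper: it assumes exactly two replicas per rank, failures that strike surviving nodes uniformly at random, and no repair or replacement of failed replicas. The function name is hypothetical.

```python
def expected_failures_to_interrupt(ranks):
    """Expected index of the node failure that kills a 2x-replicated job.

    The job is interrupted when both replicas of any one rank have failed.
    After k failures that all landed in different ranks, 2*ranks - k nodes
    are alive and k of them are the partners of failed nodes, so the next
    failure is harmless with probability (2*ranks - 2k) / (2*ranks - k).
    """
    expected = 0.0
    survive = 1.0  # P(no rank has lost both replicas after k failures)
    for k in range(ranks + 1):
        expected += survive
        alive = 2 * ranks - k
        harmless = 2 * ranks - 2 * k
        survive *= harmless / alive
    return expected


if __name__ == "__main__":
    for n in (1, 100, 10_000, 100_000):
        print(n, round(expected_failures_to_interrupt(n), 1))
```

Without replication the first node failure interrupts the job. Under these assumptions the replicated job absorbs a number of failures that grows roughly with the square root of the rank count, which is the benefit the doubled node count pays for.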
Your machine power budget and hardware acquisition budget (*)
Act now, and you’ll get twice the capacity computing functionality for FREE!
(*) plus contracting and granting
Only two low, low payments of…
Costs and benefits are really easy to understand
◦ Large and node-scalable improvement in system mean time to interrupt (MTTI)
◦ Using it as the primary fault tolerance technique means twice the power consumption on capability problems
◦ Buying twice the number of nodes is also quite painful
SC13 Panel: “Replication is too expensive… We [as a community] will have failed if we can't do better than that.” – Marc Snir
What are you trying to sell me?
Department of Computer Science
Everything’s A Nail
Why you don’t want system software to solve your problems (if you can help it)
Patrick G. [email protected]
April 22, 2014
Theorem
Every individual complete system-level solution to an application exascale problem is “too expensive” for some real workload
Rationale
◦ OS doesn’t know your application
◦ General solutions are expensive
◦ Specialized solutions have limited power or applicability
I have a hammer!
Save us, vendors!
◦ Adding reliability on the compute and control path is potentially hardware-intensive
◦ How much to pay in transistors, power, and $$?
◦ While stepping off the commodity price/performance curve…
Burst Buffers
◦ How much budget to spend on the I/O system?
◦ Memory is a scarce resource at exascale
◦ NVRAM and network bandwidth aren’t free in power
◦ Some nice recent work in this area
“Simple” Resilience Solutions
Idea: Each node checkpoints when most convenient and out of sync with other nodes
Benefit: get checkpointing off the peak B/W curve onto the sustained B/W curve
Has some (low) obvious costs, some less obvious costs
Asynchronous Checkpointing
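The peak versus sustained bandwidth argument comes down to simple arithmetic, sketched below. The function names and all numbers are hypothetical, and the sketch models only the bandwidth benefit; it does not capture any of the costs of checkpointing out of sync.

```python
def coordinated_peak_bandwidth(nodes, ckpt_bytes_per_node, commit_window_s):
    """Bandwidth needed when every node writes inside the same commit window."""
    return nodes * ckpt_bytes_per_node / commit_window_s


def staggered_sustained_bandwidth(nodes, ckpt_bytes_per_node, interval_s):
    """Bandwidth needed when writes are spread evenly over the whole interval."""
    return nodes * ckpt_bytes_per_node / interval_s


if __name__ == "__main__":
    nodes = 1000                  # hypothetical node count
    ckpt = 10**9                  # hypothetical 1 GB checkpoint per node
    window = 100.0                # hypothetical 100 s coordinated commit window
    interval = 3600.0             # hypothetical 1 h checkpoint interval
    peak = coordinated_peak_bandwidth(nodes, ckpt, window)
    sustained = staggered_sustained_bandwidth(nodes, ckpt, interval)
    print(peak, sustained, peak / sustained)
```

The same data volume is written either way; spreading the writes over the interval reduces the bandwidth the I/O system must be provisioned for by the ratio of the interval to the commit window.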
Asynchronous checkpointing approaches are highly application-dependent
Figure panels: Apps and Benchmarks; Proxy Applications
Ferreira, et al. In submission.
• Note how bimodal these performance curves are!
• Clustered asynchronous checkpointing may hold promise here
Checkpoint-avoidance Systems
Levy, et al. In submission.
Cheap and powerful is here
No one inexpensive technique is enough, but each solves part of the problem
System software must stop trying to “rescue” the application and work with the application
◦ Application/runtime can cover part of the space
◦ System software can provide “last resort” solutions when the application cannot easily recover
◦ Right solution is application and hardware dependent
◦ Like it is for linear solvers and load balancing
Not just a resilience issue
Still have to solve the problems
Characterization of techniques at scale
Continued development of new techniques
Good decision support
◦ Yet more knobs someone needs to turn
◦ Many of the tradeoffs are non-linear, stochastic, etc.
◦ Different problem areas interact “interestingly”
◦ Complex influence on acquisition decisions, too
Clean interfaces to runtime and application
◦ “From a runtime developer’s perspective, the way that current operating systems manage resources is fundamentally broken” – Mike Bauer, Legion project
What do we need to enable this?
Linux (like OSF/1) will solve all your problems for you
◦ Whether you like it or not
◦ While making sure you can’t do the things you (think you) should do
◦ Which is fine, as long as you don’t need to do anything interesting
Current OSes are Helicopter Parents
Runtimes: “…it is the OS's job to provide mechanism and stay out of the way…”
Sandia lightweight kernels: “The QK provides mechanism, PCT encapsulates policy”
Go ahead and try – if you fall, I’ll catch you
Exascale OS must be your partner
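The mechanism/policy split quoted above can be sketched in a few lines. Everything here is hypothetical and purely illustrative: `LightweightKernel`, `bind`, `round_robin`, and `place` are invented names, not interfaces from Kitten, the QK/PCT, or any real kernel. The kernel object only enforces a binding it is handed; the runtime supplies the placement decision.

```python
from typing import Callable, Dict, List


class LightweightKernel:
    """Hypothetical kernel exposing one mechanism: binding a task to a core."""

    def __init__(self, num_cores: int) -> None:
        self.num_cores = num_cores
        self.bindings: Dict[str, int] = {}

    def bind(self, task: str, core: int) -> None:
        # Mechanism only: validate and enforce, never choose.
        # The range check is the backstop that catches a bad policy.
        if not 0 <= core < self.num_cores:
            raise ValueError(f"core {core} out of range")
        self.bindings[task] = core


# A policy maps (tasks, number of cores) to a task -> core assignment.
Policy = Callable[[List[str], int], Dict[str, int]]


def round_robin(tasks: List[str], num_cores: int) -> Dict[str, int]:
    """One possible runtime-supplied policy; the kernel knows nothing about it."""
    return {task: i % num_cores for i, task in enumerate(tasks)}


def place(kernel: LightweightKernel, tasks: List[str], policy: Policy) -> None:
    """The runtime decides placement; the kernel carries it out."""
    for task, core in policy(tasks, kernel.num_cores).items():
        kernel.bind(task, core)


if __name__ == "__main__":
    kernel = LightweightKernel(num_cores=2)
    place(kernel, ["a", "b", "c"], round_robin)
    print(kernel.bindings)
```

Swapping in a different policy requires no change to the kernel class, and a policy that asks for a nonexistent core is rejected rather than silently corrected.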
Applications more complex than when the LWK was originally designed
◦ Users want more complex interfaces and services
◦ Runtimes still want low-level hardware access
◦ But we still have to provide some level of isolation
◦ As well as backstop mechanisms in cooperation with hardware
Two predominant approaches:
◦ Composite OS (Fused OS, MAHOS, Argo OS/R, etc.)
◦ Virtualization (Kitten+Palacios VMM, Hobbes OS/R)
Lightweight OSes for Next-generation Systems
Safe low-level hardware access for runtime systems
Supports bringing your own OS with you
Don’t have to muck with the insides of Linux
Can be very fast
Why Virtualization?
HPCC FFT over virtualized 10GbE
CTH on Palacios/Kitten on Red Storm
Virtualization in Hobbes OS/R
Multiple virtualization architectures, not just one
Pick the point on the spectrum that provides the mechanisms your application/runtime needs
Interesting research challenges on the right mechanisms and interfaces to provide at and between each point
Figure: spectrum of virtualization architectures, with ends labeled Lightest Weight and Heaviest Weight. Points shown:
◦ LWK, Virtual Linux Environment (Kitten, CNK)
◦ LWK, Custom (Catamount, HybridVM)
◦ Fused OS, Multiple native OSes (Pisces, Argo)
◦ Para-virtual, Implicit: VMM Changes Guest OS (Gears, Guarded Modules)
◦ Para-virtual, Explicit: Guest OS Modified or Augmented (Orig. Xen, Device Drivers)
◦ Full HW VM: Runs Unmodified Guest OSes, Passthru (Palacios, KVM, …)
◦ Software Virt: Emulate HW, Binary Translation, … (QEMU, VMware, Emulate HW TransMemory pre-product)
Assumption is that the runtime (and/or virtualized OS) will do this for the LWK
Is a semi-static policy + local (HW or runtime) adaptation sufficient?
Or global dynamic adaptive runtime system that sets policy and resource allocation for millions of cores?
◦ With low overhead and application interference?
◦ “Burning a core” probably not viable at this problem size?
◦ Heuristics vs. more disciplined methods?
I want to believe but I have yet to see it
◦ Distributed, Decentralized
◦ Must be robust and efficient
◦ Can we tolerate imperfect and unfair?
Now that I’ve punted on policy…
No, the application and runtime really shouldn’t expect the OS to rescue it
System software can and should provide a range of modest, inexpensive mechanisms
◦ Which can backstop app when it can’t rescue itself
◦ Need well-quantified performance for techniques
◦ On real legacy and next-generation workloads
Virtualization can give the runtime the low-level mechanisms it wants inexpensively
Conclusions
Colleagues, collaborators and students on this work
◦ UNM: Dorian Arnold, Scott Levy, Cui Zheng
◦ Sandia: Ron Brightwell, Kurt Ferreira, Kevin Pedretti, Patrick Widener
◦ Northwestern: Peter Dinda, Lei Xia
◦ Oak Ridge: Barney Maccabe
◦ Pittsburgh: Jack Lange
Acknowledgements
This work was supported in part by:
◦ DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs
◦ Sandia National Labs including funding from the Hobbes project, which is funded by the 2013 Exascale Operating and Runtime Systems Program from the DOE Office of Science, Advanced Scientific Computing Research
◦ Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000
◦ U.S. National Science Foundation Awards CNS-0709168 and CNS-0707365
Acknowledgements (cont’d)