Metacomputing: What’s in it for ME?
Legion 1.3
Greg Lindahl,
Andrew Grimshaw, Adam Ferrari, Katherine Holcomb
This work partially supported by DARPA (Navy) contract #N66001-96-C-8527, DOE grant DE-FD02-96ER25290, DOE contract Sandia LD-9391, Northrup-Grumman (for the DoD HPCMOD/PET program), DOE D459000-16-3C, and DARPA (GA) SC H607305A
What is a Metasystem, anyway?
A geographically separated collection of people, computers, and storage
A fast network to connect them all
Software which makes this mess easy to use
It should be as easy to use as the single machine on your desktop
It should be easy to collaborate with other people all over the world
Fewer meteorological centers are islands today
So what’s wrong with existing systems?
Different sites have different software environments, even for identical OSes
Different sites don’t share filesystems, wasting your time ftping files around
Security policies often make using multiple sites much less convenient
But the fundamental problem is...
Stretching the old model (interacting but autonomous computers) to larger and larger systems results in incomplete and incompatible solutions
These solutions don’t scale for the future, nor do they work together today.
Our Vision - One Transparent System
High-performance, Distributed, Secure, Fault-tolerant, Transparent
Metacomputing Benefits
More effective collaboration, by sharing a workspace (video conferencing is not enough)
Higher performance, from use of off-site resources and easier construction of parallel and coupled applications
Improved productivity from a simpler environment (the holy grail that no one has delivered)
Example Application
Multi-scale climate modeling
El Niño: a global O/AGCM coupled with a regional weather model
Features: complicated information exchange; components written by different groups in different languages
The UCLA model uses Cray pointers and runs well on the T3E; the regional model isn’t parallel
The components run best on very different hardware (T3E, T90)
Issues in our example
Vendor software such as MPI doesn’t interoperate
Hard to start executing in 2 places at once (queues)
Fault-tolerance of whole system (more pieces to break)
Cross-site security
Coupling to visualization tools
Legion philosophy (CS slide)
Provide flexible mechanisms, not fixed policies
Allow users/application designers to choose a point in this space, based on their own requirements.
[Figure: a design space with axes "level of service", "kind of service", and "cost"]
Legion’s Concrete Benefits
Transparent, remote access to files
Transparent, remote execution
Wide-area parallel processing: bag of tasks, large parallel apps
Meta applications
Still sounds like a bunch of jargon, doesn’t it?
Remote file access today
NFS: requires super-user configuration and has awful security properties. Only one kind of synchronization, and it’s different from local files
SMBFS: ditto. OK, so users can share files, but it’s a security hole
Web: read-only, unless you use really bad security methods
File Access Tomorrow
Legion allows users to share persistent objects, securely, anywhere
Files just happen to have a particular interface
All properties set on a per-file basis: security, fault-tolerance, caching, special interfaces
Application-specific interfaces: today, only sequential and one type of array; Legion supports anything you can think of, including file-like objects which are actually whole simulations
Remote Execution Today
Pick which resource to use (by hand): check load averages, think about problem size, guess turn-around time
Copy your files around (program, data): a big pain if you don’t share filesystems
Deal with the queuing system
Complicated enough that most people just stick with one resource, even if it isn’t right for much of their work
OK, OK, I really meant...
The system should help automate picking the right resource, to optimize your turn-around time
The system should get your files there, even if "there" is a remote site
The system should provide uniform access to queuing systems
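The resource-picking decision above can be sketched as minimizing estimated turn-around time. This is an illustration only, not Legion's scheduler: the resource names, queue waits, and speeds are made-up numbers.

```python
# Hypothetical resources: estimated queue wait (hours) and relative speed.
resources = [
    {"name": "T3E", "queue_wait_h": 6.0, "speed": 40.0},
    {"name": "T90", "queue_wait_h": 1.0, "speed": 25.0},
    {"name": "SP2", "queue_wait_h": 0.2, "speed": 10.0},
]

def turnaround(res, work_units):
    # Turn-around time = time waiting in the queue + time computing.
    return res["queue_wait_h"] + work_units / res["speed"]

def pick(resources, work_units):
    # Choose the resource with the smallest estimated turn-around.
    return min(resources, key=lambda r: turnaround(r, work_units))

print(pick(resources, 50)["name"])  # a big job tolerates a longer queue
print(pick(resources, 5)["name"])   # a small job favors the empty queue
```

The point the slide makes is that the best choice depends on the job: the fastest machine is not always the right one once queue wait is counted.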
Wide-area Parallel Processing
Consider a parameter space study: a serial or parallel program, "big" or "small", run many times on slightly different input data
Sometimes called a "bag of tasks"
A great way to piss off your friends today: queue systems don’t like it when you submit 10,000 jobs
Hard to pick the right resource, but picking the wrong one gets you into trouble
Bag of Tasks Tomorrow
A metacomputing system can do this problem well: latency tolerant (by design) can use remote resources, or multiple resources
within one site
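The bag-of-tasks pattern is simple enough to sketch in a few lines. This is a generic illustration of the pattern, not Legion's API: workers pull independent runs from the bag as they become free, so a slow (high-latency) resource never blocks the rest.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(params):
    # Stand-in for one simulation run; here it just computes
    # a toy result from its slightly-different input data.
    return params["x"] ** 2

# The "bag": many independent runs over varying inputs.
bag = [{"x": x} for x in range(10)]

# Each worker models one (possibly remote) resource; tasks are
# handed out as workers finish, with no inter-task dependencies.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_task, bag))

print(results)
```

Because the tasks never communicate, the scheduler is free to spread them across sites, which is exactly why the problem suits a metasystem better than a single queue.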
Big Parallel Apps
Consider a big, explicit ocean model: MICOM, maybe NLOM
The app is sufficiently latency-tolerant to use 2 machines for one run; bigger problem sizes tolerate more latency
Across the room: a few microseconds; across the country: 80 milliseconds
Metacomputing provides the pieces to make this happen… more than cross-box MPI
Meta Applications
Multi-component program
Pieces used to be separate programs
Programs have hardware affinities
Programs have big datasets which live at geographically-remote sites
Today’s "couplers" are expensive to build and expensive to run (human time)
Tomorrow’s couplers will hopefully be easier
Authentication (login)
legion_login <userid>
Currently uses a password; other mechanisms can be easily added (SecureID)
The "login object" generates a certificate; this certificate identifies you in the future
Ideally, one "login object" should be able to give you access to all your MSRC accounts
A goal of cross-domain Kerberos, but will it be accomplished?
Unified Console
Run-time output flows back to one or more "tty objects"
One or more windows can "watch" the tty
Dynamic connection and disconnection of both writers and watchers
Secured by usual Legion security methods
Benefits: flexibility, fault-tolerance, sharing, security
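The dynamic attach/detach behavior described above is essentially a fan-out of output lines to whoever is currently watching. A minimal sketch of that pattern (a toy stand-in, not Legion's actual tty-object implementation, and with no security layer):

```python
class TtyObject:
    """Toy stand-in for a "tty object": each output line is fanned
    out to every watcher attached at the moment it is written."""

    def __init__(self):
        self.watchers = []

    def attach(self, watcher):
        # A watcher is any callable that accepts one output line.
        self.watchers.append(watcher)

    def detach(self, watcher):
        self.watchers.remove(watcher)

    def write(self, line):
        for watcher in self.watchers:
            watcher(line)

tty = TtyObject()
seen_a, seen_b = [], []

tty.attach(seen_a.append)
tty.write("step 1 done")
tty.attach(seen_b.append)   # a second window connects mid-run
tty.write("step 2 done")
tty.detach(seen_a.append)   # the first window disconnects
tty.write("step 3 done")
```

Each watcher sees only the lines written while it was connected, which is the flexibility the slide is after: windows come and go without disturbing the running job.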
Location-Independent Objects/Files
Network-wide, transparent filesystem
Programs can read/write files regardless of execution location
Minimal change: 1-2 lines of code per file
Benefits: transparent execution, sharing with others
Remote execution
Works with non-Legion binaries, shell scripts, whatever
Copies the binary and data files
The simplest way to do parameter space studies
legion_register_program <unix_path> <legion_path>
legion_run <legion_path> <parameters>
legion_run_multi -f spec <legion_path>
Binary Management
More than just remote file access
Compile your code for each architecture, possibly using legion_run… you don’t have to log in
Upload your binaries to Legion space: legion_register_binary legion_path Unix_path arch
Register repeatedly for different architectures
Legion moves the right binary to wherever it’s needed (caching)
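The per-architecture lookup plus caching behavior can be sketched as a small registry. The program names, architecture strings, and paths below are hypothetical; this only illustrates the idea of registering one binary per architecture and fetching each at most once.

```python
# Hypothetical registry, as built up by repeated registrations
# of the same program for different architectures.
registry = {
    ("myapp", "sparc-solaris"): "/legion/bin/myapp.sparc",
    ("myapp", "alpha-linux"): "/legion/bin/myapp.alpha",
}

# Cache on the execution host: each binary is fetched once,
# then reused for later runs on the same architecture.
cache = {}

def binary_for(program, arch):
    key = (program, arch)
    if key not in cache:
        cache[key] = registry[key]  # stand-in for the remote fetch
    return cache[key]
```

Run-time then only has to know the host's architecture; the registry resolves it to the right binary, and the cache keeps repeat runs from re-copying it.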
Parallel Computing
MPI and PVM require relinking with our libraries
legion_mpi_run -n 4 my_program
Parallel programs in Java, C++ (MPLC extension), BFS (Fortran dataflow!)
We expose the runtime system to compiler writers and tool builders
Legion Status -- 1.3
Testbeds run cross-country continuously
Glues our testbed shared-nothing cluster together
Transparent files, remote execution, MPI, and one style of security are here
Scheduling in beta; fault-tolerance a work in progress
Deployment: NPACI, DoD MSRCs, NASA, DoE