Date post: | 21-Jan-2016 |
Upload: | peter-hodge |
Distributed Applications: Examining the Past, Understanding the Present, Preparing for the Future(Grid)
Shantenu Jha
Director, Cyber-Infrastructure Development, CCT
Computer Science
e-Science Institute, Edinburgh
http://www.cct.lsu.edu/~sjha
http://saga.cct.lsu.edu
Outline
Critical Perspective on Large-Scale Distributed Applications and Production Cyber-Infrastructure (CI)
Understanding Distributed Applications (DA)
  How DA differ from HPC or parallel (||) applications; challenges of DA
  DA Development Objectives (IDEAS)
Understanding SAGA
Using SAGA to develop Distributed Applications
  Frameworks
  Abstractions for Dynamic Execution
  Data-Intensive Applications
Discuss how IDEAS are met
Derive (Initial) User Requirements/Requests for FutureGrid
Critical Perspectives
Distributed CI: Is the whole greater than the sum of the parts?
Several BIG projects have success stories on TG (TeraGrid)
  But REAL science happens at ALL SCALES
  Tools for individual users to innovate and develop?
Infrastructure capabilities and policy determine application development, deployment and execution:
  Proportion of applications that utilize multiple distributed sites sequentially, concurrently or asynchronously is low (~5%)
  Not referring to tightly-coupled applications across multiple sites
  TG (exclusively) supported legacy, static execution models
Move data to computing, or compute where the data is?
Distributed Data/Jobs vs. bringing it all into the Cloud
What novel applications & science has Distributed CI fostered?
• Fundamentally a hard problem:
  • Dynamical resources, heterogeneous resources
  • Variable control (or lack thereof)
• Add to it: complex underlying infrastructure provisioning
• Programming systems for distributed applications:
  • Incomplete? Customization? Extensibility?
• Computational models of distributed computing
• Design points: more than (peak) performance
• Primary role of usage modes
• Range of DA, no clear taxonomy
Understanding Distributed Applications Development Challenges
Understanding Distributed Applications: Development Challenges
Distributed Applications Require: Coordination over Multiple & Distributed sites:
Scale-up and Scale-out Logically or physically Distributed
1st Gen of Peta/Exa/Zetta/Yotta -- Applications requiring multiple-runs, ensembles, workflows..
Core characteristics and challenges of logically and physically distributed applications are SAME
Inter-play of Requirements, Infrastructure, Usage Mode
Ability to develop simple, novel or effective distributed Applications lags behind other aspects of CI
General-purpose Distributed Application development is lacking in NSF/OCI's portfolio…
Understanding Distributed Applications Development Objectives
Interoperability: Ability to work across multiple distributed resources
Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
Simplicity: Accommodate above distributed concerns at different levels easily…
Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?
SAGA: Basic Philosophy
There exists a lack of programmatic approaches that:
  Provide general-purpose, common grid functionality for applications and thus hide underlying complexity, varying semantics..
  Provide the building blocks upon which to construct “consistent” higher levels of functionality and abstractions
  Hide “bad” heterogeneity, and provide the means to address “good” heterogeneity
Meets the need for a broad spectrum of applications:
  Simple scripts, Gateways, Smart Applications and Production-Grade Tooling, Workflow…
Simple, integrated, stable, uniform and high-level interface
  Simple and Stable: 80:20 restricted scope and Standard
  Integrated: Similar semantics & style across
  Uniform: Same interface for different distributed systems
SAGA: Provides Application* developers with the basic units required to compose high-level functionality across (distinct) distributed systems
(*) One Person’s Application is another Person’s Tool
SAGA: The Standard Landscape
SAGA: In a thousand words..
SAGA Job Submission: Role of Adaptors (middleware binding)
SAGA Job API: Example
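To illustrate the role of adaptors, here is a minimal Python sketch of a uniform job interface whose backend is chosen from the URL scheme. It is not the SAGA API: `JobService`, `LocalAdaptor`, `EchoAdaptor`, the `ADAPTORS` registry and the `local://` and `echo://` schemes are all hypothetical names invented for this example.

```python
import subprocess
from urllib.parse import urlparse


class LocalAdaptor:
    """Hypothetical adaptor that runs the job as a local process."""

    def run(self, executable, arguments):
        result = subprocess.run([executable, *arguments],
                                capture_output=True, text=True)
        return result.returncode, result.stdout


class EchoAdaptor:
    """Hypothetical stand-in for a remote middleware binding.

    It only reports what it would submit; no middleware is contacted.
    """

    def run(self, executable, arguments):
        return 0, "would submit: " + " ".join([executable, *arguments])


# Registry mapping URL schemes to adaptors (illustrative, not SAGA's own).
ADAPTORS = {"local": LocalAdaptor, "echo": EchoAdaptor}


class JobService:
    """Uniform front end: application code never names the middleware."""

    def __init__(self, url):
        scheme = urlparse(url).scheme
        if scheme not in ADAPTORS:
            raise ValueError(f"no adaptor for scheme {scheme!r}")
        self._adaptor = ADAPTORS[scheme]()

    def run_job(self, executable, arguments=()):
        return self._adaptor.run(executable, list(arguments))
```

The application calls `JobService(url).run_job(...)` in the same way for every backend; only the URL changes, which is the point of binding middleware through adaptors.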
SAGA: Other Packages
SAGA and Distributed Applications
SAGA-based Frameworks: Types
Frameworks: Logical structure for capturing application requirements, characteristics & patterns
Runtime and/or Application Framework
Application Frameworks designed to either:
  Pattern: Commonly recurring modes of computation
    Programming, Deployment, Execution, Data-access..
    MapReduce, Master-Worker, H-J Submission
  Abstraction: Mechanism to support patterns and application characteristics
Runtime Frameworks: Load-Balancing – Compute and Data Distribution
SAGA-based Framework: Infrastructure-independent
Abstractions for Dynamic Execution (1): Container Task
Adaptive:
Type A: Fix the number of replicas; vary the cores assigned to each replica.
Type B: Fix the size of each replica; vary the number of replicas (Cool Walking)
-- Same temperature range (adaptive sampling)
-- Greater temperature range (enhanced dynamics)
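The two adaptive modes can be sketched as two allocation rules. This is a minimal illustration, not the scheme used in the work presented: the function names are hypothetical, and the evenly spaced temperature ladder is an assumption chosen for simplicity.

```python
def type_a_allocation(total_cores, n_replicas):
    """Type A: fixed replica count; cores per replica follow the resource."""
    if total_cores < n_replicas:
        raise ValueError("need at least one core per replica")
    base, extra = divmod(total_cores, n_replicas)
    # Spread leftover cores over the first `extra` replicas.
    return [base + 1 if i < extra else base for i in range(n_replicas)]


def type_b_allocation(total_cores, cores_per_replica):
    """Type B: fixed replica size; replica count follows the resource."""
    return total_cores // cores_per_replica


def temperatures(n_replicas, t_min, t_max):
    """Evenly spaced temperature ladder (an assumption of this sketch)."""
    if n_replicas == 1:
        return [t_min]
    step = (t_max - t_min) / (n_replicas - 1)
    return [t_min + i * step for i in range(n_replicas)]
```

Under Type B, extra replicas can either be packed into the same temperature range (call `temperatures` with more replicas and the same bounds) or used to extend the range (raise `t_max`), matching the two sub-cases on the slide.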
Abstractions for Dynamic Execution (2): SAGA Pilot-Job (BigJob)
Coordinate Deployment & Scheduling of Multiple Pilot-Jobs
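The Pilot-Job idea can be sketched as late binding of tasks to placeholder jobs that already hold cores. The sketch below is a toy simulation and not the BigJob implementation; `Pilot`, `schedule` and the task dictionaries are hypothetical, and the first-fit policy is an assumption.

```python
from collections import deque


class Pilot:
    """Hypothetical placeholder job: holds cores once the batch queue starts it."""

    def __init__(self, name, cores):
        self.name = name
        self.free_cores = cores
        self.running = []

    def try_start(self, task):
        """Run the task inside the pilot if enough cores are free."""
        if task["cores"] > self.free_cores:
            return False
        self.free_cores -= task["cores"]
        self.running.append(task["id"])
        return True


def schedule(pilots, tasks):
    """Late binding: assign each task to the first active pilot with room."""
    pending = deque(tasks)
    placement = {}
    unplaced = []
    while pending:
        task = pending.popleft()
        for pilot in pilots:
            if pilot.try_start(task):
                placement[task["id"]] = pilot.name
                break
        else:
            unplaced.append(task["id"])  # no pilot had enough free cores
    return placement, unplaced
```

Because tasks are bound to pilots only after the pilots are active, coordinating several pilots on different machines reduces to choosing which pilot's free cores to use next.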
Distributed Adaptive Replica Exchange (DARE): Scale-Out, Dynamic Resource Allocation and Aggregation
Multi-Physics Runtime Frameworks: Extensibility
Coupled multi-physics requires two distinct but concurrent simulations
Can co-scheduling be avoided?
Adaptive execution model: Yes
Load-balancing required. Pilot-Job facilitates LB! Across sites? (open Q)
First demonstrated multi-platform Pilot-Job:
MPI-based TG – Condor GI
Dynamic Execution Reduced Time to Solution
Ensemble Kalman Filters: Heterogeneous Sub-Tasks
Ensemble Kalman filters (EnKF) are recursive filters for handling large, noisy data; here the EnKF is used for history matching and reservoir characterization
EnKF is a particularly interesting case of irregular, hard-to-predict run time characteristics:
Results: Scale-Out Performance
Using more machines decreases the TTC and variation between experiments
Using BQP decreases the TTC & variation between experiments further
Lowest time to completion achieved when using BQP and all available resources
(Khamra & Jha, GMAC, ICAC’09)
But Why Does BQP Help? The Case for System Sensors
Autonomic Integration of HPC Grids-Clouds
EnKF: Extensibility and Interoperability
(work with M. Parashar et al.; accepted for e-Science 2009)
• Application Objectives:
  • Acceleration
  • Resilience
  • Conservation
• Pull vs Push Task map
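The difference between a push and a pull task map can be sketched as follows. This is an illustrative simulation only: the function names, the round-robin push policy, and the cost and speed numbers are assumptions, not details of the autonomic framework referenced on the slide.

```python
from collections import deque


def push_map(tasks, workers):
    """Push: the master fixes the task-to-worker map up front (round-robin)."""
    assignment = {w: [] for w in workers}
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return assignment


def pull_map(task_costs, worker_speeds):
    """Pull: each worker takes the next task whenever it becomes idle.

    Simulated with a per-worker clock; the idle worker is the one whose
    clock is smallest. Costs and speeds are illustrative numbers.
    """
    queue = deque(task_costs.items())
    clock = {w: 0.0 for w in worker_speeds}
    assignment = {w: [] for w in worker_speeds}
    while queue:
        task, cost = queue.popleft()
        worker = min(clock, key=clock.get)  # ties go to the first worker
        assignment[worker].append(task)
        clock[worker] += cost / worker_speeds[worker]
    return assignment, max(clock.values())
```

With pull, faster workers naturally take more tasks, which is why it suits heterogeneous resources whose speed is not known in advance.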
Application-level Interoperability: Cloud-Cloud; Cloud-Grid
Application-level (ALI) vs. System-level Interoperability (SLI)
  Infrastructure independence is a prerequisite for ALI
The case for both Grids AND Clouds:
  Hybrid & heterogeneous workloads: data-compute affinity differs
  Availability zone, data-transfer cost..
  Complex data-flow dependency: need runtime determination
Just because you can use Grids AND Clouds, should you?
Important research question: When should you?
  Runtime decision: mechanism to determine when/if?
  Should be influenced by application objectives
  Programming model should be infrastructure independent
Same application, priced differently, for same performance
Same application, priced same, for different performance
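One way to picture a runtime decision driven by application objectives is a small selection function. The sketch below is hypothetical throughout: `choose_resource`, the `est_time` and `est_cost` fields, and the tie-breaking rules are assumptions made for illustration, not a mechanism described in the talk.

```python
def choose_resource(options, objective, deadline=None, budget=None):
    """Pick a resource for the next task according to the application objective.

    options: list of dicts with hypothetical keys "name", "est_time", "est_cost".
    objective: "acceleration" (minimize time) or "conservation" (minimize cost).
    """
    feasible = [
        o for o in options
        if (deadline is None or o["est_time"] <= deadline)
        and (budget is None or o["est_cost"] <= budget)
    ]
    if not feasible:
        return None  # nothing satisfies the constraints
    if objective == "acceleration":
        key = lambda o: (o["est_time"], o["est_cost"])
    elif objective == "conservation":
        key = lambda o: (o["est_cost"], o["est_time"])
    else:
        raise ValueError(f"unknown objective {objective!r}")
    return min(feasible, key=key)["name"]
```

The same application can therefore land on a grid or a cloud depending on whether time or cost matters more at that moment, without the application code naming either infrastructure.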
SAGA-based Frameworks: Examples
SAGA-based Pilot-Job Framework (FAUST)
  Extend to support load-balancing for multi-components
SAGA MapReduce Framework:
  Control the distribution of Tasks (workers)
  Master-Worker: File-Based &/or Stream-Based
  Data-locality optimization using SAGA’s replica API
SAGA NxM Framework:
  Compute Matrix Elements, each is a Task
  All-to-All Sequence comparison
  Control the distribution of Tasks and Data
  Data-locality optimization via external (runtime) module
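The NxM pattern, where each matrix element is one task, can be sketched in a few lines. This is a toy illustration and not the SAGA NxM framework: the `similarity` metric is a placeholder, and round-robin dealing of tasks to workers is an assumption.

```python
from itertools import product


def similarity(a, b):
    """Toy comparison: fraction of matching positions (placeholder metric)."""
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / length


def make_tasks(set_n, set_m):
    """One task per matrix element (i, j)."""
    return list(product(range(len(set_n)), range(len(set_m))))


def run_all_pairs(set_n, set_m, n_workers):
    """Deal tasks round-robin to workers, then assemble the N x M matrix."""
    tasks = make_tasks(set_n, set_m)
    chunks = [tasks[w::n_workers] for w in range(n_workers)]
    matrix = [[None] * len(set_m) for _ in set_n]
    for chunk in chunks:            # each chunk would run on one worker
        for i, j in chunk:
            matrix[i][j] = similarity(set_n[i], set_m[j])
    return matrix
```

Because every element is independent, the only real design decision is how `chunks` is built, which is where control over task and data distribution enters.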
Distributed Data Intensive Applications: Research Challenges
Goal: Develop DDI scientific applications to utilize a broad range of distributed systems, without vendor lock-in, or disruption, yet with the flexibility and performance that scientific applications demand.
Frameworks as possible solutions
Frameworks address some primary challenges in developing Distributed DI Applications
  Coordination of distributed data & computing
  Runtime (dynamic) scheduling, placement
  Fault-tolerance
Many challenges in developing such Frameworks:
  What are the components? How are they coupled?
  Functionality expressed/exposed? Coordination?
  Layering, ordering, encapsulation of components
“Just because you can’t use MPI (on distributed systems), doesn’t mean you can’t use other approaches”
Frameworks: Logical ordering
SAGA
SAGA-MapReduce (Miceli, Jha et al., CCGrid’09; Merzky, Jha et al., GPC’09)
Interoperability: Use multiple infrastructures concurrently
Control the NW placement
Simple staging of data
SAGA-Sphere-Sector: Open Cloud Consortium
  Stream processing model
  Ongoing work
  Apply to all elements (files) in a data-set (stream)
Ts: Time-to-solution, including data-staging for SAGA-MapReduce (simple file-based mechanism)
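For readers unfamiliar with the pattern, a master-worker MapReduce can be sketched in plain Python as a word count. This is a generic illustration, not SAGA-MapReduce: the function names and the byte-sum partitioner are assumptions, and everything runs in one process where a real deployment would place map and reduce tasks on distributed workers.

```python
from collections import defaultdict


def map_words(chunk):
    """Map: emit (word, 1) for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def partition(pairs, n_reducers):
    """Shuffle: route each key to a reducer with a stable toy partitioner."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        index = sum(key.encode()) % n_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_counts(bucket):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def map_reduce(chunks, n_reducers=2):
    """Master: run map over chunks, shuffle, reduce, and merge the results."""
    pairs = [pair for chunk in chunks for pair in map_words(chunk)]
    result = {}
    for bucket in partition(pairs, n_reducers):
        result.update(reduce_counts(bucket))
    return result
```

Each chunk corresponds to one staged input file in a file-based setup, so the time-to-solution includes both the staging of chunks and the map, shuffle and reduce phases.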
Controlling Relative Compute-Data Placement
SAGA All-Pairs: Runtime Data Placement
Classical: Place tasks on 4 LONI machines (512px Dell Clusters)
  Simple data staging
“Intelligent”: Map a task to a resource based upon Cost
  Cost = Data Dependency + transfer times (latency)
“Ignoring intelligent mapping is no longer an option” (quote from undergraduate Miceli)
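The cost rule above can be sketched as a placement function. This is a minimal illustration under stated assumptions: data dependency is modeled as the number of input files not yet on the resource, the two terms are simply added without weighting, and all names (`placement_cost`, `classical_map`, `intelligent_map`, the task and location dictionaries) are hypothetical.

```python
def placement_cost(task, resource, data_location, transfer_time):
    """Cost = data dependency + transfer time, following the slide's rule.

    Data dependency is modeled as the number of input files that are not
    already on the resource; each missing file adds its transfer time.
    """
    missing = [f for f in task["inputs"] if data_location[f] != resource]
    transfer = sum(transfer_time[(data_location[f], resource)] for f in missing)
    return len(missing) + transfer


def classical_map(tasks, resources):
    """Baseline: round-robin placement that ignores where the data lives."""
    return {t["id"]: resources[i % len(resources)] for i, t in enumerate(tasks)}


def intelligent_map(tasks, resources, data_location, transfer_time):
    """Place each task on the resource with the lowest cost."""
    return {
        t["id"]: min(resources,
                     key=lambda r: placement_cost(t, r, data_location, transfer_time))
        for t in tasks
    }
```

Evaluating the cost for every task and resource is itself work, which is the overhead a comparison of the two approaches has to account for alongside processing time.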
[Figure: bar chart comparing Classical and Intelligent placement; series are Processing Time and "Intelligence" Time; axis from 0 to 600]
Understanding Distributed Applications Development Objectives Redux
Interoperability: Ability to work across multiple distributed resources
  SAGA: Middleware agnostic
Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
  Support multiple Pilot-Jobs: Ranger, Abe, QB
Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
  Pilot-Job also coupled CFD-MD, integrated BQP
Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
Simplicity: Accommodate above distributed concerns at different levels easily…
Does SAGA Provide A Fresh Perspective?
Early User: An Environment that Supports
Echo what Andrew Grimshaw said!! e.g., test-bed for standards interoperation
Trivial remarks:
  Not obsessed with system utilization like TG
  Policies that support IDEAS as first-class concerns
Support dynamic, first-principles, explicitly distributed applications
Dynamic, adaptive applications:
  Dynamic resource utilization: e.g., BQP (Jha et al., GMAC, ICAC Barcelona 2009)
  Grid Observatory (EGEE) – all kinds of traces
Dynamic adaptive data:
  Network-aware application (Jha et al., IEEE eScience ’07)
  Data scheduler: big data, frequent data
Early User: An Environment that Supports
Autonomic computational science applications
  Support the tuning of and by applications
Platform for developing (SAGA) AF and RT Frameworks
  Design, stand up and experiment with Frameworks
  e.g., load-balancer for dynamic resource allocation
  SAGA-MapReduce, NxM
  e.g., control relative placement of data/compute
Supporting distributed abstractions – development, deployment and execution-level
A controlled but realistic environment
  RAIN – dynamic provisioning (provide clean API)
  (Reproducible) Experimental Manager, VAMPIR
  [Connection with Grid Observatory]
SAGA-based Tools and Projects: One person’s Tool is another person’s Application
DESHL DEISA-based Shell and Workflow library
JSAGA from IN2P3 (Lyon) http://grid.in2p3.fr/jsaga/index.html
GANGA-DIANE gLite
XtreemOS (Based upon SAGA for the Distribution)
NAREGI/KEK SD Specification
With gLite adaptors
Advantage of Standards
Acknowledgements
SAGA Team and DPA Team and the UK-EPSRC (UK EPSRC: DPA, OMII-UK, OMII-UK PAL)
People:
SAGA D&D: Hartmut Kaiser, Ole Weidner, Andre Merzky, Joohyun Kim, Lukasz Lacinski, João Abecasis, Chris Miceli, Bety Rodriguez-Milla
SAGA Users: Andre Luckow, Yaakoub el-Khamra, Kate Stamou, Cybertools (Abhinav Thota, Jeff, N. Kim), Owain Kenway
Google SoC: Michael Miceli, Saurabh Sehgal, Miklos Erdelyi
Collaborators and Contributors: Steve Fisher & Group, Sylvain Renaud (JSAGA), Go Iwai & Yoshiyuki Watase (KEK)
DPA: Dan Katz, Murray Cole, Manish Parashar, Omer Rana, Jon Weissman
Abstractions for Distributed Applications and Systems: A Computational Science Perspective
Authors: S Jha, D Katz, M Parashar, O Rana, J Weissman
Upcoming book by Wiley (Summer 2010)
SAGA: Building the abstractions to Bridge the Infrastructure-Applications Gap
Focus on Application Development and Characteristics, not infrastructure details
Interoperability
Application Development Phase
Generation & Exec. Planning Phase
Execution Phase
DAG-based Workflow Applications: Extensibility Approach
SAGA-based DAG Execution: Preserving Performance
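DAG execution can be sketched with Python's standard `graphlib` module: tasks whose dependencies are satisfied form a wave and could be submitted concurrently. This is an illustration of the general technique, not the SAGA-based DAG enactor; `execute_dag`, the `run_task` callback and the task names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execute_dag(dependencies, run_task):
    """Run a workflow DAG in waves of mutually independent tasks.

    dependencies maps each task to the set of tasks it depends on.
    Returns the list of waves; tasks within a wave could run concurrently.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                        # raises CycleError on cycles
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        for task in ready:
            run_task(task)                  # a real system would submit a job here
        sorter.done(*ready)
        waves.append(ready)
    return waves
```

For a workflow such as `{"stage_in": set(), "sim_a": {"stage_in"}, "sim_b": {"stage_in"}, "analyze": {"sim_a", "sim_b"}}`, the two simulations fall into the same wave, which is where a distributed backend can recover performance by running them on different resources.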