1
Resource Management of Large-Scale Applications on a Grid
Laukik Chitnis and Sanjay Ranka (with Paul Avery, Jang-uk In and Rick Cavanaugh)
Department of CISE, University of Florida
[email protected]
352 392 6838
(http://www.cise.ufl.edu/~ranka/)
2
Overview
High End Grid Applications and Infrastructure at the University of Florida
Resource Management for Grids
Sphinx Middleware for Resource Provisioning
Grid Monitoring for better meta-scheduling
Provisioning Algorithm Research for multi-core and grid environments
3
The Evolution of High-End Applications
(and their system characteristics)
1980: Mainframe Applications
• Central mainframes

1990: Compute Intensive Applications
• Large clusters
• Supercomputers

2000: Data Intensive Applications
• Geographically distributed datasets
• High speed storage
• Gigabit networks
4
Some Representative Applications
HEP, Medicine, Astronomy, Distributed Data Mining
5
Representative Application: High Energy Physics
[Figure: global HEP collaboration: 1-10 petabytes, 1000+, 20+ countries]
6
Representative Application: Tele-Radiation Therapy
[Figure: RCET (Center for Radiation Oncology) tele-radiation therapy workflow. Imaging devices (MAGNETOM), treatment planning systems, and film scanners produce DICOM, DICOM-RT, and RTOG data; NetSys image, DICOM-RT, and RTOG readers and a web-based upload/download tool move the data from clinic PCs into the RCET server and RCET database (with SOANS and an external database); investigator PCs use web-based electronic folder and rapid review tools to visualize (2D/3D), view DVH and iso-dose plots, cut planes, and case info, review cases, modify structures, and annotate.]
7
Representative Application: Distributed Intrusion Detection
NSF ITR Project: Middleware for Distributed Data Mining (PI: Ranka, joint with Kumar and Grossman)
[Figure: applications built on top of Data Mining and Scheduling Services, Data Management Services, and Data Transport Services]
8
Grid Infrastructure
Florida Lambda Rail and UF
9
Campus Grid (University of Florida)
NSF Major Research Instrumentation Project (PI: Ranka, Avery et al.)
• 20 Gigabit/sec network
• 20+ Terabytes of storage
• 2-3 Teraflops
• 10 scientific and engineering applications
[Figure: an Infiniband-based cluster and a Gigabit Ethernet-based cluster]
10
Grid Services
The software part of the infrastructure!
11
Services offered in a Grid
• Resource Management Services
• Data Management Services
• Monitoring and Information Services
• Security Services
Note that all the other services use the security services.
12
Resource Management Services
Provide a uniform, standard interface to remote resources including CPU, Storage and Bandwidth
Main component is the remote job manager
Ex: GRAM (Globus Resource Allocation Manager)
13
Resource Management on a Grid
[Figure: a user submits jobs to the Grid through GRAM; each site (Site 1, Site 2, Site 3, ..., Site n) runs its own local scheduler, such as Condor, PBS, LSF, or plain fork. Narration: note the different local schedulers.]
14
Scheduling your Application
15
Scheduling your Application
• An application can be run on a grid site as a job
• The modules in the grid architecture (such as GRAM) allow uniform access to the grid sites for your job
• But... most applications can be "parallelized", and these separate parts can be scheduled to run simultaneously on different sites, thus utilizing the power of the grid
16
Modeling an Application Workflow
• Many workflows can be modeled as a Directed Acyclic Graph (DAG)
• The amount of resource required by each task (in units of time) is known to some degree of certainty
• There is a small probability of failure in execution (in a grid environment this can happen because resources are no longer available)
[Figure: a Directed Acyclic Graph]
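To make the model concrete, here is a minimal Python sketch (the task names, times, and failure probabilities are illustrative, not taken from the talk) of a workflow DAG whose tasks carry a time estimate and a small failure probability; the longest path through the graph gives the earliest possible completion time.

```python
# Illustrative workflow: each task has an estimated run time (minutes)
# and a small probability of failing on a grid resource.
tasks = {
    "stage_in":    {"time": 5,  "p_fail": 0.01},
    "simulate":    {"time": 60, "p_fail": 0.05},
    "reconstruct": {"time": 40, "p_fail": 0.05},
    "analyze":     {"time": 20, "p_fail": 0.02},
    "stage_out":   {"time": 5,  "p_fail": 0.01},
}
# Directed acyclic graph: parent -> children
edges = {
    "stage_in": ["simulate"],
    "simulate": ["reconstruct", "analyze"],
    "reconstruct": ["stage_out"],
    "analyze": ["stage_out"],
    "stage_out": [],
}

def longest_path_time(tasks, edges):
    """Earliest finish time of the whole DAG if every task starts as soon
    as its parents finish (i.e. the critical-path length)."""
    finish = {}
    def finish_time(t):
        if t not in finish:
            parents = [p for p, children in edges.items() if t in children]
            start = max((finish_time(p) for p in parents), default=0)
            finish[t] = start + tasks[t]["time"]
        return finish[t]
    return max(finish_time(t) for t in tasks)

if __name__ == "__main__":
    print("critical-path time:", longest_path_time(tasks, edges), "minutes")
    # Probability that the whole workflow runs without any task failure
    p_ok = 1.0
    for t in tasks.values():
        p_ok *= 1.0 - t["p_fail"]
    print("P(no task fails) = %.3f" % p_ok)
```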
17
Workflow Resource Provisioning
The problem: executing multiple workflows over distributed and adaptive (faulty) resources while managing policies.
[Figure: Applications (large, multiple, data intensive, with time constraints) are mapped onto Resources (distributed, multi-core, heterogeneous, faulty) subject to Policies (priority, access control, precedence, quota, ownership)]
18
A Real Life Example from High Energy Physics
Merge two grids into a single multi-VO "Inter-Grid"
How to ensure that neither VO is harmed? that both VOs actually benefit? that there are answers to questions like: "With what probability will my job be scheduled and complete before my conference deadline?"
Clear need for a scheduling middleware!
[Figure: map of participating sites: FNAL, Rice, UI, MIT, UCSD, UF, UW, Caltech, UM, UTA, ANL, IU, UC, LBL, SMU, OU, BU, BNL]
19
Typical scenario
[Figure: a VDT client faces several VDT servers and must decide where to submit ("?")]
20
Typical scenario
[Figure: the same VDT client, still with no way to decide among the VDT servers ("?", "@#^%#%$@#")]
21
Some Requirements for Effective Grid Scheduling
Information requirements
• Past & future dependencies of the application
• Persistent storage of workflows
• Resource usage estimation
• Policies: expected to vary slowly over time
• Global views of job descriptions
• Request tracking and usage statistics: state information important
• Resource properties and status: expected to vary slowly with time
• Grid weather: latency of measurement important
• Replica management

System requirements
• Distributed, fault-tolerant scheduling
• Customisability
• Interoperability with other scheduling systems
• Quality of Service
22
Incorporate Requirementsinto a Framework
Assume the GriPhyN Virtual Data Toolkit (VDT):
• Client (request/job submission): Globus clients, Condor-G/DAGMan, Chimera Virtual Data System
• Server (resource gatekeeper): MonALISA Monitoring Service, Globus services, RLS (Replica Location Service)
[Figure: a VDT client choosing among several VDT servers ("?")]
23
Incorporate Requirementsinto a Framework
Assume the Virtual Data Toolkit (VDT):
• Client (request/job submission): Clarens Web Service, Globus clients, Condor-G/DAGMan, Chimera Virtual Data System
• Server (resource gatekeeper): MonALISA Monitoring Service, Globus services, RLS (Replica Location Service)
Framework design principles:
• Information driven
• Flexible client-server model
• General, but pragmatic and simple
• Avoid adding middleware requirements on grid resources
[Figure: a Recommendation Engine sits between the VDT client and the VDT servers]
24
Related Provisioning Software

System                                      | Adaptive Scheduling | Co-allocation | Fault-tolerant | Policy-based | QoS support | Flexible interface
Nimrod-G (economy-driven, deadline support) | X | O | X | X | O | X
Maui/Silver (priority-based, reservation)   | O | O | X | O | O | X
PBS (batch job scheduling, queue-based)     | X | O | X | X | O | X
EZ-Grid (policy-based)                      | X | O | X | O | X | O
Prophet (parallel SPMD)                     | X | X | X | X | O | X
LSF (interactive, batch modes)              | X | O | O | O | O | X
25
Innovative Workflow Scheduling Middleware
• Modular system: automated scheduling procedure built from modular services
• Robust and recoverable system: database infrastructure, fault-tolerant and recoverable from internal failures
• Platform-independent, interoperable system: XML-based communication protocols (SOAP, XML-RPC); supports heterogeneous service environments
• 60 Java classes, 24,000 lines of Java code, 50 test scripts, 1,500 lines of script code
26
The Sphinx Workflow Execution Framework
[Figure: the Sphinx server sits between a VDT client (Sphinx client, Chimera Virtual Data System, Condor-G/DAGMan, request processing) and VDT server sites (Globus resources, MonALISA Monitoring Service, Replica Location Service), with Data Warehouse, Data Management, and Information Gathering modules, all communicating over a Clarens web-service backbone]
27
Sphinx Workflow Scheduling Server
Functions as the nerve centre
• Data Warehouse: policies, account information, grid weather, resource properties and status, request tracking, workflows, etc.
• Control Process: a finite state machine; different modules modify jobs, graphs, and workflows, and change their state
• Flexible and extensible
[Figure: Sphinx server modules around the Control Process: Message Interface, Graph Reducer, Graph Tracker, Graph Predictor, Graph Data Planner, Graph Admission Control, Job Predictor, Job Admission Control, Job Execution Planner, Data Warehouse, Data Management, Information Gatherer]
28
SPHINX
Scheduling in Parallel for Heterogeneous Independent NetworXs
29
Policy Based Scheduling
Sphinx provides "soft" QoS through time-dependent, global views of:
• Submissions (workflows, jobs, allocation, etc.)
• Policies
• Resources
Uses linear programming methods to:
• Satisfy constraints (policies, user requirements, etc.)
• Optimize an "objective" function
• Estimate probabilities of meeting deadlines within policy constraints
J. In, P. Avery, R. Cavanaugh, and S. Ranka, "Policy Based Scheduling for Simple Quality of Service in Grid Computing", in Proceedings of the 18th IEEE IPDPS, Santa Fe, New Mexico, April, 2004
[Figure: the policy space as a three-dimensional cube with axes Submissions, Resources, and Time]
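The talk does not spell out the LP formulation, so the following is only a hedged sketch of the idea using scipy.optimize.linprog with made-up numbers: split a submission's CPU-hours across sites so that per-site policy quotas (the constraints) are respected while a cost standing in for grid weather or expected delay (the objective) is minimized.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical inputs: 3 sites, a 100 CPU-hour submission.
demand = 100.0
cost_per_hour = np.array([1.0, 1.8, 1.3])    # relative expected delay ("grid weather")
policy_quota  = np.array([40.0, 80.0, 50.0])  # per-site quota for this user/VO (policy)

# Decision variables x[i] = CPU-hours placed at site i.
# minimize   cost . x
# subject to sum(x) == demand,  0 <= x[i] <= quota[i]
res = linprog(
    c=cost_per_hour,
    A_eq=np.ones((1, 3)), b_eq=[demand],
    bounds=list(zip(np.zeros(3), policy_quota)),
    method="highs",
)

print("allocation per site:", res.x)        # e.g. [40., 10., 50.]
print("objective (weighted hours):", res.fun)
```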
30
Ability to tolerate task failures
[Chart: average DAG completion time (30 DAGs x 10 jobs/DAG), in seconds, for the "# of CPUs based", "Round-robin", "# of CPUs based without feedback", and "Round-robin without feedback" scheduling algorithms (y-axis 2000-3400 s)]
[Chart: number of timed-out jobs (120 DAGs x 10 jobs/DAG), log scale: Completion time based 125, Queue length based 386, # of CPUs based 327, Round robin 154, # of CPUs based without feedback 2258]
• Significant Impact of using feedback information
Jang-uk In, Sanjay Ranka et. al. "SPHINX: A fault-tolerant system for scheduling in dynamic grid environments", in Proceedings of the 19th IEEE IPDPS, Denver, Colorado, April, 2005
31
Grid Enabled Analysis
SC|03: Distributed Services for Grid-Enabled Data Analysis
[Figure: a ROOT data analysis client connects through Clarens web services to the Sphinx scheduling service, the Chimera virtual data service, the Sphinx/VDT execution service, the RLS replica location service, and the MonALISA monitoring service; VDT resource and file services at Fermilab, Caltech, Iowa, and Florida are reached via Globus and GridFTP]
33
Evaluation of Information gathered from grid monitoring systems
[Charts: turnaround time (seconds) versus parameter value for three monitored parameters: AvgJobDelay, queue_length (site rating value), and cluster_load]

Correlation index with turnaround time:
Queue length        -0.05818
Cluster load        -0.20775
Average Job Delay    0.892542
34
Limitation of Existing Monitoring Systems for the Grid
• Information aggregated across multiple users is not very useful for effective resource allocation.
• An end-to-end parameter such as Average Job Delay, the average queuing delay experienced by a job of a given user at an execution site, is a better estimate for comparing resource availability and response time for that user.
• It is also not very sensitive to monitoring latencies.
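As a hedged illustration (the record fields and numbers are hypothetical, not Sphinx's schema), Average Job Delay can be estimated from recent job records as the mean start-minus-submit time per (user, site) pair and then used to rank sites for that user:

```python
from collections import defaultdict

# Hypothetical job records: (user, site, submit_time, start_time) in seconds.
job_log = [
    ("alice", "ufl",      0,  40),
    ("alice", "ufl",     10, 130),
    ("alice", "caltech",  5,  25),
    ("alice", "caltech", 20,  60),
    ("bob",   "ufl",      0, 400),  # another user's jobs do not affect alice's estimate
]

def average_job_delay(log):
    """Average queuing delay seen by each user at each site."""
    total, count = defaultdict(float), defaultdict(int)
    for user, site, submit, start in log:
        total[(user, site)] += start - submit
        count[(user, site)] += 1
    return {key: total[key] / count[key] for key in total}

delays = average_job_delay(job_log)
# Rank sites for a given user by their end-to-end delay estimate.
alice_sites = sorted((d, s) for (u, s), d in delays.items() if u == "alice")
print(delays)
print("best site for alice:", alice_sites[0][1])
```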
35
Effective DAG Scheduling
[Chart: average DAG completion time (120 DAGs x 10 jobs/DAG), in seconds, for the Completion time based, Queue length based, # of CPUs based, and Round robin algorithms (y-axis 4500-7000 s)]
The completion time based algorithm here uses the Average Job Delay parameter for scheduling
As seen in the adjoining figure, it outperforms the algorithms tested with other monitored parameters.
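A minimal sketch of that idea (not the Sphinx implementation; the smoothing factor and numbers are assumptions): predict each site's completion time as the user's Average Job Delay estimate plus the job's estimated run time, pick the minimum, and feed the observed delay back into the estimate.

```python
# Illustrative completion-time-based site selection with feedback
# (parameter names and numbers are hypothetical, not Sphinx internals).
sites = {"ufl": 80.0, "caltech": 30.0, "iowa": 55.0}  # current AvgJobDelay estimates (s)

def pick_site(estimated_runtime):
    """Choose the site with the smallest predicted completion time."""
    return min(sites, key=lambda s: sites[s] + estimated_runtime)

def feedback(site, observed_delay, alpha=0.3):
    """Fold the observed queuing delay back into the site's estimate."""
    sites[site] = (1 - alpha) * sites[site] + alpha * observed_delay

chosen = pick_site(estimated_runtime=120.0)
print("schedule on:", chosen)           # caltech, given the estimates above
feedback(chosen, observed_delay=200.0)  # e.g. the site was busier than expected
print("updated estimate:", sites[chosen])
```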
36
Work in Progress: Modeling Workflow Cost and developing efficient provisioning algorithms
[Figure: a Directed Acyclic Graph]
1. Developing an objective measure of completion time, integrating the performance and reliability of workflow execution: P(time to complete >= T) <= epsilon
2. Relating this measure to the properties of the longest path of the DAG, based on the mean and uncertainty of the time required by the underlying tasks due to (1) variable time requirements for different parameter values and (2) failures caused by changes in the underlying resources
3. Developing novel scheduling and replication techniques to optimize allocation based on these metrics
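One hedged way to evaluate the objective in point 1 is by simulation. The sketch below assumes (purely for illustration) Gaussian task durations and a single retry on failure, and estimates P(time to complete >= T) for a small DAG by Monte Carlo sampling of the longest path.

```python
import random

# Illustrative tasks: (mean time, uncertainty as std dev, failure probability).
tasks = {
    "a": (50.0, 10.0, 0.05),
    "b": (30.0,  5.0, 0.02),
    "c": (40.0,  8.0, 0.05),
}
edges = {"a": ["b", "c"], "b": [], "c": []}   # a -> b, a -> c
parents = {t: [p for p, cs in edges.items() if t in cs] for t in tasks}

def sample_makespan():
    finish = {}
    def f(t):
        if t not in finish:
            mean, sd, p_fail = tasks[t]
            dur = max(0.0, random.gauss(mean, sd))
            if random.random() < p_fail:       # simple model: one retry on failure
                dur += max(0.0, random.gauss(mean, sd))
            start = max((f(p) for p in parents[t]), default=0.0)
            finish[t] = start + dur
        return finish[t]
    return max(f(t) for t in tasks)

def p_miss_deadline(T, trials=20000):
    """Monte Carlo estimate of P(completion time >= T)."""
    return sum(sample_makespan() >= T for _ in range(trials)) / trials

if __name__ == "__main__":
    T = 110.0
    print("P(time to complete >= %.0f) ~= %.3f" % (T, p_miss_deadline(T)))
```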
37
Work in Progress: Provisioning algorithms for multiple workflows (Yield Management)
• Quality of Service guarantees for each workflow
• Controlled (a cluster of multi-core processors) versus uncontrolled (a grid of multiple clusters owned by multiple units) environments
[Figure: multiple workflows (Dag 1 through Dag 5), each with tasks at Levels 1-4, competing for the same resources]
38
CHEPREO - Grid Education and Networking
• E/O Center in the Miami area
• Tutorial for Large-Scale Application Development
39
Grid Education
• Developing a Grid tutorial as part of CHEPREO: Grid basics, components of a Grid, Grid Services, OGSA, ...
• OSG summer workshop, South Padre Island, Texas, July 11-15, 2005 (http://osg.ivdgl.org/twiki/bin/view/SummerGridWorkshop/)
• Lectures and hands-on sessions on building and maintaining a Grid
40
Acknowledgements
• CHEPREO project, NSF
• GriPhyN/iVDgL, NSF
• Data Mining Middleware, NSF
• Intel Corporation
41
Thank You
May the Force be with you!
42
Additional slides
43
Effect of latency on Average Job Delay
[Charts: turnaround time (seconds) versus AvgJobDelay parameter value with added latencies of 10 minutes and 5 minutes]
Latency is simulated in the system by deliberately retrieving old values of the parameter while making scheduling decisions.
The correlation indices with added latencies are comparable to, though lower than (as expected), those of the "un-delayed" Average Job Delay parameter; the correlation remains quite high.
Average Job Delay correlation index with turnaround time:

                  Added latency = 5 minutes   Added latency = 10 minutes
Site rank         0.688959                    0.754222
Raw value         0.582685                    0.777754
Learning period   29 jobs                     48 jobs
44
SPHINX Scheduling Latency
[Chart: average scheduling latency (seconds) for 20, 40, 80, and 100 DAGs at job arrival rates from 0.5 to 17 jobs per minute]
45
Demonstration at the Supercomputing Conference: Distributed Data Analysis in a Grid Environment
[Figure: a ROOT graphical user interface for data analysis connects through Clarens grid-enabled web services to the Chimera virtual data service, the Sphinx grid scheduling service, the VDT client grid-enabled execution service, the VDT server grid resource management service, the MonALISA grid resource monitoring system, and the RLS replica location service]
The architecture has been implemented and demonstrated at SC03 (Phoenix, Arizona, 2003) and SC04.
46
Scheduling DAGs: Dynamic Critical Path Algorithm
The DCP algorithm executes the following steps iteratively:
1. Compute the earliest possible start time (AEST) and the latest possible start time (ALST) for all tasks on each processor.
2. Select a task which has the smallest difference between its ALST and AEST and has no unscheduled parent task. If there are tasks with the same differences, select the one with a smaller AEST.
3. Select a processor which gives the earliest start time for the selected task
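A compact Python sketch of the selection rule (the DAG and task times are made up, and per-processor start times and communication costs are omitted for brevity, so this illustrates steps 1-2 rather than the full DCP algorithm):

```python
# Illustrative DCP-style task selection on a tiny, made-up DAG.
time = {"a": 4, "b": 3, "c": 2, "d": 5}
children = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
parents = {t: [p for p in time if t in children[p]] for t in time}

def aest(t):
    """Absolute earliest start time: all parents have finished."""
    return max((aest(p) + time[p] for p in parents[t]), default=0)

def alst(t, cp_len):
    """Absolute latest start time that still fits the critical path."""
    if not children[t]:
        return cp_len - time[t]
    return min(alst(c, cp_len) for c in children[t]) - time[t]

cp_len = max(aest(t) + time[t] for t in time)   # current critical-path length
scheduled = set()

while len(scheduled) < len(time):
    ready = [t for t in time if t not in scheduled
             and all(p in scheduled for p in parents[t])]
    # Smallest mobility (ALST - AEST) first; break ties on smaller AEST.
    pick = min(ready, key=lambda t: (alst(t, cp_len) - aest(t), aest(t)))
    scheduled.add(pick)
    print("schedule", pick, "AEST", aest(pick), "ALST", alst(pick, cp_len))
```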
47
Scheduling DAGs: ILP, a novel algorithm to support heterogeneity (work supported by Intel Corporation)
There are two novel features:
• Assign multiple independent tasks simultaneously: the cost of an assigned task depends on the processor available, and many tasks commence with a small difference in start time.
• Iteratively refine the schedule: the scheduling is refined using the cost of the critical path based on the assignment made in the previous iteration.
[Figure: a Directed Acyclic Graph]
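A rough sketch of the first feature under stated assumptions (hypothetical per-processor costs; this is an illustration of the described strategy, not the actual ILP implementation): in each round, every ready independent task is mapped to the processor that currently gives it the earliest finish time; between rounds the critical path would be re-estimated from the costs actually assigned.

```python
# Illustrative heterogeneous assignment round (hypothetical costs).
cost = {            # cost[task][proc]: heterogeneous run times
    "t1": [4, 6, 9],
    "t2": [7, 3, 5],
    "t3": [5, 5, 2],
}
proc_free = [0.0, 0.0, 0.0]   # when each processor next becomes available

def assign_round(ready):
    """Map each ready (independent) task to the processor giving the
    earliest finish time, updating processor availability as we go."""
    placement = {}
    for task in ready:
        finishes = [proc_free[p] + cost[task][p] for p in range(len(proc_free))]
        p = min(range(len(proc_free)), key=lambda i: finishes[i])
        proc_free[p] = finishes[p]
        placement[task] = (p, proc_free[p])
    return placement

print(assign_round(["t1", "t2", "t3"]))
# Between rounds, the algorithm re-estimates the critical path using the
# assigned costs and uses it to order the next set of ready tasks.
```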
48
Comparison of different algorithms
[Charts: (left) scheduling length for ICP (Th=3, 5, 7), DCP, and HEFT with 2000 tasks on 30 processors (y-axis 10600-10900); (right) "Best" score (0-100) for ICP (Th=5), DCP, and HEFT as the number of tasks grows from 1000 to 4000 on 30 processors]
49
Time for Scheduling
[Charts: scheduling time for ICP (Th=3, 5, 7), DCP, and HEFT, (left) versus number of tasks (1000-4000) and (right) versus number of processors (10-80)]