Computing and Brokering
Grid Middleware 5
David Groep, lecture series 2005-2006
Grid Middleware V 2
Outline
Classes of computing services
  MPP, SHMEM, clusters with high-speed interconnect, conveniently parallel jobs
  through the hourglass: basic functionalities
Representing computing services
  resource availability, RunTimeEnvironment
  software installation and ESIA
  jobs as resources, or …?
Brokering
  brokering models: central view, per-user broker, 'neighbourhood' P2P brokering
  job farming and DAGs: Condor-G, gLite WMS, Nimrod-G, DAGMan
  resource selection: ERT, freeCPUs, …?
  prediction techniques and challenges
  colocating jobs and data, input & output sandboxes, LogicalFiles
Specialties
  supporting interactivity
Computing Service
resource variability and the hourglass model
Grid Middleware V 4
The Famous Hourglass Model
Grid Middleware V 5
Types of systems
Very different models and pricing; suitability depends on the application
  shared memory MPP systems, vector systems
  cluster computing with high-speed interconnect
    can perform like MPP, except for the single memory image
    e.g. Myrinet, Infiniband
  coarse-grained compute clusters
    'conveniently parallel' applications without IPC
    can be built of commodity components
  specialty systems
    visualisation, systems with dedicated co-processors, …
Grid Middleware V 6
Quick, cheap, or both: how to run an app?
Task: how to run your application the fastest, or the most cost-effective (this argument usually wins)
Two choices to speed up an application:
  use the fastest processor available
    but this gives only a small factor over modest (PC) processors
  use many processors, doing many tasks in parallel
    and since quite fast processors are inexpensive, we can think of using very many processors in parallel
    but the problem must first be decomposed
“fast, cheap, good – pick any two”
Grid Middleware V 7
High Performance – or – High Throughput?
Key question: the maximum granularity of decomposition —
have you got one big problem or a bunch of little ones?
To what extent can the "problem" be decomposed into sort-of-independent parts ('grains') that can all be processed in parallel?
Granularity
  fine-grained parallelism – the independent bits are small, need to exchange information, synchronize often
  coarse-grained – the problem can be decomposed into large chunks that can be processed independently
Practical limits on the degree of parallelism
  how many grains can be processed in parallel? degree of parallelism v. grain size
  grain size is limited by the efficiency of the system at synchronising grains
Grid Middleware V 8
High Performance – v. – High Throughput?
fine-grained problems need a high performance system
  that enables rapid synchronization between the bits that can be processed in parallel
  and runs the bits that are difficult to parallelize as fast as possible
coarse-grained problems can use a high throughput system
  that maximizes the number of parts processed per minute
High Throughput Systems: a large number of inexpensive processors, inexpensively interconnected
High Performance Systems: a smaller number of more expensive processors, expensively interconnected
Grid Middleware V 9
High Performance – v. – High Throughput?
There is nothing fundamental here – it is just a question of financial trade-offs like:
  how much more expensive is a "fast" computer than a bunch of slower ones?
  how much is it worth to get the answer more quickly?
  how much investment is necessary to improve the degree of parallelization of the algorithm?
But the target is moving –
  since the cost chasm first opened between fast and slower computers 12-15 years ago, an enormous effort has gone into finding parallelism in "big" problems
  inexorably decreasing computer costs and deregulation of the wide-area network infrastructure have opened the door to ever larger computing facilities – clusters, fabrics, (inter)national grids – demanding ever-greater degrees of parallelism
Grid Middleware V 10
But the fact is:
Graphic: Networks of Workstations (UC Berkeley), IEEE Micro, Feb 1995; Thomas E. Anderson, David E. Culler, David A. Patterson
‘the food chain has been reversed’, and supercomputer vendors are struggling to make a living.
Grid Middleware V 11
Using these systems
As clusters and capability systems are both 'expensive' (i.e. not on your desktop), they are resources that need to be scheduled
  the interface for scheduled access is a batch queue (see the command sketch below)
    job submit, cancel, status, suspend
    sometimes: checkpoint-restart in the OS, e.g. on SGI IRIX
    allocate #processors (and amount of memory – these may be linked!) as part of the job request
  systems usually also have a smaller interactive partition
    not intended for running production jobs …
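In Torque/PBS terms, this scheduled-access interface boils down to a handful of commands – a sketch, with placeholder job script, queue and job id:

qsub -q qlong -l nodes=4:ppn=2 -l vmem=1gb job.sh   # submit, allocating processors and memory
qstat -u $USER                                      # status of your own jobs
qhold 12345.headnode.example.org                    # hold (and, where supported, checkpoint)
qrls  12345.headnode.example.org                    # release
qdel  12345.headnode.example.org                    # cancel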
Grid Middleware V 12
Cluster batch system model
Grid Middleware V 13
Some batch systems
Batch systems and schedulers
  Torque (OpenPBS, PBS Pro)
  Sun Grid Engine (that's not a Grid)
  Condor
  LoadLeveler
  Load Sharing Facility (LSF)
Dedicated schedulers: MAUI
  can drive scheduling for Torque/PBS, SGE, LSF, …
  supports advanced scheduling features, like: reservation, fair-shares, accounts/banking, QoS
The head node or UI system can usually be used for test jobs
Grid Middleware V 14
Torque/PBS job description
# PBS batch job script
#PBS -l walltime=36:00:00
#PBS -l cput=30:00:00
#PBS -l vmem=1gb
#PBS -q qlong
# Executing user job
UTCDATE=`date -u '+%Y%m%d%H%M%SZ'`
echo "Execution started on $UTCDATE"
echo "*****"
printenv
date
sleep 3
date
id
hostname
Grid Middleware V 15
PBS queue
bosui:tmp:1010$ qstat -an1|head -10
tbn20.nikhef.nl:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
823302.tbn20.nikhef. biome034 qlong STDIN 20253 1 -- -- 60:00 R 20:58 node15-11
824289.tbn20.nikhef. biome034 qlong STDIN 6775 1 -- -- 60:00 R 15:25 node15-5
824372.tbn20.nikhef. biome034 qlong STDIN 10495 1 -- -- 60:00 R 15:10 node16-21
824373.tbn20.nikhef. biome034 qlong STDIN 3422 1 -- -- 60:00 R 14:40 node16-32
...
827388.tbn20.nikhef. lhcb031 qlong STDIN -- 1 -- -- 60:00 Q -- --
827389.tbn20.nikhef. lhcb031 qlong STDIN -- 1 -- -- 60:00 Q -- --
827390.tbn20.nikhef. lhcb002 qlong STDIN -- 1 -- -- 60:00 Q -- --
Grid Middleware V 16
Example: Condor – clusters of idle workstations
[Diagram: a Condor pool. The Central Manager runs the master, collector, negotiator, schedd and startd daemons; each Desktop machine runs a master, schedd and startd; dedicated Cluster Nodes run only a master and startd. Arrows mark ClassAd communication pathways and spawned processes.]
The Condor Project, Miron Livny et al. University of Wisconsin, Madison. See http://www.cs.wisc.edu/condor/
Grid Middleware V 17
Condor example
Write a submit file:
  Executable = dowork
  Input      = dowork.in
  Output     = dowork.out
  Arguments  = 1 alpha beta
  Universe   = vanilla
  Log        = dowork.log
  Queue
Give it to Condor:  condor_submit <submit-file>
Watch it run:  condor_q
Files: on a shared filesystem (in a cluster at least; for other options see later)
From: Alan Roy, IO Access in Condor and Grid, UW Madison. See http://www.cs.wisc.edu/condor/
Grid Middleware V 18
Matching jobs to resources
For 'homogeneous' clusters: mainly policy-based
  FIFO
  credential-based policy
  fair-share
  queue wait time
  banks & accounts
  QoS specific
For heterogeneous clusters (like Condor pools): matchmaking based on resource & job characteristics
  see later in grid matchmaking
Grid Middleware V 19
Example: scheduling policies - MAUI
RMTYPE[0]              PBS
RMHOST[0]              tbn20.nikhef.nl
...
NODEACCESSPOLICY       SHARED
NODEAVAILABILITYPOLICY DEDICATED:PROCS
NODELOADPOLICY         ADJUSTPROCS
FEATUREPROCSPEEDHEADER xps
BACKFILLPOLICY         ON
BACKFILLTYPE           FIRSTFIT
NODEALLOCATIONPOLICY   FASTEST
FSPOLICY               DEDICATEDPES
FSDEPTH                24
FSINTERVAL             24:00:00
FSDECAY                0.99
GROUPCFG[users]    FSTARGET=1  PRIORITY=10   MAXPROC=50
GROUPCFG[dteam]    FSTARGET=2  PRIORITY=5000 MAXPROC=32
GROUPCFG[alice]    FSTARGET=9  PRIORITY=100  MAXPROC=200 QDEF=lhcalice
GROUPCFG[alicesgm] FSTARGET=1  PRIORITY=100  MAXPROC=200 QDEF=lhcalice
GROUPCFG[atlas]    FSTARGET=54 PRIORITY=100  MAXPROC=200 QDEF=lhcatlas
QOSCFG[lhccms]     FSTARGET=1- MAXPROC=10
MAUI is an open source product from ClusterResources, Inc. http://www.supercluster.org/
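Whether such a configuration behaves as intended can be checked with MAUI's own client tools – a sketch, assuming a standard MAUI installation:

showq          # the scheduler's view of running, idle and blocked jobs
diagnose -f    # fair-share targets and actual usage per group/user
showres        # active and standing reservations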
Grid Interface to Computing
Grid Middleware V 21
Grid Interfaces to the compute services
Need a common interface for job management
  for test jobs in 'interactive' mode: fork
    like the interactive partition in clusters and supercomputers
  batch system interface:
    executable, arguments, #processors, memory, environment, stdin/out/err
Note: the batch system usually doesn't manage local file space
  it assumes the executable is 'just there', because of a shared FS or JIT copying of the files to the worker node in the job prologue
  local file space management needs to be exposed as part of the grid service, and then implemented separately
Grid Middleware V 22
Expectations?
What can a user expect from a compute service? Different user scenarios are all valid:
  paratrooper mode: come in, take all your equipment (files, executable &c) with you, do your thing and go away
    you're supposed to clean up, but the system will likely do that for you if you forget; in all cases, garbage left behind is likely to be removed
  two-stage 'prepare' and 'run'
    extra services to pre-install an environment and later request it
    see later on such Community Software Area services
  don't think, but just do it
    blindly assume the grid is like your local system, expect all software to be there, expect your results to be retained indefinitely …
    the realism of this scenario is quite low for 'production' grids, as it does not scale to larger numbers of users
Grid Middleware V 23
Basic Operations
Direct run/submit
  useless unless you have an environment already set up
Cancel, Signal, Suspend, Resume, List jobs/status
Purge (remove garbage)
  retrieve output first …
Other useful functions (see the command sketch below)
  Assess submission (eligibility, ERT)
  Register & Start (needed if you have sandboxes)
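For the pre-WS Globus tools discussed below, most of these basic operations map onto dedicated client commands – a sketch, with a placeholder contact string:

globus-job-submit ce.example.org/jobmanager-pbs /bin/hostname   # submit; prints a job contact URL
globus-job-status <job-contact>       # list status
globus-job-cancel <job-contact>       # cancel
globus-job-get-output <job-contact>   # retrieve stdout/stderr
globus-job-clean <job-contact>        # purge what is left behind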
Grid Middleware V 24
A job submission diagram for a single CE
diagram from: DJRA1.1 EGEE Middleware Architecture
Example explicit interactions
Grid Middleware V 25
WS-GRAM: Job management using WS-RF
same functionality, modelled with jobs represented as resources
for the input sandbox it leverages an existing (GT4) data movement service – exploit re-usable components (a client-side sketch follows below)
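On the client side, a WS-GRAM submission looks roughly like this (a sketch; the factory address is a placeholder and option details vary between GT4 releases):

# submit a simple command to the Managed Job Factory of a remote CE
globusrun-ws -submit -F https://ce.example.org:8443/wsrf/services/ManagedJobFactoryService -c /bin/hostname

# or submit a job description document (which may include staging directives handled by RFT)
globusrun-ws -submit -f myjob.xml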
Grid Middleware V 26
[Diagram: GT4 WS GRAM architecture. The client delegates credentials and invokes job functions on the GRAM services hosted in the GT4 Java container; a sudo-based GRAM adapter performs local job control via the local scheduler on the compute element, with the SEG reporting job events back. Transfer requests go to the Delegation and RFT File Transfer services, which drive GridFTP (FTP control and data) on the service host(s) and remote storage element(s) for the user job's files.]
diagram from: Carl Kesselman, ISI, ISOC/GFNL masterclass 2006
Grid Middleware V 27
GT2 GRAM
Informational & historical: so don’t blame the current Globus for this …
single job submission flow chart
Grid Middleware V 28
GRAM GT2 Protocol
RSL over http-g
  targeted at a single specific resource
http-g is like https
  a modified protocol (differing by one byte) to specify delegation
  no longer interoperable with standard https
  delegation implicit in job submission
RSL: Resource Specification Language
  used in the GRAM protocol to describe the job
  requires some (detailed) knowledge about the target system
Grid Middleware V 29
GT2 RSL
&(executable="/bin/echo")
(arguments="12345")
(stdout=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stdout anExtraTag)
(stderr=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stderr anExtraTag)
(queue=qshort)
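Such an RSL string is handed to the gatekeeper with globusrun, naming the target resource explicitly – a sketch, with placeholder host and the queue from above:

globusrun -o -r ce.example.org/jobmanager-pbs \
  '&(executable="/bin/echo")(arguments="12345")(queue=qshort)'
# -o streams stdout/stderr back via GASS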
Grid Middleware V 30
GT2 Job Manager interface
One job manager per running or queued job
  provides the control interface: cancel, suspend, status
  GASS 'Global Access to Secondary Storage':
    stdin, stdout, stderr
    selected input/output files
  listens on a specific TCP port on the Gatekeeper host
Some issues
  the protocol does not provide two-phase commit
    no way to know if the job really made it
  too many open ports
  one process for each queued job, i.e. too many processes
Workaround: don't submit a job, but instead a grid-manager process
Grid Middleware V 31
Performance ?
Time to submit a basic GRAM job
  pre-WS GRAM: < 1 second
  WS GRAM (in Java): 2 seconds
  so GT2-style GRAM did have one significant advantage …
Concurrent jobs
  pre-WS GRAM: 300 jobs
  WS GRAM: 32,000 jobs
Grid Middleware V 32
Scaling scheduling
1. load on the CE head node per VO cannot be controlled with a single common job manager
   with many VOs one might need to resolve inter-VO resource contention; different VOs may want different policies
2. make the CE 'pluggable'
3. and provide a common CE interface, irrespective of the site-specific job submission mechanism
   as long as the site supports a 'fork' JM
Grid Middleware V 33
gLite job submission model
[Diagram: gLite job submission model at a site – one grid CE/CEMon per VO or user.]
Grid Middleware V 34
Unicore CE
A different design and concept: consumes JSDL (a GGF standard) as its job description
  describes job requirements in detail
security model cannot support dynamic VOs yet
  grid-wide coordinated UID space (or shared group accounts for all grid users)
  no VO management tools (DEISA added a directory for that)
  intra-site communication not secured
one big plus: job management uses only 1 port for all communications (including file transfer), and is thus firewall-friendly
Grid Middleware V 35
Unicore CE Architecture
[Diagram: UNICORE architecture. The UNICOREPro client (with job preparation/control plugins and the user certificate) and the Arcon client toolkit send AJO/UPL over the unsafe Internet (SSL) to the UNICORE Gateway, which performs user authentication against the UNICORE site list, optionally behind firewalls. Inside the safe intranet (TCP), the Network Job Supervisor (NJS) does user mapping (UUDB), job incarnation (IDB) and job scheduling, passing incarnated jobs and commands to a Target System Interface (TSI) in front of the batch subsystem or any cluster-management system; jobs and data can also be transferred to other UNICORE sites.]
Graphic from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003
Grid Middleware V 36
Unicore programming model
Abstract Job Object (AJO)
  collection of classes representing Grid functions
  encoded as Java objects (XML encoding possible)
Where to build AJOs
  Pallas client GUI – the user's view
  client plugins – grid deployer
  Arcon client toolkit – hard core
What can't the AJO do
  application-level meta-computing ???
from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003
Grid Middleware V 37
Interfacing to the local system
Incarnation Data Base (IDB)
  maps the abstract representation to concrete jobs
  includes the resource description
  prototype auto-generation from MDS
Target System Interface (TSI)
  Perl interface to the host platform
  very small system-specific module for easy porting
  current: NQS (several versions), PBS, LoadLeveler, UNICOS, Linux, Solaris, MacOS X, PlayStation-2
  Condor: under development (& probably done by now)
from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003
Resource Representation
CE attributes, obtaining metrics, the GLUE CE
Grid Middleware V 39
Describing a CE
Balance between completeness and timeliness
  some useful metrics are almost impossible to obtain
  'when will this job of mine be finished if I submit now?' cannot be answered!
    depends on system load
    need to predict the runtime of already running & queued jobs
    simultaneous submission in a non-FIFO scheduling model (e.g. fair share, priorities, pre-emption &c)
Grid Middleware V 40
GlueCE: a ‘resource description’ viewpoint
From: the GLUE Information Model version 1.2, see document for details
Grid Middleware V 41
Through the Glue Schema: Cluster Info
Performance info: SI2k, SF2k
Max wall time, CPU time: seconds
  together these determine if a job completes in time
but clusters are not homogeneous
  solve at the local end (scale max{CPU,wall} time on each node to the system speed)
    CAVEAT: when doing cross-cluster grid-wide scheduling, this can make you choose the wrong resource entirely!
  solve (i.e. multiply) at the broker end
    but now you need a way to determine on which subcluster your job will run … oops.
  (a query sketch for these attributes follows below)
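These values can be inspected directly in the information system; with the common LDAP-based (BDII) deployment, a query looks like this sketch (the BDII host is a placeholder; port and base DN follow the usual Glue/BDII conventions):

ldapsearch -x -H ldap://bdii.example.org:2170 -b mds-vo-name=local,o=grid \
  '(objectClass=GlueCE)' \
  GlueCEUniqueID GlueCEPolicyMaxWallClockTime GlueCEPolicyMaxCPUTime GlueCEStateFreeCPUs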
Grid Middleware V 42
Cluster Info: total, free and max JobSlots
FreeJobSlots is the wrong metric to use for scheduling (a good cluster is always 100% full)
  these metrics may be VO, user and job dependent
  if a cluster has free CPUs, that does not mean that you can use them …
  and even if there are thousands of waiting jobs, you might still get to the front of the queue because of your priority or fair-share
Grid Middleware V 43
Cluster info: ERT and WRT
Estimated/worst-case response time: when will my job start to run if I submit now?
  impossible to pre-determine in case of simultaneous submissions
  the best one can do is estimate
Possible approaches
  simulation – good, but very, very slow
    "Predicting Job Start Times on Clusters", Hui Li et al., 2004
  historical comparisons
    template approach – need to discover the proper template; look for 'similar system states' in the past
    learning approach – adapt the estimation algorithm to the actual load and 'learn' the best approach
  see the many other papers by Hui Li; bundle on Blackboard!
Brokering
Grid Middleware V 45
Brokering models
All current grid broker systems use global brokering
  consider all known resources when matching requests
  brokering takes longer as the system grows
Models
  bubble-to-the-top information-system based
    current Condor-G, gLite WMS
  ask the world for bids
    Unicore broker
Grid Middleware V 46
Some grid brokers
Condor-G
  uses the Condor schedd (matchmaker) to match resources
  a Condor submitter has a number of backends to talk to different CEs (GT2, GT4 GRAM, Condor (flocking))
  supports DAG workflows
  the schedd is 'close' to the user
gLite WMS
  separation between the broker (based on Condor-G) and the UI
  additional Logging and Bookkeeping (generic, but actually only used for the WMS)
  does job-data co-location scheduling
Grid Middleware V 47
Grid brokers (contd.)
Nimrod-G
  parameter sweep engine
  cycles through a static list of resources
  automatically inspects the job output and uses that to drive automatic job submission
  minimisation methods like simulated annealing built in
Unicore broker
  based on a pricing model: asks for bids from resources
  no large information system full of useless resources is needed; instead bids are requested from all resources for every job
  shifts, but does nothing to resolve, the info-system explosion
Grid Middleware V 48
Alternative brokering
Alternatives could be 'P2P-style' brokering
  look in the 'neighbourhood' for 'reasonable' matches; if none found, give the task to a peer super-scheduler
  a scheduler only considers 'close' resources (it has no global knowledge)
  the job submission pattern may or may not follow the brokering pattern
    if it does, it needs recursive delegation for job submission, which opens the door for worms and trojans – trust is not very transitive
    (this is not a problem when sharing 'public' files, as in the popular P2P file-sharing applications)
Grid Middleware V 49
Broker detailed example: gLite WMS
Job services in the gLite architecture
  Computing Element (just discussed)
  Workload Management System (brokering, submission control)
  Accounting (for EGEE comes in two flavours: site or user)
  Job Provenance (to be done)
  Package management (to be done)
Continuous matchmaking solution
  persistent list of pending jobs, waiting for matching resources
  the WMS task is akin to what the resources did in Unicore
(The WMS and JDL slides in this part are from the EGEE Project, INFSO-RI-508833; see www.eu-egee.org and www.glite.org.)
WMS Architecture Overview
[Diagram: the UI submits jobs to the Network Server of the Resource Broker node (Workload Manager, WM); the Match Maker consults the Information System, Replica Catalog and – in gLite – the Information Supermarket and Task Queue; the Job Adapter and Job Controller/CondorG handle submission to the Grid Interface of the Computing Element with its LRMS; job status is recorded in Logging & Bookkeeping, and the Storage Element feeds data co-location. The LCG and gLite variants are shown side by side.]
WMS’s Architecture
WMS’s Architecture
Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL)
WMS’s Architecture
Keeps submission requests. Requests are kept for a while if no matching resources are available.
WMS’s Architecture
Repository of resource information available to the matchmaker. Updated via notifications and/or active polling on sources.
WMS’s Architecture
Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources.
WMS’s Architecture
Performs the actual job submission and monitoring.
The Information Supermarket
• The ISM represents one of the most notable improvements in the WM over what was inherited from the EU DataGrid (EDG) project
  – decoupling between the collection of information concerning resources and its use allows flexible application of different policies
• The ISM basically consists of a repository of resource information that is available in read-only mode to the matchmaking engine
  – the update is the result of the arrival of notifications, active polling of resources, or some arbitrary combination of both
  – it can be configured so that certain notifications can trigger the matchmaking engine
    improves the modularity of the software
    supports the implementation of lazy scheduling policies
The Task Queue
• The Task Queue represents the second most notable improvement in the WM internal design
  – possibility to keep a submission request for a while if no resources are immediately available that match the job requirements
    a technique used by the AliEn and Condor systems
• Non-matching requests
  – will be retried either periodically (eager scheduling approach)
  – or as soon as notifications of available resources appear in the ISM (lazy scheduling approach)
Job Logging & Bookkeeping
• L&B tracks jobs in terms of events – important points of job life (submission, finding a matching CE, starting execution, etc.) – gathered from the various WMS components
• The events are passed to a physically close component of the L&B infrastructure (the locallogger), to avoid network problems
  – it stores them in a local disk file and takes over the responsibility to deliver them further
• The destination of an event is one of the bookkeeping servers, assigned statically to a job upon its submission
  – it processes the incoming events to give a higher-level view of the job states (Submitted, Running, Done)
  – various attributes are recorded (JDL, destination CE name, job exit code)
• Retrieval of both job states and raw events is available via legacy (EDG) and WS querying interfaces
  – a user may also register to receive notifications of particular job state changes
Job Preparation
• Information to be specified when a job has to be submitted:
  – job characteristics
  – job requirements and preferences on the computing resources (also including software dependencies)
  – job data requirements
• Information is specified using a Job Description Language (JDL)
  – based upon Condor's CLASSified ADvertisement language (ClassAd)
  – a fully extensible language
• A ClassAd
  – is constructed with the classad construction operator [ ]
  – is a sequence of attributes separated by semi-colons
  – an attribute is a pair (key, value), where the value can be a Boolean, an Integer, a list of strings, …
      <attribute> = <value>;
Grid Middleware V 62
ClassAds: matchmaking
Brokering based on ‘advertisements’ by both jobs and resources
Grid Middleware V 63
ClassAds matchmaking
Allow customers to provide requirements and preferences on the resources
Allow resources to impose constraints on the customers they wish to service
Separation between matchmaking and claiming
The matchmaker is stateless and thus can scale to very large systems without complex failure recovery (a schematic example follows below)
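To make the model concrete, here is a schematic pair of ClassAds in the classic matchmaking notation, where 'other' refers to the attributes of the candidate match (the attribute names follow common Condor usage, but the ads are illustrative only):

machine ClassAd (advertised by a startd):
  MyType = "Machine";
  Arch = "INTEL"; OpSys = "LINUX"; Memory = 2048;
  Requirements = (other.ImageSize < 1000);

job ClassAd (derived from the submit file):
  MyType = "Job";
  ImageSize = 400;
  Requirements = (other.Arch == "INTEL" && other.OpSys == "LINUX" && other.Memory >= 512);
  Rank = other.Memory;

A match succeeds only when both Requirements expressions evaluate to true against the opposite ad; Rank then orders the acceptable candidates.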
Job Description Language (JDL)
• The supported attributes are grouped into two categories:
  – Job attributes: define the job itself
  – Resources: taken into account by the Workload Manager for carrying out the matchmaking algorithm (to choose the "best" resource where to submit the job)
    Computing resources: used to build the expressions of the Requirements and/or Rank attributes by the user; have to be prefixed with "other."
    Data and storage resources: input data to process, the Storage Element (SE) where to store output data, protocols spoken by the application when accessing SEs
JDL: Relevant Attributes (1)
• JobType
  – Normal (simple, sequential job), DAG, Interactive, MPICH, Checkpointable
• Executable (mandatory)
  – the command name
• Arguments (optional)
  – job command-line arguments
• StdInput, StdOutput, StdError (optional)
  – standard input/output/error of the job
• Environment
  – list of environment settings
• InputSandbox (optional)
  – list of files on the UI's local disk needed by the job for running
  – the listed files will be staged automatically to the remote resource
• OutputSandbox (optional)
  – list of files, generated by the job, which have to be retrieved
JDL: Relevant Attributes (2)
• Requirements
  – job requirements on computing resources
  – specified using attributes of resources published in the Information Service
  – if not specified, the default value defined in the UI configuration file is used
    default: other.GlueCEStateStatus == "Production" (the resource has to be able to accept jobs and dispatch them on WNs)
• Rank
  – expresses preference (how to rank resources that have already met the Requirements expression)
  – specified using attributes of resources published in the Information Service
  – if not specified, the default value defined in the UI configuration file is used
    default: - other.GlueCEStateEstimatedResponseTime (the lowest estimated traversal time)
    default for parallel jobs (see later): other.GlueCEStateFreeCPUs (the highest number of free CPUs)
JDL: Relevant Attributes (3)
• InputData
  – refers to data used as input by the job: these data are published in the Replica Catalog and stored in the Storage Elements
  – LFNs and/or GUIDs
• InputSandbox
  – executable, files etc. that are sent to the job
• DataAccessProtocol (mandatory if InputData has been specified)
  – the protocol, or list of protocols, which the application is able to speak for accessing InputData on a given Storage Element
• OutputSE
  – the Uniform Resource Identifier of the output Storage Element
  – the RB uses it to choose a Computing Element that is compatible with the job and is close to the Storage Element
Details in Data Management lecture
Example of JDL File
[
JobType = "Normal";
Executable = "gridTest";
StdError = "stderr.log";
StdOutput = "stdout.log";
InputSandbox = {"/home/mydir/test/gridTest"};
OutputSandbox = {"stderr.log", "stdout.log"};
InputData = {"lfn:/glite/myvo/mylfn"};
DataAccessProtocol = "gridftp";
Requirements = other.GlueHostOperatingSystemName == "LINUX"
  && other.GlueCEStateFreeCPUs >= 4;
Rank = other.GlueCEPolicyMaxCPUTime;
]
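Saved as, say, example.jdl (a hypothetical file name), this description is handed to the WMS with the client commands covered at the end of this part:

glite-job-submit -o jobids example.jdl
glite-job-status -i jobids
glite-job-output -i jobids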
Jobs State Machine (1/9)
Submitted: the job has been entered by the user at the User Interface but not yet transferred to the Network Server for processing
Jobs State Machine (2/9)
Waiting: job accepted by the NS and waiting for Workload Manager processing, or being processed by WMHelper modules.
Jobs State Machine (3/9)
Ready: job processed by the WM and its Helper modules (a CE has been found) but not yet transferred to the CE (local batch system queue) via JC and CondorC. This state does not exist for a DAG, as a DAG is not subjected to matchmaking (its nodes are) but passed directly to DAGMan.
Jobs State Machine (4/9)
Scheduled: job waiting in the queue on the CE. This state also does not exist for a DAG, as a DAG is not directly sent to a CE (its nodes are).
Jobs State Machine (5/9)
Running: job is running. For a DAG this means that DAGMan has started processing it.
Jobs State Machine (6/9)
Done: job exited or considered to be in a terminal state by CondorC (e.g., submission to CE has failed in an unrecoverable way).
Jobs State Machine (7/9)
Aborted: job processing was aborted by WMS (waiting in the WM queue or CE for too long, over-use of quotas, expiration of user credentials).
Jobs State Machine (8/9)
Cancelled: job has been successfully canceled on user request.
Jobs State Machine (9/9)
Cleared: the output sandbox was transferred to the user, or removed due to a timeout.
Directed Acyclic Graphs (DAGs)
• A DAG represents a set of jobs:
  nodes = jobs, edges = dependencies

[Diagram: an example DAG with nodes NodeA…NodeE and dependency edges.]
DAG: JDL Structure
• Type = "DAG"  (mandatory)
• VirtualOrganisation = "yourVO"  (mandatory)
• Max_Nodes_Running = int > 0  (optional)
• MyProxyServer = "…"  (optional)
• Requirements = "…"  (optional)
• Rank = "…"  (optional)
• InputSandbox = …  (optional – more later!)
• OutputSandbox = "…"  (optional)
• Nodes = [ nodeX … ]  (mandatory – more later!)
• Dependencies = …  (mandatory – more later!)
Attribute: Nodes
The Nodes attribute is the core of the DAG description:

Nodes = [
  nodefilename1 = [ … ];
  nodefilename2 = [ … ];
  …
  dependencies = …;
]

A node can point to an external JDL file:

nodefilename1 = [ file = "foo.jdl"; ]
nodefilename2 = [ file = "/home/vardizzo/test.jdl"; retry = 2; ]

or carry its description inline:

nodefilename1 = [
  description = [
    JobType = "Normal";
    Executable = "abc.exe";
    Arguments = "1 2 3";
    OutputSandbox = { … };
    InputSandbox = { … };
    …
  ];
  retry = 2;
]
Attribute: Dependencies
• It is a list of lists representing the dependencies between the nodes of the DAG. The attribute is mandatory: a DAG without dependencies still needs dependencies = {};

Nodes = [
  nodefilename1 = [ … ];
  nodefilename2 = [ … ];
  …
  dependencies = …;
]

Example dependency lists:

dependencies = { nodefilename1, nodefilename2 };
dependencies = { { nodefilename1, nodefilename2 }, nodefilename3 };
dependencies = { { { nodefilename1, nodefilename2 }, nodefilename3 }, nodefilename4 };
InputSandbox & Inheritance
• All nodes inherit the values of the attributes specified for the DAG.
• Nodes without any InputSandbox values have to contain an empty list in their description: InputSandbox = { };

Type = "DAG";
VirtualOrganisation = "yourVO";
Max_Nodes_Running = int > 0;
MyProxyServer = "…";
Requirements = "…";
Rank = "…";
InputSandbox = { };
Nodes = [
  nodefilename = [ … ];
  …
  dependencies = …;
];

NodeA = [
  description = [
    JobType = "Normal";
    Executable = "abc.exe";
    OutputSandbox = { "myout.txt" };
    InputSandbox = {
      "/home/vardizzo/myfile.txt",
      root.InputSandbox
    };
  ];
]
Interactive Jobs
• An interactive job is a job whose standard streams are forwarded to the submitting client.
• The DISPLAY environment variable has to be set correctly, because an X window may be opened.

[Diagram: a listener process on the UI connects the job's standard streams on the WN to an X window, or to the console when running without a GUI.]
Interactive Jobs
• Specified by setting JobType = "Interactive" in the JDL
• When an interactive job is executed, a window for the stdin, stdout and stderr streams is opened
  – possibility to send stdin to the job
  – possibility to see the stderr and stdout of the job while it is running
• Possibility to start a window for the standard streams of a previously submitted interactive job with the command glite-job-attach
Interactive Jobs: JDL Structure
• Type = "job";  (mandatory)
• JobType = "interactive";  (mandatory)
• Executable = "…";  (mandatory)
• Argument = "…";  (optional)
• ListenerPort = int > 0;  (optional)
• OutputSandbox = "…";  (optional)
• Requirements = "…";  (mandatory)
• Rank = "…";  (mandatory)

gLite command:
glite-job-attach [options] <jobID>
gLite Commands
• JDL submission: glite-job-submit -o guidfile jobCheck.jdl
• Job status: glite-job-status -i guidfile
• Job output: glite-job-output -i guidfile
• Get latest job state: glite-job-get-chkpt -o statefile -i guidfile
• Submit a JDL from a state: glite-job-submit -chkpt statefile -o guidfile jobCheck.jdl
• For the available [options], run the commands with --help.
Economy-based brokering
Unicore
Grid Middleware V 88
Unicore Broker
Distributed brokering
  sites know the state of their resources best
  sites can conceal their resource configuration
  different VOs need different selection algorithms
    preferred site sets will vary
    different applications have different performance characteristics
Uses an economic model
  cost-based evaluation, like in the real world
The broker was developed by the University of Manchester, UK
Unicore is an open source product coordinated by the Unicore Forum, see www.unicore.org
Grid Middleware V 89
Unicore Broker
graphic from: Brokering in Unicore, John Brooke and Donal Fellows, UoM, Unicore Summit October 2005
Grid Middleware V 90
Job description ontology
graphic from: Brokering in Unicore, John Brooke and Donal Fellows, UoM, Unicore Summit October 2005
Grid Middleware V 91
Unicore Broker hierarchy
graphic from: Brokering in Unicore, John Brooke and Donal Fellows, UoM, Unicore Summit October 2005
Grid Middleware V 92
Unicore Broker in the system
[Diagram: the Unicore broker in the system. A Unicore client (or alternative client) connects through the Unicore Gateway to the Network Job Supervisor, which consults the resource database, user database and an external authorization service; the Resource Broker sits alongside the NJS, and target systems are reached via Condor, NQS or GT back-ends. Multiple firewall layouts are possible.]
UoM Broker Architecture, from: Dave Snelling, Fujitsu Labs Europe, Unicore Technology, Grid School July 2003
Grid Middleware V 93
Unicore Broker
[Diagram: UoM broker architecture. The broker is hosted in the NJS, looking up static resources and configuration in the IDB and verifying delegated identities against the UUDB. It delegates to application-domain expert broker code (e.g. the DWD LM expert or ICM expert) and, for local resource checks, to a Grid-architecture-specific engine: a UnicoreRC talking to the TSI, or a GlobusRC talking to MDS/GRAM. An ontological translator (with a simple translator fallback) maps resource-domain descriptions to the target Globus resource schema; untranslatable resources are passed to the Unicore resource checker. A ticket manager returns signed tickets (contracts), and the caller gets back a set of resource filters plus the set of untranslatable resources. Key: inheritance relations between UNICORE components, the EUROGRID broker, Globus components and the GRIP broker.]
UoM Broker Architecture, from: Dave Snelling, Fujitsu Labs Europe, Unicore Technology, Grid School July 2003
VO Schedulers
Pilot jobs and overlay networks
Grid Middleware V 95
Towards a multi-scheduler world
Expressing scheduling policies (priorities and usage shares) for multiple complex VOs in a single scheduler is proving difficult
  the resource owner does not want to know about VO-internal structure, but to assign the VO just a single share
  the VO wants to set fine-grained intra-VO shares
  local schedulers (such as MAUI) are not geared towards non-admin-defined policies: there is no 'grid-aware' scheduler
Possible solutions
  develop an interface to manage the local scheduling policies
  stack the schedulers, i.e. introduce a per-VO scheduler
Grid Middleware V 96
traditional job submission models
There are three 'traditional' deployment models:
1. direct per-user job submission to a 'gatekeeper' running with root privileges (GT2GK, today's model)
2. a non-privileged dedicated CE or scheduler, accepting authenticated user jobs and submitting to the batch system
3. an on-demand CE, submitted by the VO or user to a front-end system, which then receives user jobs and submits these to the batch system
In order not to have complex schedulers run as root, a sudo-like component, glexec, is introduced.
[Diagram legend: submitting user's identity & job; VO identity/process or VO placeholder manager; site-managed and trusted services.]
Grid Middleware V 97
What is glexec?
glexec: a thin layer to change Unix credentials, based on grid identity and attribute information
You can think of it as:
  'a replacement for the gatekeeper'
  'a griddy version of Apache's suexec(8)'
  'a program wrapper around LCAS, LCMAPS or GUMS'
Grid Middleware V 98
What glexec does
Input
  1. a certificate chain, possibly with VOMS extensions
  2. a user program name & arguments to run
Action
  1. check authorization (LCAS, GUMS): user credentials, proper VOMS attributes, executable name
  2. acquire local credentials: a local (uid, gid) pair, possibly coordinated across a cluster
  3. enforce the local credential on the process
Result
  the user program is run with the mapped credentials (see the invocation sketch below)
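From the calling (pilot) process this looks roughly as follows – a minimal sketch assuming a typical gLite-era setup; the exact environment variable names, installation path and return codes depend on the glexec version and site configuration:

# tell glexec which credential to map (the payload user's proxy)
export GLEXEC_CLIENT_CERT=/tmp/payload_proxy.pem
export GLEXEC_SOURCE_PROXY=/tmp/payload_proxy.pem

# run the payload under the mapped (uid, gid)
/opt/glite/sbin/glexec /path/to/payload.sh
# a non-zero exit code means authorization or mapping failed,
# and the pilot can discard this payload and fetch another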
Grid Middleware V 99
Jobs submission today (GT2 GK)
Deployment model without glexec ('mode GT2GK')
  jobs are submitted with an identity (hopefully the original user's one) to the site gatekeeper, running as root
  one job manager is run for each user on the head node, with the user's (uid, gid) as set by the gatekeeper
Grid Middleware V 100
Glexec in a one-per-site mode
Deployment model with a CE 'service' running in a non-privileged account, or with a CE run (maybe one per VO) on a single front-end per site
  examples: CREAM, GT4 WS-GRAM
Grid Middleware V 101
glexec with an on-demand CE
Deployment model with on-demand CEs ('mode on-demand CEs')
  the user or the VO start their own scheduler on a front-end system
  all these on-demand schedulers are resource-limited by a site-managed master scheduler (via a GT2GK or Condor)
  the on-demand schedulers eat jobs for their VO or user, and set the proper identity before the job gets submitted to the site batch system
Grid Middleware V 102
glexec with on-demand CE
Deployment model with on-demand CEs (‘mode on-demand for VOs’ with native interface)
Grid Middleware V 103
Traditional model summary
In all three models, the submission of the user job to the batch system is done with the original job owner's mapped (uid, gid) identity
  grid-to-local identity mapping is done only on the front-end system (CE)
  batch system accounting provides per-user records
  inspection of Unix processes on the worker nodes is per-user
Grid Middleware V 104
Pilot jobs
A pilot job is basically just a small script which downloads the real job from a repository once it starts executing; hence it is not committed to any particular task, or perhaps even a particular user, until that point. If there are no tasks waiting, the pilot job exits immediately. In principle, if the time limits on the queue are long enough, a single pilot job could run more than one real job, although I'm not sure if anyone is actually doing that at the moment. (A schematic pilot loop follows below.)
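The core of such a pilot can be sketched in a few lines of shell; the task-queue endpoint and payload handling here are purely hypothetical, and real VO frameworks add authentication, monitoring and error handling:

#!/bin/sh
# keep pulling work until the VO task queue is empty
while true; do
  # ask the VO's job repository for the next payload (hypothetical endpoint)
  curl -s -f -o payload.sh https://taskqueue.example-vo.org/next-job || break
  chmod +x payload.sh
  ./payload.sh        # run the real job in the slot the pilot occupies
done
# no tasks waiting: exit immediately and free the batch slot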
Grid Middleware V 105
From the VO side
Background: some large VOs develop, and prefer to use, their own scheduling & job management framework
  late binding of jobs to job slots
    first establish an overlay network; subsequent scheduling and starting of jobs is faster
  hide the differences between the various grid flavours
  implement VO priorities
  full use of allocated slots, up to the max wall-clock time
But these VOs will need their 'own' scheduler
  some of them have it already, but others don't, and most never will, so the use of pilots should not be the only option (or even the default) way of things
Grid Middleware V 106
Situation today
'VO-type' pilot jobs
  submitted as if they were regular user jobs
  run with the identity of one or a few individuals from a VO
  obtain jobs from any user (within the VO) and run that payload on the allocated WN
  the site 'sees' only a single identity, not the true owner of the workload
No effective mechanisms today can deny this use model
Note that this does not apply to the regular 'per-user' pilot jobs
Grid Middleware V 107
Issues
Issues that drove the original glexec-on-WN scenario:
  VO-supplied pilot jobs must observe and honour the same policies the site uses for normal job execution
    preferably without requiring alternate mechanisms to describe the policies
    and be continuously in sync with the site policies
  again, 'per-user' pilot jobs satisfy these rules by design
Grid Middleware V 108
Pieces of a solution
Three pieces that go together:
glexec on the worker-node deployment
  a mechanism for pilot jobs to submit themselves and their payload to site policy control
  gives incontrovertible evidence of who is running on which node at any one time
    needed at selected sites for regulatory compliance
    ability to nail individual culprits
  by requiring the VO to present a valid delegation from each user
the VO should want this too
  to keep user jobs from interfering with each other
  honouring site ban lists for individuals may help in not banning the entire VO in case of an incident
Grid Middleware V 109
Pieces of the solution
glexec on the worker-node deployment (continued)
A way to hold the pilot job submitters to their word: system-level auditing of the pilot jobs, to see they are not doing the user's job themselves or evading the controls
  relies on advanced auditing features of the OS (from EAL3+)
  but auditing data on the WN is useful for incident investigations only
Internal accounting should be done by the VO
  the regular site accounting mechanisms go via the batch system, and will see the pilot job identity
  the site can easily show from those logs the usage by the pilot job (for which wall-clock-time accounting should be used)
  making a site do accounting based on glexec jobs is non-standard, requires effort, may be intrusive, and messes up normal accounting
  'a VO capable of writing their own submission framework ought to be able to write their own accounting system as well …'
Grid Middleware V 110
glexec on WN deployment model
The VO submits a pilot job to the batch system
  the VO 'pilot job' submitter is responsible for the pilot's behaviour
  this might be a specific role in the VO, or a locally registered 'badged' user at each site
The pilot job is subject to normal site policies for jobs
The pilot job obtains the true user job, and presents the user credentials and the job (executable name) to the site (glexec) to request a decision
[Diagram legend: submitting user's identity & job; VO identity/process or VO placeholder manager; site-managed and trusted services.]
Grid Middleware V 111
VO pilot job on the node
Note: the proper uid change by the Gatekeeper or Condor-C/BLAHP on the head node should remain the default
• on success: the site will set the uid/gid of the new user's job
• on failure: glexec will return with an error, and the pilot job can terminate or obtain another job
Grid Middleware V 112
What is needed in this model?
1. Agreement on the three ingredients
   • deployment of glexec on the WN to do the setuid
   • detailed auditing on the head node and the WNs
   • site accounting done at the VO (i.e. pilot job) level
2. glexec
   • needs feature enhancements compared to the single-CE version
   • see the status of glexec on the next slide
3. Inspection of the audit logs
   • detect abuse patterns in the system-call auditing logs
4. Grid job logging capabilities
   • glexec will log (uid, user/system/real time usage) via syslog
   • the credential mapping framework (LCMAPS) will log the mapping (also via syslog)
   • centralisation of glexec mappings, e.g. via the JobRepository
Grid Middleware V 113
Notes and alternatives
glexec, like any site-managed ingress point, trusts the submitter not to have mixed up the user credentials and the jobs
  we trust the RB today to do this correctly, and RBs are unknown quantities to the receiving site
A longer-term solution is to have the job request signed by the submitting user
  since the description is modified by intermediaries (brokers), the signature can only cover the original content, and the site would have to evaluate whether the job received matches the signed JDL
  or use an inheritance model for the job description, and treat the job like you would, e.g., a CIM entity
Grid Middleware V 114
Summary
Realize that today some VOs are doing 'pilot' jobs
  today there is no effective enforcement against this
  some sites may just not care yet, whilst others have hard requirements on auditability and regulatory compliance
The glexec-on-WN model gives the VOs tools to comply with site requirements
  at least it makes things 'better' than they are today
  but you, as a site, will miss that warm and fuzzy feeling of trust
  glexec-on-WN is always replaceable by the 'null operation' for sites that don't care or want it
    but realize this is just one of the glexec deployment models