Condor Tutorial for Users
INFN-Bologna, 6/29/99
Derek Wright
Computer Sciences Department
University of Wisconsin-Madison
2
Conventions Used In This Presentation
A slide with an all-yellow background is the beginning of a new “chapter”
• The slides after it will describe each entry on the yellow slide in great detail
A Condor tool that users would use will be in red italics
A ClassAd attribute name will be in blue
A UNIX shell command or file name will be in courier font
3
What is Condor?
A system for “High-Throughput Computing”
Lots of jobs over a long period of time, not a short burst of “high-performance”
Condor manages both resources (machines) and resource requests (jobs)
Supports additional features for jobs that are re-linked with Condor libraries:
• checkpointing
• remote system calls
4
What’s Condor Good For?
Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
• Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.
• Condor can handle inter-job dependencies (DAGMan)
5
What’s Condor Good For? (cont’d)
Robustness
• Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion
• If an execute machine crashes, you only lose work done since the last checkpoint
• Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover
6
What’s Condor Good For? (cont’d)
Giving you access to more computing resources
• Checkpointing allows your job to run on “opportunistic resources” (not dedicated)
• Checkpointing also provides “migration” - if a machine is no longer available, move!
• With remote system calls, you don’t even need an account on a machine where your job executes
7
What is a Condor Pool?
“Pool” can be a single machine, or a group of machines
Determined by a “central manager” - the matchmaker and centralized information repository
Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself
8
What Kind of Job Do You Have?
You must know some things about your job to decide if and how it will work with Condor:
• What kind of I/O does it do?
• Does it use TCP/IP? (network sockets)
• Can the job be resumed?
• Is the job multi-process (fork(), pvm_addhost(), etc.)?
9
What Kind of I/O Does Your Job Do?
Interactive TTY
“Batch” TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)
X Windows
NFS, AFS, or another network file system
Local file system
TCP/IP
10
What Does Condor Support?
Condor can support various combinations of these features in different “Universes”
Different Universes provide different functionality for your job:
• Vanilla
• Standard
• Scheduler
• PVM
11
What Does Condor Support?
[Table: Universe feature matrix — columns: Interactive TTY, X windows, NFS/AFS, Local files, TCP/IP, Resume, Multi-process; rows: Vanilla, Standard, Scheduler, PVM]
12
Condor Universes
A Universe specifies a Condor runtime environment:
• STANDARD
– Supports Checkpointing
– Supports Remote System Calls
– Has some limitations (no fork(), socket(), etc.)
• VANILLA
– Any Unix executable (shell scripts, etc.)
– No Condor Checkpointing or Remote I/O
13
Condor Universes (cont’d)
• PVM (Parallel Virtual Machine)
– Allows you to run parallel jobs in Condor (more on this later)
• SCHEDULER
– Special kind of Condor job: the job is run on the submit machine, not a remote execute machine
– Job is automatically restarted if the condor_schedd is shut down
– Used to schedule jobs (e.g. DAGMan)
14
Submitting Jobs to Condor
Choosing a “Universe” for your job (already covered this)
Preparing your job
• Making it “batch-ready”
• Re-linking if checkpointing and remote system calls are desired (condor_compile)
Creating a submit description file
Running condor_submit
• Sends your request to the User Agent (condor_schedd)
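For example, if your submit description file were named my_job.submit (a hypothetical name), submission is a single command, following the % convention used in these slides:

```
% condor_submit my_job.submit
```

condor_submit prints the cluster number it assigned, which you can then use with condor_q and condor_rm.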
15
Preparing Your Job
Making your job “batch-ready”
• Must be able to run in the background: no interactive input, windows, GUI, etc.
• Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
• If your job expects input from the keyboard, you have to put the input you want into a file
16
Preparing Your Job (cont’d)
If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor’s special libraries
To do this, you use condor_compile
• Place “condor_compile” in front of the command you normally use to link your job:
condor_compile gcc -o myjob myjob.c
17
Creating a Submit Description File
A plain ASCII text file
Tells Condor about your job:
• Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)
Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.
18
Example Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = standard
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue
19
Example Submit Description File Described
Submits a single job to the standard universe, specifies files for STDIN, STDOUT and STDERR, creates a UserLog, defines command-line arguments, and specifies the directory the job should be run in
Equivalent to (from outside of Condor):
% cd /home/wright/condor/run_1
% /home/wright/condor/my_job.condor -arg1 -arg2 \
  > my_job.stdout 2> my_job.stderr \
  < my_job.stdin
20
“Clusters” and “Processes”
If your submit file describes multiple jobs, we call this a “cluster”
Each job within a cluster is called a “process” or “proc”
If you only specify one job, you still get a cluster, but it has only one process
A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)
Process numbers always start at 0
21
Example Submit Description File for a Cluster
# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe = standard
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_$(Process)
Queue 500
22
Example Submit Description File for a Cluster - Described
Now, the initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 500” to submit 500 jobs at once
$(Process) will be expanded to the process number for each job in the cluster (from 0 up to 499 in this case), so we’ll have “run_0”, “run_1”, … “run_499” directories
All the input/output files will be in different directories!
23
Running condor_submit
You give condor_submit the name of the submit file you have created
condor_submit parses the file and creates a “ClassAd” that describes your job(s)
Creates the files you specified for STDOUT and STDERR
Sends your job’s ClassAd(s) and executable to the condor_schedd, which stores the job in its queue
24
Monitoring Your Jobs
Using condor_q
Using a “User Log” file
Using condor_status
Using condor_rm
Getting email from Condor
Once they complete, you can use condor_history to examine them
25
Using condor_q
To view the jobs you have submitted, you use condor_q
Displays the status of your job, how much compute time it has accumulated, etc.
Many different options:
• A single job, a single cluster, all jobs that match a certain constraint, or all jobs
• Can view remote job queues (either individual queues, or “-global”)
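A few sketches of these options (the job IDs here are hypothetical; the -constraint form is as described in the manual):

```
% condor_q                                   # all of your jobs in the local queue
% condor_q 23                                # every job in cluster 23
% condor_q 23.5                              # a single job: cluster 23, process 5
% condor_q -constraint 'Owner == "wright"'   # jobs matching a ClassAd constraint
% condor_q -global                           # job queues from all submit machines
```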
26
Using a “User Log” file
A UserLog must be specified in your submit file:
• Log = filename
You get a log entry for everything that happens to your job:
• When it was submitted, when it starts executing, if it is checkpointed or vacated, if there are any problems, etc.
Very useful! Highly recommended!
27
Using condor_status
To view the status of the whole Condor pool, you use condor_status
Can use the “-run” option to see which machines are running jobs, as well as:
• The user who submitted each job
• The machine they submitted from
Can also view the status of various submitters with “-submitter <name>”
28
Using condor_rm
If you want to remove a job from the Condor queue, you use condor_rm
You can only remove jobs that you own (you can’t run condor_rm on someone else’s jobs unless you are root)
You can give specific job ID’s (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.
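For example (job IDs hypothetical):

```
% condor_rm 23      # remove every job in cluster 23
% condor_rm 23.5    # remove only job 23.5
% condor_rm -a      # remove all of your jobs
```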
29
Getting Email from Condor
By default, Condor will send you email when your job completes
If you don’t want this email, put this in your submit file:
notification = never
If you want email every time something happens to your job (checkpoint, exit, etc.), use this:
notification = always
30
Getting Email from Condor (cont’d)
If you only want email if your job exits with an error, use this:
notification = error
By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this:
notify_user = [email protected]
31
Using condor_history
Once your job completes, it will no longer show up in condor_q
Now, you must use condor_history to view the job’s ClassAd
The status field (“ST”) will have either a “C” for “completed”, or an “X” if the job was removed with condor_rm
32
Any questions?
Nothing is too basic
If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing
Hands-On Exercise #1
Submitting and Monitoring a Simple Test Job
34
Hands-On Exercise #1
Login to your machine as user “condor”
You will see two windows:
• Netscape, with instructions
• An xterm, where you execute commands
To begin, click on Simple Test Job
Please follow the directions carefully
Any lines beginning with % are commands that you should execute in your xterm
If you accidentally exit Netscape, click on “Tutorial” in the Start menu
Lunch break
Please be back by 13:30
Welcome Back
37
Classified Advertisements
ClassAds
• Language for expressing attributes
• Semantics for evaluating them
Intuitively, a ClassAd is a set of named expressions
• Each named expression is an attribute
Expressions are similar to C …
• Constants, attribute references, operators
38
Classified Advertisements: Example
MyType = "Machine"
TargetType = "Job"
Name = "froth.cs.wisc.edu"
StartdIpAddr="<128.105.73.44:33846>"
Arch = "INTEL"
OpSys = "SOLARIS26"
VirtualMemory = 225312
Disk = 35957
KFlops = 21058
Mips = 103
LoadAvg = 0.011719
KeyboardIdle = 12
Cpus = 1
Memory = 128
Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60
Rank = 0
39
Classified Advertisements: Matching
ClassAds are always considered in pairs:
• Does ClassAd A match ClassAd B (and vice versa)?
• This is called “2-way matching”
If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting “MY.” or “TARGET.” in front of the attribute name
40
Classified Advertisements: Examples
ClassAd A:
MyType = "Apartment"
TargetType = "ApartmentRenter"
SquareArea = 3500
RentOffer = 1000
HeatIncluded = False
OnBusLine = True
Rank = UnderGrad==False + TARGET.RentOffer
Requirements = MY.RentOffer - TARGET.RentOffer < 150

ClassAd B:
MyType = "ApartmentRenter"
TargetType = "Apartment"
UnderGrad = False
RentOffer = 900
Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
Requirements = OnBusLine && SquareArea > 2700
41
ClassAds in the Condor System
ClassAds allow Condor to be a general system
• Constraints and ranks on matches expressed by the entities themselves
• Only priority logic integrated into the Match-Maker
All principal entities in the Condor system are represented by ClassAds
• Machines, Jobs, Submitters
42
ClassAds in Condor: Requirements and Rank
(Example for Machines)
Friend = Owner == "tannenba" || Owner == "wright"
ResearchGroup = Owner == "jbasney" || Owner == "raman"
Trusted = Owner != "rival" && Owner != "riffraff"
Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
Rank = Friend + ResearchGroup*10
43
Requirements for Machine Example Described
Machine will never start a job submitted by “rival” or “riffraff”
If someone from ResearchGroup (“jbasney” or “raman”) submits a job, it will always run, regardless of keyboard activity or load average
If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3
44
Machine Rank Example Described
If the machine is running a job submitted by owner “foo”, it will give this a Rank of 0, since foo is neither a friend nor in the same research group
If “wright” or “tannenba” submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup is 0)
If “raman” or “jbasney” submits a job, it will have a rank of 10
While a machine is running a job, that job can be preempted in favor of a higher-ranked job
45
ClassAds in Condor: Requirements and Rank
(Example for Jobs)
Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20
Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )
46
Job Example Described
The job must run on an Intel CPU, running Linux, with at least 20 megs of RAM
All machines with 32 megs of RAM or less are Ranked at 0
Machines with more than 32 megs of RAM are ranked according to how much RAM they have, if the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in Million Instructions Per Second
47
Finding and Using the ClassAd Attributes in your Pool
Condor defines a number of attributes by default, which are listed in the User Manual (“About Requirements and Rank”)
To see if machines in your pool have other attributes defined, use:
• condor_status -long <hostname>
A custom-defined attribute might not be defined on all machines in your pool, so you’ll probably want to use “meta-operators”
48
ClassAd “Meta-Operators”
Meta-operators allow you to compare against “UNDEFINED” as if it were a real value:
• =?= is “meta-equal-to”
• =!= is “meta-not-equal-to”
• Color != “Red” (non-meta) would evaluate to UNDEFINED if Color is not defined
• Color =!= “Red” would evaluate to True if Color is not defined, since UNDEFINED is not “Red”
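For example, reusing the hypothetical Color attribute from above in a job’s submit description file:

```
# Matches only machines where Color is defined and is not "Red"
Requirements = ( Color != "Red" )

# Also matches machines where Color is not defined at all
Requirements = ( Color =!= "Red" )
```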
Hands-On Exercise #2
Submitting Jobs with Requirements and Rank
50
Hands-On Exercise #2
Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Requirements and Rank
• Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
If you exited Netscape, just click on “Tutorial” from your Start menu
51
Priorities In Condor
Two kinds of priorities:
• User Priorities
– Priorities between users in the pool to ensure fairness
– The lower the value, the better the priority
• Job Priorities
– Priorities that users give to their own jobs to determine the order in which they will run
– The higher the value, the better the priority
– Only matters within a given user’s jobs
52
User Priorities in Condor
Each active user in the pool has a user priority
Viewed or changed with condor_userprio
The lower the number, the better
A given user’s share of available machines is inversely related to the ratio between user priorities.
• Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
53
User Priorities in Condor, cont.
Condor continuously adjusts user priorities over time
• If you are allocated more machines than your priority warrants, your priority worsens
• If you are allocated fewer machines than your priority warrants, your priority improves
Priority Preemption
• Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…)
• Starvation is prevented
• Priority “thrashing” is prevented
54
Job Priorities in Condor
Can be set at submit-time in your description file with:
prio = <number>
Can be viewed with condor_q
Can be changed at any time with condor_prio
The higher the number, the more likely the job will run (only among the jobs of an individual user)
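For example (the job ID and the values here are hypothetical):

```
# In the submit description file:
prio = 5

# Later, from the shell, raise job 23.5's priority to 10:
% condor_prio -p 10 23.5
```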
55
Managing a Large Cluster of Jobs
Condor can manage huge numbers of jobs
Special features of the submit description file make this easier
Condor can also manage inter-job dependencies with condor_dagman
• For example: job A should run first; then run jobs B and C; when those finish, submit D; etc.
• We’ll discuss DAGMan later
56
Submitting a Large Cluster
Anywhere in your submit file, if you use $(Process), that will expand to the process number of each job in the cluster:
input = my_input.$(Process)
arguments = $(Process)
It is common to use $(Process) to specify InitialDir, so that each process runs in its own directory:
InitialDir = dir.$(Process)
57
Submitting a Large Cluster (cont’d)
Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit:
Queue 1000
A cluster is more efficient: your jobs will run faster, and they’ll use less space
Can only have one executable per cluster: different executables must be different clusters!
Hands-On Exercise #3
Submitting a Large Cluster of Jobs
59
Hands-On Exercise #3
Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Large Clusters
• Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
If you exited Netscape, just click on “Tutorial” from your Start menu
10 Minute Break
Questions are welcome….
61
Inter-Job Dependencies with DAGMan
DAGMan can be used to handle a set of jobs that must be run in a certain order
Also provides “pre” and “post” operations, so you can have a program or script run before each job is submitted and after it completes
Robust: handles errors and submit-machine crashes
62
Using DAGMan
You define a DAG description file, which is similar in function to the submit file you give to condor_submit
DAGMan restrictions:
• Each job in the DAG must be in its own cluster (this is a limitation we will remove in future versions)
• All jobs in the DAG must have a User Log and must share the same file
63
Format of the DAGMan Description File
# is a comment
First section names the jobs in your DAG and associates a submit description file with each job
Second (optional) section defines PRE and POST scripts to run
Final section defines the job dependencies
64
Example DAGMan Description File
# Example DAGMan input file
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B C
PARENT B C CHILD D
65
Setting up a DAG for Condor
Must create the DAG description file
Must create all the submit description files for the individual jobs
Must prepare any executables you plan to use
If you want, you can have a mix of Vanilla and Standard jobs
Must set up any PRE/POST commands or scripts you wish to use
66
Submitting a DAG to Condor
Once you have everything in place, to submit a DAG, you use condor_submit_dag and give it the name of your DAG description file
This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job with all the necessary command-line arguments
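For example, if the DAG description from the earlier slide were saved as diamond.dag (a hypothetical file name):

```
% condor_submit_dag diamond.dag
```

You can then monitor both condor_dagman itself and the jobs it submits with condor_q.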
67
Removing a DAG
Removing a DAG is easy:
• Just use condor_rm on the scheduler universe job (condor_dagman)
• On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG
• Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue
Hands-On Exercise #4
Using DAGMan
69
Hands-On Exercise #4
Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Using_DAGMan
• Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
If you exited Netscape, just click on “Tutorial” from your Start menu
70
What’s Wrong with my Vanilla Job?
Special requirements expressions for vanilla jobs
You didn’t submit it from a directory that is shared
Condor isn’t running as root (more on this later)
You don’t have your file permissions setup correctly (more on this later)
71
Special Requirements Expressions for Vanilla Jobs
When you submit a vanilla job, Condor automatically appends two extra Requirements:
• UID_DOMAIN == <submit_uid_domain>
• FILESYSTEM_DOMAIN == <submit_fs>
Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files
72
Special Requirements Expressions for Vanilla Jobs
By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system
If you don’t have an account on the remote system, Vanilla jobs won’t work
73
Shared File Systems for Vanilla Jobs
Just because you have AFS or NFS doesn’t mean ALL files are shared
• Initialdir = /tmp will probably cause trouble for Vanilla jobs!
You must be sure to set Initialdir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs
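A vanilla submit file fragment along these lines (the path is hypothetical, and assumed to be on a shared NFS/AFS volume visible to the execute machines):

```
Universe   = vanilla
InitialDir = /home/wright/shared/run_1
Queue
```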
74
Why Don’t My Jobs Run?
Try using condor_q -analyze
Try specifying a User Log for your job
Look at condor_userprio: maybe you have a bad priority and higher priority users are being served
Problems with file permissions or network file systems
Look at the SchedLog
75
Using condor_q -analyze
condor_q -analyze will analyze your job’s ClassAd, get all the ClassAds of the machines in the pool, and tell you what’s going on:
• Will report errors in your Requirements expression (impossible to match, etc.)
• Will tell you about user priorities in the pool (other people have better priority)
76
Looking at condor_userprio
You can look at condor_userprio yourself
If your priority value is a really high number (because you’ve been running a lot of Condor jobs), other users will have priority to run jobs in your pool
77
File Permissions in Condor
If Condor isn’t running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually “condor”)
You must grant this user write access to your output files, and read access to your input files (both STDOUT, STDIN from your submit file, as well as files your job explicitly opens)
78
File Permissions in Condor (cont’d)
Often, there will be a “condor” group and you can make your files owned and write-able by this group
For vanilla jobs, even if the UID_DOMAIN setting is correct, and they match for your submit and execute machines, if Condor isn’t running as root, your job will be started as user Condor, not as you!
79
Problems with NFS in Condor
For NFS, sometimes the administrators will set up read-only mounts, or have UIDs remapped for certain partitions (the classic example is root = nobody, but modern NFS can do arbitrary remappings)
80
Problems with NFS in Condor (cont’d)
If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir (the directory you were in when you ran condor_submit) might not exist on a remote machine
• E.g. you’re in /mnt/tmp/home/me/...
With automounting, you always need to specify InitialDir explicitly
• InitialDir = /home/me/...
81
Problems with AFS in Condor
If your pool uses AFS, the condor_shadow, even if it’s running with your UID, will not have your AFS token
• You must grant an unauthenticated AFS user the appropriate access to your files
• Some sites provide a better alternative than world-writable files:
– Host ACLs
– Network-specific ACLs
82
Looking at the SchedLog
Looking at the log file of the condor_schedd, the “SchedLog” file, can possibly give you a clue if there are problems
• Find it with:
condor_config_val schedd_log
• You might need your pool administrator to turn on a higher “debugging level” to see more verbose output
83
Other User Features
Submit-Only installation
Heterogeneous Submit
PVM jobs
84
Submit-Only Installation
Can install just a condor_master and condor_schedd on your machine
Can submit jobs into a remote pool
Special option to condor_install
85
Heterogeneous Submit
The job you submit doesn’t have to be the same platform as the machine you submit from
• Maybe you have access to a pool that’s full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain
You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1
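Such a submit file fragment might look like this (the binary name is hypothetical; the platform strings follow the slide’s ALPHA/OSF1 example):

```
Executable   = my_job.alpha
Requirements = Arch == "ALPHA" && OpSys == "OSF1"
Queue
```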
86
Parallel Jobs in Condor
Condor can run parallel applications
• Written to the popular PVM message passing library
• Future work includes support for MPI
Master-Worker Paradigm
What does Condor-PVM do?
How to compile and submit Condor-PVM jobs
87
Master-Worker Paradigm
Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.
Master
• has a pool of work, sends pieces of work to the workers, manages the work and the workers
Worker
• gets a piece of work, does the computation, sends the result back
88
What does Condor-PVM do?
Condor acts as the PVM resource manager. All pvm_addhost requests get re-mapped to Condor.
• Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.
When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.
89
How to compile and submit Condor-PVM jobs
Binary Compatible
• Compile and link with the PVM library just as normal PVM applications. No need to link with Condor.
Submit
• In the submit description file, set:
universe = PVM
machine_count = <min>..<max>
90
Obtaining Condor
Condor can be downloaded from the Condor web site at:
http://www.cs.wisc.edu/condor
Complete Users and Administrators manual available:
http://www.cs.wisc.edu/condor/manual
Contracted Support is available
Questions? Email: