Date post: | 30-May-2018 |
Upload: | anand-vaidya |
8/14/2019 Job Management Systems SGE v1.4
Job Management Systems
SGE v1.4
Author: Anand Vaidya
Why use SGE?
Maintain order in a shared resource, like queuing up at a movie ticket counter rather than mobbing the counter.
Apply different usage policies: PhDs and Profs get better treatment than first-year grads.
Everyone gets a fair (!) share of the computing resource.
What is SGE?
SGE is distributed resource management software.
It provides users the means to submit computationally demanding tasks to the SGE system for transparent distribution of the associated workload.
What is SGE? Layman Terms
You have a collection of mostly idle Macs, Windows, Linux and Solaris machines.
You have plenty of computations or simulations to run.
Can we just use these machines to run those computations?
Who will manage this herd? SGE will...
SGE Overview
[Figure: users and their desktops/laptops; SGE (configs, rules); hosts where users' jobs run]
How does SGE work?
Users submit jobs to the Grid Engine.
Unless resources are immediately available, non-interactive jobs are kept in queues until resources to execute them become available.
Jobs are passed on to the available execution hosts.
Records of each job's progress through the system are kept and reported when requested.
[Figure: SGE master and shadows; execd daemons; job requests; results and errors; DRMAA client (applications)]
Supported OS
Linux (32- and 64-bit)
Solaris (SPARC and x64)
Windows (exec only)
OS X
AIX
HP-UX, IRIX, etc.
SGE Components
Hosts:
  Master (coordinates activities, holds queues)
  Shadow Master
  Execution (workers)
  Administration (sets up system, queues etc.)
  Submit (users can submit jobs from these)
SGE Components
Usually the master and admin host are the same machine.
Queues (defined by the administrator)
User and Administrator Commands
Daemons:
  sge_qmaster (Master Daemon)
  sge_schedd (Scheduler Daemon)
  sge_execd (Execution Daemon)
  sge_commd (Communication Daemon)
4 Job Types
Interactive jobs: user gets back a shell window
Batch jobs: just run once and store output for review later
Array jobs (aka parametric, e.g. image rendering)
Parallel (MPI) jobs: can't describe in one line :-(
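An array job runs the same script many times, each task with a different index. A minimal sketch, assuming the usual SGE behaviour that `qsub -t 1-10` sets `SGE_TASK_ID` for each task; the script name, the `frame_name` helper and the `frame_NNNN.png` naming are hypothetical and only illustrate the image-rendering case:

```shell
#!/bin/sh
# render_frames.sh (hypothetical name): one task of an array job.
# Submitted with something like:  qsub -t 1-10 -cwd render_frames.sh
# SGE sets SGE_TASK_ID to the index of the task being run.

# Map a task index to a zero-padded output file name (illustrative helper).
frame_name() {
    printf 'frame_%04d.png\n' "$1"
}

# Outside an array job the variable is unset or "undefined"; fall back to 1
# so the script can also be tried without a cluster.
task_id="${SGE_TASK_ID:-1}"
case "$task_id" in
    ''|undefined) task_id=1 ;;
esac

echo "task $task_id would render $(frame_name "$task_id")"
```

Each task picks its own slice of the work from the index, so no coordination between tasks is needed.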
Accessing...
GUI (qmon)
Command line / textual (qsub etc.)
Programmatic (DRMAA)
DRMAA = Distributed Resource Management Application API, where API = Application Programming Interface
Can you see the duplication? DRMA should have been sufficient...
What is a job?
Describes:
  What to run (program name)
  What environment is needed?
  What resources are needed (how many CPUs, how much RAM etc.)
  Email on completion?
  Send output of job to another file?
Queues and Instances
Queues are logical constructs shared by all hosts attached to the queue; they cannot run jobs themselves.
Queue instances actually reside on hosts and contain jobs.
The queue config is shared by all instances.
Each instance can have unique properties, different from the queue.
Installing...
Determine archs you will support and download appropriate packages.
Unpack tarballs
Write auto-install script
ssh $MASTER ; $SGE_ROOT/inst_sge -m -auto sge-auto.conf ; /etc/init.d/sgemaster start
ssh $SHADOW ; $SGE_ROOT/inst_sge -sm -auto sge-auto.conf ; /etc/init.d/sgemaster -shadowd start
$SGE_ROOT/inst_sge -x -auto sge-auto.conf ; psh compute /etc/init.d/sgeexecd start
Check: qhost
Done!
SGE Commands - qhost
What is the state of the cluster? How many nodes, type, load? What is my chance of getting a node?
[root@shark ~]# qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
shark-c00 lx24-amd64 2 2.02 3.9G 240.8M 4.0G 0.0
shark-c02 lx24-amd64 2 2.00 3.9G 214.9M 4.0G 0.0
shark-c03 lx24-amd64 2 1.76 3.9G 215.9M 4.0G 0.0
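To answer "how many nodes, what load?" at a glance, the qhost listing can be condensed with awk. This is only a sketch: the `summarise_qhost` function is a hypothetical helper, and it assumes the column order shown on this slide (NCPU in column 3, LOAD in column 4). It reads text on stdin so it can be tried without a cluster:

```shell
#!/bin/sh
# Summarise qhost-style output: execution hosts, total CPUs, mean load.
summarise_qhost() {
    awk '
        NR <= 2          { next }   # skip the header and the dashed rule
        $1 == "global"   { next }   # skip the pseudo-host line
        $3 ~ /^[0-9]+$/  { hosts++; cpus += $3; load += $4 }
        END {
            if (hosts > 0)
                printf "hosts=%d cpus=%d avg_load=%.2f\n", hosts, cpus, load / hosts
        }
    '
}

# Sample rows copied from this slide; on a live cluster: qhost | summarise_qhost
summarise_qhost <<'EOF'
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
shark-c00 lx24-amd64 2 2.02 3.9G 240.8M 4.0G 0.0
shark-c02 lx24-amd64 2 2.00 3.9G 214.9M 4.0G 0.0
shark-c03 lx24-amd64 2 1.76 3.9G 215.9M 4.0G 0.0
EOF
# prints: hosts=3 cpus=6 avg_load=1.93
```

Hosts that report "-" for NCPU (for example, unreachable nodes) are skipped by the numeric check on column 3.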
SGE Commands - qsub
Create a job script (myjob.sh)
Submit for execution:
$ qsub myjob.sh
Your job 742 ("myjob.sh") has been submitted.
Simplest job:
[vaidya@shark ~]$ cat myjob.sh
#!/bin/sh
sleep 10
date > /tmp/test1.out.txt
Variations: qsub -cwd myjob.sh
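Scripts that submit a job usually need its ID afterwards, for qstat or qdel. A small sketch, assuming qsub prints the confirmation line exactly as shown on this slide; `job_id_from_qsub` is a hypothetical helper, and the different confirmation line printed for array jobs is not handled:

```shell
#!/bin/sh
# Extract the numeric job ID from the confirmation line printed by qsub.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sample line from this slide; on a live cluster: qsub myjob.sh | job_id_from_qsub
echo 'Your job 742 ("myjob.sh") has been submitted.' | job_id_from_qsub
# prints: 742
```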
SGE Commands - qstat
Check the status of your job:
qstat ; qstat -f ;
qstat -u username ; qstat -j job_id
[root@shark ~]# qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
639 0.55500 HCPDIV7 test1 r 05/17/2006 10:16:31 all.q@shark-c00 1
658 0.55500 HCPDIV1 test1 r 05/17/2006 13:37:35 all.q@shark-c00 1
694 0.55500 FCCDVI test1 r 05/17/2006 23:52:19 all.q@shark-c02 1
695 0.55500 FCCDVI1 test1 r 05/17/2006 23:52:19 all.q@shark-c02 1
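A quick way to see where jobs have landed is to count running jobs per queue instance. This is a sketch under the assumption that qstat uses the default column layout from this slide (state in column 5 and, for running jobs, the queue instance in column 8); `count_running_per_queue` is a hypothetical helper:

```shell
#!/bin/sh
# Count running jobs per queue instance from qstat-style text on stdin.
count_running_per_queue() {
    awk 'NR > 2 && $5 == "r" { n[$8]++ }
         END { for (q in n) print q, n[q] }' | sort
}

# Sample rows from the qstat listing on this slide;
# on a live cluster: qstat | count_running_per_queue
count_running_per_queue <<'EOF'
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------
639 0.55500 HCPDIV7 test1 r 05/17/2006 10:16:31 all.q@shark-c00 1
658 0.55500 HCPDIV1 test1 r 05/17/2006 13:37:35 all.q@shark-c00 1
694 0.55500 FCCDVI test1 r 05/17/2006 23:52:19 all.q@shark-c02 1
695 0.55500 FCCDVI1 test1 r 05/17/2006 23:52:19 all.q@shark-c02 1
EOF
# prints:
# all.q@shark-c00 2
# all.q@shark-c02 2
```

Pending jobs have no queue column yet, which is why only rows in state r are counted.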
SGE Commands - qstat
Status of the job is indicated by letters as:
qw - waiting
t - transferring
r - running
s, S - suspended
R - restarted
T - threshold
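The state letters can be turned into words with a simple lookup when post-processing qstat output. A minimal sketch covering only the codes listed on this slide; `describe_state` is a hypothetical helper, and combined codes that qstat may print are reported as unknown:

```shell
#!/bin/sh
# Translate a single qstat state code into the description from the slide.
describe_state() {
    case "$1" in
        qw)  echo "waiting" ;;
        t)   echo "transferring" ;;
        r)   echo "running" ;;
        s|S) echo "suspended" ;;
        R)   echo "restarted" ;;
        T)   echo "threshold" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

for code in qw r T; do
    printf '%-3s %s\n' "$code" "$(describe_state "$code")"
done
```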
SGE Commands - qdel
Delete your job, if you wish:
qdel 743
vaidya has deleted job 743
SGE Commands - qmon
qmon is an X Window GUI tool to submit/delete/view jobs and configure the SGE system.
Example: submit a job using qmon
Click the Job Submission icon.
Click the Job Script file selection icon to open a file selection box and select your script file. Then click OK.
Click the Submit button at the bottom of the Job Submission dialog.
After a couple of seconds, you should be able to monitor your job in the Job Control dialog. Click the Job Control icon in the QMON control panel.
You first see it under Pending Jobs, and it quickly moves to Running Jobs after it gets started.
SGE Commands - qsh, qtcsh
Submit an interactive session request:
qlogin
qrsh
Ensure you have a valid X server running on your desktop. Allow remote X clients to display on your desktop.
Submit an interactive session request:
qsh
qtcsh
Note: using this feature needs additional configuration; it may not work otherwise.
SGE Commands - jobscript
Sample job script:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
date
sleep 10
env
date
SGE Commands - jobscript
Sample job script:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
$MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines myparallelprog.exe {infile.txt outfile.txt}
Jobscript useful directives
-cwd = change to current dir before running job
-j y = merge error with stdout
-r y = code is re-runnable
-N jname = set the job name
-l h_rt=00:30:00 = run job for max of 30 mins
-pe mpich = invoke parallel environment
-pe mpich-ib = use InfiniBand parallel environment
-pe mpich-eth = use Ethernet parallel environment
-V = carry all env variable settings
-M [email protected] = send email to this address
-m bes = send email at job begin, end and suspend
Jobscript useful directives
-A acctname_to_charge = account to charge
-a [[CC]yy]MMDDhhmm[.SS] = when to run
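Several of these directives can be combined in one script. A hedged sketch: the job name `demo_job` is made up, no `-M` address is given because none is assumed here, and `JOB_NAME` and `JOB_ID` are the environment variables SGE normally sets inside a running job:

```shell
#!/bin/bash
# To the shell, directives are ordinary comments; SGE reads the "#$" lines at submit time.
# Job name (hypothetical):
#$ -N demo_job
# Run in the submit directory and merge stderr into stdout:
#$ -cwd
#$ -j y
# Mark the job re-runnable and cap the run time at 30 minutes:
#$ -r y
#$ -l h_rt=00:30:00
# Mail at job begin, end and suspend:
#$ -m bes
#$ -S /bin/bash

# The fallbacks exist only so the script can be tried outside the cluster.
summary="job ${JOB_NAME:-demo_job} (id ${JOB_ID:-none}) on $(uname -n)"
echo "$summary"
```

Options given on the qsub command line can be used instead of the embedded directives, so the same script can be submitted with different settings.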
Admin Commands
Next few slides show commands useful for SGE admins (not users/researchers)
Admin Commands - qconf
In general,
qconf -s** to show config
qconf -m** to modify config
qconf -M** to import config from text file
qconf -d** to delete config
SGE Commands - qconf
Show:
complexes: qconf -sc
queues: qconf -sql
PE: qconf -spl
exec hosts: qconf -sel ; qconf -se c35
submit hosts: qconf -ss
admin hosts: qconf -sh
calendars: qconf -scall
configuration: qconf -sconf
user list: qconf -suserl
scheduler conf: qconf -ssconf
SGE Commands - qping
[anand@shark-c02 ~]$ qping -info shark-c01 537 execd 1
05/24/2006 21:57:34:
SIRM version: 0.1
SIRM message id: 1
start time: 05/24/2006 21:31:37(1148477497)
run time [s]: 1768
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 2
status: 0
info: dispatcher: R (0.04) | OK
Monitor: disabled
Acknowledgements & Copying
This material is based on my experience as well as material collected from SGE documentation.
This presentation can be redistributed as follows:
No commercial redistribution, e.g. as part of a for-profit CD-ROM or as part of your sales pitch. Seek my permission first.
Must attribute the document creator.
Share alike: if you use this document and enhance or modify it, share the modifications or the modified document.
Which means I apply: Creative Commons License, http://creativecommons.org/licenses/by-nc-sa/2.5/
The End
Thanks for your time. If you have any feedback, corrections or questions please contact me: Anand Vaidya, [email protected]
This document was created with OpenOffice on Linux. Email me if you want the odp file instead of the pdf.