Batch system usage The Zeuthen Farm J. Bazo 10.11.2010

General stuff

● Computing wikipage: http://dvinfo.ifh.de

● Central email address for questions & requests: [email protected]

● Data storage:
  ● AFS (/afs/ifh.de/group/amanda/scratch/ your scratch area)
  ● Lustre: fast parallel storage, use in batch jobs
  ● dCache (acs): mass storage, files on tape

The Zeuthen Farm

Resources:

● Batch farm: ~700 cores, all Intel Xeon 2.33-3.0 GHz, RAM: 2-4 GB/core
  ● All nodes run 64-bit SL5
  ● SUN Grid Engine (SGE) 6.2
  ● Shared between all groups (nic, that, amanda, z_nuastr, cta, pitz, atlas)

● Cluster: 1024 cores, 2.8 GHz, 3 GB/core (accessible for theory groups)

Other farms in Zeuthen:

● GridZn: atlas, lhcb, cta (9%), icecube (6%), etc. (you need a grid certificate)

● NAF (National Analysis Facility; Physics at the Terascale, Strategic Helmholtz Alliance): LHC experiments (cms, atlas, lhcb) and ILC

Using the SGE Batch Farm


Usage:

● 1. Split task into small jobs

● 2. Script them

● 3. Submit the job scripts (qsub your_farm_script.sh)

Script to submit:

#$ ...                            interpreted by batch system
#$ ...
#$ ...
#$ ...
------------------------------
ordinary shell script

Commands:

● qrsh: interactive access, for heavy ROOT sessions, moving data, ...

● qsub: batch jobs, for simulation, data processing, ...

Use the WorkingGroupServers only for small jobs. If more computing power is needed, use the farm.
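
The three usage steps (split, script, submit) can be sketched as follows. This is only an illustration: the helper make_chunks, the event counts and the idea of passing a start and end event number to the job script are assumptions, not part of the slides. The loop prints the qsub commands instead of running them.

```shell
#!/bin/sh
# Print one "START END" pair per job, covering events 0..TOTAL-1
# in chunks of PER_JOB events (the last chunk may be shorter).
# make_chunks is a hypothetical helper, not a farm tool.
make_chunks() {
    total=$1
    per_job=$2
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + per_job - 1))
        if [ "$end" -ge "$total" ]; then
            end=$((total - 1))
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

# Dry run: print the qsub command for each chunk instead of executing it.
# Remove the "echo" to really submit.
make_chunks 2500 1000 | while read -r start end; do
    echo qsub your_farm_script.sh "$start" "$end"
done
```

Inside your_farm_script.sh the two numbers would arrive as $1 and $2.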


What every script needs


Batch Job                             Comment

#!/bin/zsh
#$ -S /bin/zsh                        shell to be used
#$ -l h_cpu=00:29:00                  cpu time for this job
#$ -l h_vmem=850M                     maximum memory usage of this job
#$ -j y                               stderr and stdout are merged
#$ -m bea                             send mail on job's begin, end and abort (bea);
                                      discouraged, only for testing
#$ -o /afs/ifh.de/.../FarmMessages    redirect output (-o); regularly delete files,
                                      if directory is full, job will crash
#$ -cwd                               execute job from current directory
#$ -l os=sl5                          operating system; obsolete since all systems
                                      have SL5 (64bit)
#$ -P amanda                          project name; amanda has less priority
                                      than z_nuastr

What every script needs ... advice

Requesting resources

CPU time, 3 queues:

● Short: 30 min
● Medium: 12 hours
● Long: 48 hours

Within a queue's time range the requested value makes no difference, so always give the upper limit (e.g. 29 min)!

If a job lasts longer than requested, it will crash. Time your scripts before submitting!

Use the short queue, it is usually empty!

Memory:

Requesting less memory gives your job a higher priority.

If you request less memory than needed, your job will crash.

Always test!
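
As a sketch of the CPU-time advice, the function below maps a measured CPU time in seconds to an h_cpu request just under the limit of the smallest queue that fits. The queue limits (30 min, 12 h, 48 h) and the 00:29:00 value are from the slides. The function name h_cpu_for and the values 11:59:00 and 47:59:00 are assumptions made by analogy.

```shell
#!/bin/sh
# Map an expected CPU time in seconds to an h_cpu request just under
# the limit of the smallest queue that fits.
# h_cpu_for is a hypothetical helper; 11:59:00 and 47:59:00 are
# assumed by analogy with the 00:29:00 example on the slides.
h_cpu_for() {
    secs=$1
    if [ "$secs" -lt 1740 ]; then          # under 29 min: short queue
        echo "00:29:00"
    elif [ "$secs" -lt 43140 ]; then       # under 11 h 59 min: medium queue
        echo "11:59:00"
    elif [ "$secs" -lt 172740 ]; then      # under 47 h 59 min: long queue
        echo "47:59:00"
    else
        echo "job too long: split it into smaller jobs" >&2
        return 1
    fi
}

h_cpu_for 600     # a test run that took 10 minutes of CPU time
h_cpu_for 7200    # a test run that took 2 hours of CPU time
```

The result can be written into the "#$ -l h_cpu=..." line of the job script, or passed on the command line as qsub -l h_cpu=00:29:00 your_farm_script.sh.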


Shell script part

Shell script                Comment

SIG_EVT=$1 ...              Input parameters
hostname; date              some info you want in the stdout file
cd $TMPDIR                  always $TMPDIR, NOT /tmp !
cp .../infile ./            fetch input file(s)
your_program                run the actual job, output to $TMPDIR
cp outfile /lustre/...      store output file(s)

your_program example:

${WORK}/IceRec_v03/build64/env-shell.sh my_analysis $START $END $SIG_EVT $TIME
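
The shell script part can be tried out locally before it goes to the farm. The sketch below wraps the steps from the table in a function; run_job, its two arguments and the use of sort as a stand-in for your_program are assumptions for illustration only. On the farm SGE sets $TMPDIR; the /tmp fallback is only there so the function also runs on a desktop.

```shell
#!/bin/sh
# Body of a farm job, wrapped in a function so it can be tried locally.
# run_job, its arguments and "sort" (a stand-in for your_program) are
# hypothetical. The ( ) body runs in a subshell, so the cd does not
# affect the caller.
run_job() (
    infile=$1
    outdir=$2
    hostname; date                          # some info for the stdout file
    # Work in a private directory below $TMPDIR (set by SGE on the farm);
    # /tmp is only a fallback for local try-outs.
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || exit 1
    cd "$workdir" || exit 1
    cp "$infile" ./infile || exit 1         # fetch input file
    sort infile > outfile || exit 1         # run the actual job
    cp outfile "$outdir"/ || exit 1         # store output file
)

# Local try-out with throwaway files.
demo=$(mktemp -d)
printf 'b\na\n' > "$demo/in.txt"
run_job "$demo/in.txt" "$demo"
cat "$demo/outfile"
```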


Batch commands

● qsub : submit a job

● qstat : shows running/waiting jobs
  qstat -u jlbazo

  [oreade30] ~ % qstat -u jlbazo
  job-ID  prior   name       user   state submit/start at     queue            slots ja-task-ID
  ----------------------------------------------------------------------------------------------
  4095495 0.50003 JobAnalysi jlbazo r     11/08/2010 11:38:32 [email protected] 1

  qstat -ext -u jlbazo (extended information, e.g. project)

● qhost : shows status of execution hosts

● qdel : delete submitted jobs
  qdel job_ID (delete job)
  qdel -u jlbazo (delete all jobs from user)
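
For a quick summary of your own jobs, the qstat listing can be piped through a small filter. The sketch below counts jobs per state; it assumes the column layout of the sample above (two header lines, state in column 5) and job names without spaces. The function name count_states, the placeholder queue name and the two waiting jobs in the canned text are made up for the example.

```shell
#!/bin/sh
# Count jobs per state in "qstat -u USER" output read from stdin.
# Assumes two header lines and the state in column 5, as in the sample
# listing; count_states is a hypothetical helper.
count_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Try it on canned text; on the farm use: qstat -u "$USER" | count_states
count_states <<'EOF'
job-ID  prior   name       user   state submit/start at     queue      slots ja-task-ID
---------------------------------------------------------------------------------------
4095495 0.50003 JobAnalysi jlbazo r     11/08/2010 11:38:32 some_queue 1
4095496 0.50001 JobAnalysi jlbazo qw    11/08/2010 11:39:00            1
4095497 0.50001 JobAnalysi jlbazo qw    11/08/2010 11:39:01            1
EOF
```

Here r marks a running job and qw a job waiting in the queue.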


Monitoring the farm activity

Useful script from Adam: myjobs.awk (you can copy it from: http://www-zeuthen.desy.de/~jlbazo/myjobs.awk)

alias myjobs="qstat -g d|awk -f ~/myjobs.awk"

Gives a fast look at the farm usage by others and at your own jobs.


Monitoring and accounting in the SGE batch farm:

https://www-zeuthen.desy.de/dv-bin/batch/stat/sge

[Figures: in October; in 2010]


Batch Scheduler Strategy


Scheduling order

● SGE assigns tickets to each job.

● Job with most tickets is sent first.

● If requested resources are not available, next job in turn is tried.

● Project has big influence on scheduling policy.

● Number of tickets depends on the resources requested (mem, cpu).

● cpu parameter has a much bigger weight than mem parameter.

Request fewer resources for faster scheduling (short queue, less memory).

When in need, use project z_nuastr, but think about others.


Batch final recommendations

● Always send a few test jobs first

● Make sure you have sufficient filesystem quota for all job output

● Avoid:
  ● jobs writing same file
  ● too many jobs working in same directory
  ● writing too much to stdout/err

● Usually, transfer data at beginning/end of job only

● Most of the time, work on the local disk ($TMPDIR)

● Avoid mass failures, they cause mail storms

● Most common mistake: failure to request resources
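
One way to keep jobs from writing to the same file is to put the job number into the output file name. The sketch below assumes the environment variables JOB_ID and SGE_TASK_ID, which SGE sets inside a running job (SGE_TASK_ID reads "undefined" unless the job is an array task). The function name unique_name and the fallbacks for use outside the farm are made up for the example.

```shell
#!/bin/sh
# Build an output file name that is unique per job (and per array task).
# JOB_ID and SGE_TASK_ID are set by SGE inside a running job; the
# fallbacks only serve for trying the function outside the farm.
# unique_name is a hypothetical helper.
unique_name() {
    base=$1
    job=${JOB_ID:-local$$}
    task=${SGE_TASK_ID:-undefined}
    if [ "$task" = "undefined" ]; then
        echo "${base}_${job}.out"
    else
        echo "${base}_${job}_${task}.out"
    fi
}

# Simulate the environment of array task 7 of job 4095495 in a subshell.
( JOB_ID=4095495; SGE_TASK_ID=7; unique_name result )
```

In a job script this would be used as cp outfile /lustre/.../$(unique_name result), with the elided path filled in by the user.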

Resources

● Farm info: https://dvinfo.ifh.de/Batch_System_Usage

● Introduction to DESY-computing:
  ● http://www-zeuthen.desy.de/~wiesand/intro/intro10p1.pdf
  ● http://www-zeuthen.desy.de/~wiesand/intro/intro10p2.pdf