Page 1: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

High Throughput Computingwith Condor at Notre Dame

Douglas Thain

30 April 2009

Page 2: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Today’s Talk

• High Level Introduction (20 min)
  – What is Condor?
  – How does it work?
  – What is it good for?

• Hands-On Tutorial (30 min)
  – Finding Resources
  – Submitting Jobs
  – Managing Jobs
  – Ideas for Scaling Up

Page 3: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

The Cooperative Computing Lab

• We create software that enables the reliable sharing of cycles and storage capacity between cooperating people.

• We conduct research on the effectiveness of various systems and strategies for large scale computing.

• We collaborate with others who need to use large-scale computing, so as to find the real problems and make an impact on the world.

• We operate systems like Condor that directly support research and collaboration at ND.

http://www.cse.nd.edu/~ccl

Page 4: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

What is Condor?

• Condor is software from UW-Madison that harnesses idle cycles from existing machines. (Most workstations are ~90% idle!)

• With the assistance of CSE, OIT, and CRC staff, Condor has been installed on ~700 cores in Engineering and Science since early 2005.

• The Condor pool expands the capabilities of researchers to perform both cycle-intensive and storage-intensive research.

• New users and contributors are welcome to join!

http://condor.cse.nd.edu

Page 5: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.
Page 6: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.
Page 7: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Condor Distributed Batch System (~700 cores)

[Diagram: the ~700-core pool combines dedicated research clusters (ccl 8x1, cclsun 16x2, loco 32x2, sc0 32x2, netscale 16x2, cvrl 32x2, iss 44x2, compbio 1x8, netscale 1x32) labeled for MPI, Hadoop, biometrics, storage research, network research, timeshared collaboration, and batch capacity, plus personal workstations (Fitzpatrick 130, CSE 170, CHEG 25, EE 10, Nieu 20, DeBart 10) and support machines (www portals, login nodes, db server, central manager, greenhouse). Primary interactive users and batch users both feed the pool, which can also "flock" jobs to other Condor pools at Purdue (~10k cores) and Wisconsin (~5k cores).]

Page 8: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

http://www.cse.nd.edu/~ccl/viz

Page 9: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.
Page 10: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

The Condor Principle

• Machine Owners Have Absolute Control
  – Set who, what, and when can use the machine.
  – Can kick jobs off at any time manually.

• Default policy that satisfies most people:
  – Start a job if the console has been idle > 15 minutes.
  – Suspend the job if the console is used or the CPU is busy.
  – Kick off the job if it has been suspended > 10 minutes.
  – After that, jobs run in this order: owner, research group, Notre Dame, elsewhere.

For the full technical details, see:
http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml
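To give a feel for how such a policy is expressed, here is a minimal sketch of what the relevant condor_config expressions might look like. The macro names defined below are illustrative; this is not the actual ND configuration, which is documented at the URL above.

# Sketch only; see the policy URL above for the real ND settings.
MINUTE        = 60
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
NonCondorLoad = (LoadAvg - CondorLoadAvg)

# Start a job only if the console has been idle for 15 minutes.
START   = (KeyboardIdle > 15 * $(MINUTE))

# Suspend the job if the console is used or the machine's own load is high.
SUSPEND = (KeyboardIdle < $(MINUTE)) || ($(NonCondorLoad) > 0.5)

# Evict the job if it has been suspended for more than 10 minutes.
PREEMPT = (Activity == "Suspended") && ($(ActivityTimer) > 10 * $(MINUTE))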

Page 11: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

What’s the value proposition?

• If you install Condor on your workstations, servers, or clusters, then:
  – You retain immediate, preemptive priority on your machines, both batch and interactive.
  – You gain access to the unused cycles available on other machines.
  – By the way, other people get to use your machines when you are not using them.

Page 12: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

http://condor.cse.nd.edu

Page 13: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

http://condor.cse.nd.edu

Page 14: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

http://condor.cse.nd.edu

Page 15: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Condor Architecture

[Diagram: a schedd represents a user with jobs to run and tells the matchmaker "I want an INTEL CPU with > 3 GB RAM"; a startd represents an available machine and advertises "I prefer to run jobs owned by user 'joe'." The matchmaker replies "You two should talk to each other," and the schedd then runs the job on the startd, transferring the job and its files X and Y.]

Page 16: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

~700 CPUs at Notre Dame

[Diagram: one central matchmaker coordinates many startds (one per machine) and many schedds (one per submitter) across the ~700-CPU pool.]

Page 17: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Flocking to Other Sites

[Diagram: the 700-CPU Notre Dame pool flocks jobs to the 2,000-CPU pool at the University of Wisconsin and the 20,000-CPU pool at Purdue University.]

Page 18: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

What is Condor Good For?

• Condor works well on large workflows of sequential jobs, provided that they match the machines available to you.

• Ideal workload:
  – One million jobs that require one hour each.

• Doesn’t work at all:
  – An 8-node MPI job that must run now.

• Many workloads can be converted into the ideal form, with varying degrees of effort.

Page 19: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

High Throughput Computing

• Condor is not High Performance Computing.
  – HPC: Run one program as fast as possible.

• Condor is High Throughput Computing.
  – HTC: Run as many programs as possible before my paper deadline on May 1st.

Page 20: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Intermission and Questions

Page 21: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Getting Started:

If your shell is tcsh:
% setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH

If your shell is bash:
% export PATH=/afs/nd.edu/user37/condor/software/bin:$PATH

Then, create a temporary working space:
% mkdir /tmp/YOURNAME
% cd /tmp/YOURNAME
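To confirm that the tools are now on your PATH, you can check where condor_status lives and which version is installed:

% which condor_status
% condor_version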

Page 22: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Viewing Available Resources

• Condor Status Web Page:
  – http://condor.cse.nd.edu

• Command Line Tool:
  – condor_status
  – condor_status -constraint '(Memory > 2048)'
  – condor_status -constraint '(Arch == "INTEL")'
  – condor_status -constraint '(OpSys == "LINUX")'
  – condor_status -run
  – condor_status -submitters
  – condor_status -pool boilergrid.rcac.purdue.edu

Page 23: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

A Simple Script Job

#!/bin/sh

echo $@

date

uname -a

% vi simple.sh

% chmod 755 simple.sh

% ./simple.sh hello world

Page 24: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

% vi simple.submit

A Simple Submit File

universe = vanilla
executable = simple.sh
arguments = hello condor
output = simple.stdout
error = simple.stderr
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue

Page 25: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Submitting and Watching a Job

• Submit the job:
  – condor_submit simple.submit

• Look at the job queue:
  – condor_q

• Remove a job:
  – condor_rm <#>

• See where the job went:
  – tail -f simple.logfile
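If a job sits idle longer than you expect, condor_q can also explain why it has not been matched (where <#> is the job number shown by condor_q):

• Ask why a job is not running:
  – condor_q -analyze <#>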

Page 26: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

% vi simple.submit

Submitting Lots of Jobs

universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
output = simple.stdout.$(PROCESS)
error = simple.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue 50
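With queue 50, Condor submits 50 copies of this job and substitutes $(PROCESS) with 0 through 49, so each copy gets its own arguments and output files while all copies share the single log, for example:

simple.stdout.0   simple.stderr.0
simple.stdout.1   simple.stderr.1
...
simple.stdout.49  simple.stderr.49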

Page 27: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

What Happened to All My Jobs?

• http://condorlog.cse.nd.edu

Page 28: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Setting Requirements

• By default, Condor will only run your job on a machine with the same CPU and OS as the submitter.

• Use requirements to send your job to other kinds of machines:
  – requirements = (Memory > 2048)
  – requirements = (Arch == "INTEL" || Arch == "X86_64")
  – requirements = (MachineGroup == "fitzlab")
  – requirements = (UidDomain != "nd.edu")

• (Hint: Try out your requirements expressions using condor_status as above.)
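For example, to count the machines that would match a combined expression before putting it in your submit file (a sketch reusing the attributes above):

% condor_status -constraint '(Arch == "X86_64") && (Memory > 2048)'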

Page 29: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

Setting Requirements

• By default, Condor will assume any machine that satisfies your requirements is sufficient.

• Use the rank expression to indicate which machines you prefer:
  – rank = (Memory > 1024)
  – rank = (MachineGroup == "fitzlab")
  – rank = (Arch == "INTEL")*10 + (Arch == "X86_64")*20

Page 30: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

File Transfer

• Notes to keep in mind:
  – Condor cannot write to AFS. (no creds)
  – Not all machines in Condor have AFS.

• So, you must specify what files your job needs, and Condor will send them there:
  – transfer_input_files = x.dat, y.calib, z.library

• By default, all files created by your job will be sent home automatically.
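Putting it together, a submit file for a job that needs those input files might look like this (a sketch based on simple.submit from earlier; x.dat, y.calib, and z.library stand in for your own data):

universe = vanilla
executable = simple.sh
arguments = hello condor
transfer_input_files = x.dat, y.calib, z.library
should_transfer_files = yes
when_to_transfer_output = on_exit
output = simple.stdout
error = simple.stderr
log = simple.logfile
queue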

Page 31: High Throughput Computing with Condor at Notre Dame Douglas Thain 30 April 2009.

In Class Assignment

• Execute 50 jobs that run on a machine not at Notre Dame that has >1GB RAM.
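One possible starting point, combining the requirements and queue features from earlier (a sketch; adjust the expression as needed):

universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
requirements = (UidDomain != "nd.edu") && (Memory > 1024)
output = simple.stdout.$(PROCESS)
error = simple.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue 50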

