CANHEIT | On the EDGE | June 15-18, 2008 | University of Calgary

Presenter:

Dave Schulz
dschulz@ucalgary.ca

Research Computing Services

University of Calgary IT

University of Calgary EcoGrid


Example Job Types

General purpose – arbitrary Linux apps

Rendering video and still images

CHARMM

MATLAB

Maple

Parameter sweeps

etc.


Rendered Pictures Examples

http://hpc.ucalgary.ca/EcoGrid/pics

– Note: the page is bandwidth-heavy, even from on campus


What is EcoGrid?

Cycle scavenging system -- using otherwise idle CPU cycles to perform useful work

Most lab computers are powered on but idle for most of the night.


Consider:

– Assumptions:
• Idle from 6pm to 6am on weekdays – 12h/day
• Idle all weekend – 48h/week
• 2000 EcoGrid nodes

– Calculation:
• Idle time = 12h * 5 + 24h * 2 = 108 hours/week
• 108h/week = 6480 CPU minutes per week per node
• 2000 nodes * 6480 minutes/week = 12 960 000 CPU minutes/week!
• Or about 674 000 000 CPU minutes/year (12 960 000 * 52 weeks)

In contrast, the WestGrid Matrix cluster (128 nodes) running at 100% for one year would provide only 135 000 000 CPU minutes.
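The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch under the slide's own assumptions (12 idle hours per weekday night, a fully idle weekend, 2000 nodes, one CPU counted per node); the 52-weeks-per-year factor and the variable names are introduced here for illustration.

```python
# Assumptions taken from the slide; names are illustrative only.
WEEKNIGHT_IDLE_HOURS = 12      # 6pm to 6am, Monday to Friday
WEEKEND_IDLE_HOURS = 24 * 2    # all of Saturday and Sunday
NODES = 2000                   # assumed EcoGrid size
WEEKS_PER_YEAR = 52            # assumption: whole weeks, no holidays

# Idle time for a single node over one week.
idle_hours_per_week = WEEKNIGHT_IDLE_HOURS * 5 + WEEKEND_IDLE_HOURS
idle_minutes_per_week = idle_hours_per_week * 60

# Scale up to the whole grid, then to a year.
grid_minutes_per_week = NODES * idle_minutes_per_week
grid_minutes_per_year = grid_minutes_per_week * WEEKS_PER_YEAR

print(f"{idle_hours_per_week} idle hours per node per week")
print(f"{grid_minutes_per_week:,} CPU minutes per week")
print(f"{grid_minutes_per_year:,} CPU minutes per year")
```

With 52 weeks per year the yearly total comes to roughly 674 million CPU minutes, about five times the 135 000 000 CPU minutes quoted for the Matrix cluster.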


Goals (by July ’09)

1000+ nodes

Enough demand to consume 100% of the cluster

Full web based reporting and statistics

Other clusters connected
– Origin
– Terminus
– Matrix
– Lattice


Benefits

Huge untapped computing resource

Compute cycles available to the campus without the need to purchase more equipment

Cluster will always have some fairly current hardware

Efficiently using power already wasted by idle computers

Little – if any – impact on lab users


Drawbacks

More network utilization

Possible strain on the heat capacity of lab environmental systems

Somewhat increased electrical power draw
– Lab power systems should be able to supply this power, but may not under normal lab conditions


How is it done?

Using Condor and Innotek VirtualBox (Windows platforms)

Next build will use QEMU
– Checkpointing the machine
– Jobs survive nightly reboots

Using Condor natively (Mac/Linux platforms)


What is Condor?

Developed by the University of Wisconsin-Madison

Runs on many common operating systems, but jobs must be built for the operating system they run on

Windows is supported, but there is little demand for it from HPC applications

Condor project started in 1988


Why Condor?

Users retain full control over their computers

Provides job checkpointing, migration, and restart (with certain restrictions)

DAGMan – Directed Acyclic Graph job MANager
– Takes care of job dependencies
– Even allows portions of jobs to be run on completely dissimilar clusters
– Very easy to express job dependencies

Very resilient to network problems
– Jobs finish and wait until the network is restored to complete
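To illustrate how job dependencies might be expressed for DAGMan, here is a minimal Python sketch that writes out a DAG input file using the `JOB` and `PARENT ... CHILD` keywords. The job names, the `.sub` submit-file names, and the `build_dag` helper are all hypothetical and not taken from the EcoGrid setup.

```python
def build_dag(jobs, dependencies):
    """Return the text of a DAGMan input file.

    jobs: dict mapping a node name to its submit description file.
    dependencies: list of (parent, child) node-name pairs.
    """
    # Reject edges that mention a job that was never declared.
    for parent, child in dependencies:
        for name in (parent, child):
            if name not in jobs:
                raise ValueError(f"unknown job in dependency: {name}")

    lines = [f"JOB {name} {submit_file}" for name, submit_file in jobs.items()]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in dependencies]
    return "\n".join(lines) + "\n"


# Hypothetical workflow: prepare input, render two halves, merge the result.
jobs = {
    "prepare": "prepare.sub",
    "render_a": "render_a.sub",
    "render_b": "render_b.sub",
    "merge": "merge.sub",
}
dependencies = [
    ("prepare", "render_a"),
    ("prepare", "render_b"),
    ("render_a", "merge"),
    ("render_b", "merge"),
]

dag_text = build_dag(jobs, dependencies)
print(dag_text)
```

A file like this would typically be handed to `condor_submit_dag`, which then runs each job only after all of its parents have finished.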


What is QEMU?

Processor Emulator with extensions to quickly run code built for the host processor

Open Source

Runs Linux Guest on Windows™

Virtually undetectable to the Windows user
– Runs as a service; only visible in Task Manager as a running task


Image Size

Small node filesystem image (~20 MB), which kickstarts a full system on first boot.

Reinstalls can be triggered from the headnode, so software updates and fixes can be pulled in RPM form at the next reboot.


Networking

The central manager is unable to initiate direct TCP/IP connections to the nodes, so something else is required.

Options
– VPN
– IP Tunneling
– Connection Brokering


Networking Cont’d

We have chosen to use GCB – Generic Connection Brokering – which is a part of the Condor distribution.

The compute nodes establish and maintain a connection to the GCB at startup.

When the Central Manager needs to open a connection to the node, it contacts it via the GCB machine.
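The brokering idea can be pictured with a toy in-memory model. This is not GCB's actual protocol or API; the `Broker` and `ComputeNode` classes, the node name, and the message text are invented purely to show the direction in which connections are opened.

```python
class Broker:
    """Stands in for the GCB machine: holds connections that nodes opened outbound."""

    def __init__(self):
        self._connections = {}

    def register(self, node_name, handler):
        # Called by the node at startup; the node, not the manager, initiates this.
        self._connections[node_name] = handler

    def relay(self, node_name, message):
        # Called by the central manager; the message travels over the
        # connection the node already holds open to the broker.
        handler = self._connections.get(node_name)
        if handler is None:
            raise LookupError(f"{node_name} has no open connection to the broker")
        return handler(message)


class ComputeNode:
    """A node behind NAT that cannot be reached directly."""

    def __init__(self, name, broker):
        self.name = name
        broker.register(name, self.handle)  # outbound connection at startup

    def handle(self, message):
        return f"{self.name} received: {message}"


broker = Broker()
node = ComputeNode("lab-pc-01", broker)

# The central manager never contacts the node directly, only the broker.
reply = broker.relay("lab-pc-01", "start job 42")
print(reply)
```

The point of the sketch is that every connection is initiated from inside the NAT, so the manager only ever needs to reach the broker.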


Networking Cont’d

[Diagram labels: Condor Client (VirtualBox); Host OS; NAT / DHCP Server; Physical PC; Campus Network; ecogrid.ucalgary.ca; GCB Server; vulture.ucalgary.ca]

Vulture (a type of Condor) is the central manager. It coordinates all of the machines and the jobs they run.

Ecogrid is the submit machine, where users log in and submit their jobs.


Scalability

GCB nodes can be created as network load requires.


Where do we plan on using this?

Lab Computers (via VirtualBox/QEMU)

DTP Desktop Computers (via VirtualBox/QEMU)

Linux Labs (natively)

Other Clusters (via the Globus Interface)

Will provide one common interface to many clusters


Ideal Workload

• Serial jobs – possibly 2-processor jobs, depending on the available hosts

• Jobs that can be broken into smaller jobs

• Parameter sweeps

• Self-compiled code (to take advantage of checkpointing and restart)

• COMING SOON: MATLAB jobs!!
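As a sketch of what a parameter sweep might look like in practice, the Python below generates a Condor submit description with one queued job per parameter combination. The `simulate` executable, its `--name=value` flags, the output file names, and the choice of the vanilla universe are all assumptions made for illustration.

```python
from itertools import product


def sweep_submit_description(executable, parameters):
    """Build submit-description text with one job per parameter combination.

    parameters: dict mapping a parameter name to the list of values to try.
    """
    names = list(parameters)
    lines = ["universe = vanilla", f"executable = {executable}", ""]

    # product() yields every combination of the listed values.
    combinations = product(*(parameters[name] for name in names))
    for index, values in enumerate(combinations):
        args = " ".join(f"--{n}={v}" for n, v in zip(names, values))
        lines += [
            f"arguments = {args}",
            f"output = run_{index}.out",
            f"error = run_{index}.err",
            "queue",
            "",
        ]
    return "\n".join(lines)


# Hypothetical sweep: 3 temperatures x 2 pressures = 6 independent serial jobs.
submit_text = sweep_submit_description(
    "simulate",
    {"temperature": [280, 300, 320], "pressure": [1, 2]},
)
print(submit_text)
```

Each combination becomes an independent serial job, which is the shape of workload that suits scavenged lab machines.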


Timeline / Currently

• Currently:
• 80 IT Labs machines in the Elbow Room
• Hoping to roll out a number of Linux labs which have MATLAB installed before the end of summer


Going Forward

Phase II – Expansion of the project to non-UCIT labs


Web Portal Demo


Team Members

• Stephen Cartwright
• Robert Fridman
• Eric Merth
• David Schulz
• Carol Sin


Information Resources

• Condor Website: www.cs.wisc.edu/condor
• QEMU Website: bellard.org/qemu
• VirtualBox Website: www.virtualbox.org
• Local Website: hpc.ucalgary.ca/EcoGrid

