Date posted: 19-Dec-2015
CANHEIT | On the EDGE | June 15-18, 2008 | University of Calgary
Presenter:
Dave [email protected]
Research Computing Services
University of Calgary IT
University of Calgary EcoGrid
Example Job Types
General purpose – arbitrary Linux applications
Rendering video and still images
CHARMM
Matlab
Maple
Parameter sweeps
etc.
Rendered Pictures Examples
http://hpc.ucalgary.ca/EcoGrid/pics
– Note: high bandwidth required, even on campus
What is EcoGrid?
Cycle scavenging system -- using otherwise idle CPU cycles to perform useful work
Most lab computers are powered on but idle for most of the night.
Consider:
– Assumptions:
• Idle from 6pm to 6am on weekdays – 12h/day
• Idle all weekend – 48h/week
• 2000 EcoGrid nodes
– Calculation:
• Idle time = 12h * 5 + 24h * 2 = 108 hours/week
• 108 h/week = 6480 CPU minutes per week
• 2000 nodes * 6480 minutes/week = 12 960 000 CPU minutes/week!
• Or about 674 000 000 CPU minutes/year (12 960 000 * 52 weeks)
By contrast, the WestGrid Matrix cluster (128 nodes) running at 100% for one year would have only 135 000 000 CPU minutes.
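The arithmetic on this slide can be checked with a short script. This is a minimal sketch using the slide's own assumptions; counting one CPU per node and 52 weeks per year are assumptions added here for illustration.

```python
# Assumptions taken from the slide.
WEEKNIGHT_IDLE_HOURS = 12   # idle 6pm to 6am on each weekday
WEEKDAYS = 5
WEEKEND_IDLE_HOURS = 48     # idle all weekend
NODES = 2000

# Assumptions added for this sketch (not on the slide).
CPUS_PER_NODE = 1
WEEKS_PER_YEAR = 52

idle_hours_per_week = WEEKNIGHT_IDLE_HOURS * WEEKDAYS + WEEKEND_IDLE_HOURS
idle_minutes_per_week = idle_hours_per_week * 60
cluster_minutes_per_week = NODES * CPUS_PER_NODE * idle_minutes_per_week
cluster_minutes_per_year = cluster_minutes_per_week * WEEKS_PER_YEAR

print(f"Idle hours per node per week: {idle_hours_per_week}")
print(f"CPU minutes per node per week: {idle_minutes_per_week}")
print(f"CPU minutes per week, all nodes: {cluster_minutes_per_week:,}")
print(f"CPU minutes per year, all nodes: {cluster_minutes_per_year:,}")
```

Changing `NODES` or the idle-hour constants shows how sensitive the total is to how many machines are enrolled and how long they actually sit idle.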
Goals (by July ‘09)
1000+ nodes
Enough demand to consume 100% of the cluster
Full web based reporting and statistics
Other clusters connected
– Origin
– Terminus
– Matrix
– Lattice
Benefits
Huge untapped computing resource
Compute cycles available to the campus without the need to purchase more equipment
Cluster will always have some fairly current hardware
Makes productive use of power already being wasted by idle computers
Little – if any – impact on lab users
Drawbacks
More network utilization
Possible strain on the cooling capacity of lab environmental systems
Somewhat increased electrical power draw
– Lab power system should be able to supply this power but may not under normal lab conditions
How is it done?
Using Condor and Innotek VirtualBox (Windows platforms)
Next build will use QEMU
– Checkpointing the machine
– Jobs survive nightly reboots
Using Condor natively (Mac/Linux platforms)
What is Condor?
Developed by the University of Wisconsin-Madison
Runs on many common operating systems, but the jobs must be designed for that operating system
Windows is supported, but there is little demand for HPC applications on it
Condor project started in 1988
Why Condor?
Users retain full control over their computers
Provides job checkpointing, migration, and restart (with certain restrictions)
DAGMan – Directed Acyclic Graph job MANager
– Takes care of job dependencies
– Even allows portions of jobs to be run on completely dissimilar clusters
– Very easy to express job dependencies
Very resilient to network problems
– Jobs finish and then wait until the network is restored to report completion
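To illustrate how job dependencies are expressed for DAGMan, the sketch below generates the text of a small DAG input file using the `JOB` and `PARENT ... CHILD` keywords. The job names and submit-file names are hypothetical, chosen only for this example; they do not come from the EcoGrid project.

```python
def build_dag(jobs, dependencies):
    """Return DAGMan input-file text.

    jobs: mapping of job name -> Condor submit description file
    dependencies: mapping of parent job name -> list of child job names
    """
    lines = [f"JOB {name} {submit_file}" for name, submit_file in jobs.items()]
    for parent, children in dependencies.items():
        lines.append(f"PARENT {parent} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


# Hypothetical workflow: prepare the input, render two halves, combine them.
jobs = {
    "prepare": "prepare.sub",
    "render_a": "render_a.sub",
    "render_b": "render_b.sub",
    "combine": "combine.sub",
}
dependencies = {
    "prepare": ["render_a", "render_b"],
    "render_a": ["combine"],
    "render_b": ["combine"],
}

dag_text = build_dag(jobs, dependencies)
print(dag_text)
```

Saved to a file, text like this would typically be handed to `condor_submit_dag`, which runs each job only after all of its parents have finished.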
What is QEMU?
Processor Emulator with extensions to quickly run code built for the host processor
Open Source
Runs Linux Guest on Windows™
Virtually undetectable to the Windows user
– Runs as a service; only visible in Task Manager as a running task
Image Size
Small node filesystem image (~20 MB) which kickstarts a full system upon first bootup.
Reinstalls can be triggered from the headnode, so software updates and fixes can be pulled in RPM form at the next reboot.
Networking
The central manager cannot initiate direct TCP/IP connections to the nodes (each node sits behind NAT on its host PC), so another mechanism is required.
Options
– VPN
– IP Tunneling
– Connection Brokering
Networking Cont’d
We have chosen to use GCB – Generic Connection Brokering – which is a part of the Condor distribution.
The compute nodes establish and maintain a connection to the GCB at startup.
When the Central Manager needs to open a connection to the node, it contacts it via the GCB machine.
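The brokering idea can be sketched with plain sockets: the node dials out to a broker and keeps that connection open, and the manager reaches the node by asking the broker to relay a message. This is a toy illustration only, not GCB's actual protocol; the `REGISTER` and `SEND` verbs, the class and function names, and the node name `lab-pc-01` are all invented for this example.

```python
import socket
import threading


def recv_line(sock):
    """Read bytes up to a newline and return the decoded line without it."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            break
        buf += chunk
    return buf.decode().strip()


class Broker:
    """Toy broker: nodes dial in and stay connected; managers ask for a relay."""

    def __init__(self):
        self.server = socket.socket()
        self.server.bind(("127.0.0.1", 0))  # any free local port
        self.server.listen()
        self.port = self.server.getsockname()[1]
        self.nodes = {}  # node name -> open socket created by the node
        self.lock = threading.Lock()
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def _accept_loop(self):
        while True:
            conn, _ = self.server.accept()
            threading.Thread(target=self._handle, args=(conn,), daemon=True).start()

    def _handle(self, conn):
        parts = recv_line(conn).split(" ", 2)
        if len(parts) < 2:
            conn.close()
            return
        verb, name = parts[0], parts[1]
        if verb == "REGISTER":
            with self.lock:
                self.nodes[name] = conn
            conn.sendall(b"OK\n")
            return  # keep the node's outbound connection open
        payload = parts[2] if len(parts) > 2 else ""
        with self.lock:
            node = self.nodes.get(name)
        if verb == "SEND" and node is not None:
            node.sendall((payload + "\n").encode())           # relay to the node
            conn.sendall((recv_line(node) + "\n").encode())   # relay the reply
        else:
            conn.sendall(b"ERROR\n")
        conn.close()


def run_node(broker_port, name, ready):
    """Compute node: opens an OUTBOUND connection and waits for relayed work."""
    sock = socket.create_connection(("127.0.0.1", broker_port))
    sock.sendall(f"REGISTER {name}\n".encode())
    recv_line(sock)  # wait for the broker's OK
    ready.set()
    while True:
        request = recv_line(sock)
        if not request:
            break
        sock.sendall(f"{name} ran: {request}\n".encode())


def manager_send(broker_port, name, message):
    """Central manager: reaches the node only through the broker."""
    with socket.create_connection(("127.0.0.1", broker_port)) as sock:
        sock.sendall(f"SEND {name} {message}\n".encode())
        return recv_line(sock)


broker = Broker()
ready = threading.Event()
threading.Thread(
    target=run_node, args=(broker.port, "lab-pc-01", ready), daemon=True
).start()
ready.wait(timeout=5)
reply = manager_send(broker.port, "lab-pc-01", "start job 42")
print(reply)
```

The point of the design is that every connection involving the node is one the node itself opened, so nothing has to traverse the NAT in the inbound direction.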
Networking Cont’d
[Diagram: Condor client (VirtualBox) on a physical PC, behind the host OS NAT / DHCP server, connected over the campus network. Labels on the campus network side: ecogrid.ucalgary.ca, GCB Server, vulture.ucalgary.ca]
Vulture (a type of Condor) is the central manager. It coordinates all of the machines and the jobs they run.
Ecogrid is the submit machine, where users log in and submit their jobs.
Scalability
GCB nodes can be created as network load requires.
Where do we plan on using this?
Lab Computers (via VirtualBox/QEMU)
DTP Desktop Computers (via VirtualBox/QEMU)
Linux Labs (natively)
Other Clusters (via the Globus Interface)
Will provide one common interface to many clusters
Ideal Workload
• Serial jobs -- possibly 2-processor, depending on the available hosts
• Jobs that can be broken into smaller jobs
• Parameter sweeps
• Self-compiled (to take advantage of checkpointing and restart)
• COMING SOON: Matlab jobs!!
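A parameter sweep maps naturally onto many independent serial jobs. The sketch below generates a Condor submit description that queues one job per parameter value. The executable name `simulate`, the file names, and the parameter values are hypothetical; it also uses the vanilla universe for simplicity, which does not provide the checkpointing mentioned above.

```python
def sweep_submit_description(executable, values):
    """Return Condor submit-description text with one queued job per value."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = sweep.log",
    ]
    for i, value in enumerate(values):
        lines += [
            f"arguments = {value}",      # the swept parameter for this job
            f"output = sweep_{i}.out",   # per-job stdout
            f"error = sweep_{i}.err",    # per-job stderr
            "queue",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical sweep over three values of a single parameter.
submit_text = sweep_submit_description("simulate", [0.1, 0.2, 0.3])
print(submit_text)
```

Each `queue` statement submits one job with the settings in effect at that point, so the jobs can land on whichever idle hosts happen to be available.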
Timeline / Currently
• Currently:
• 80 IT Labs machines in the Elbow Room
• Hoping to roll out a number of Linux labs which have Matlab installed before the end of summer
Going Forward
Phase II – Expansion of project to non-UCIT labs
Web Portal Demo
Team Members
• Stephen Cartwright
• Robert Fridman
• Eric Merth
• David Schulz
• Carol Sin
Information Resources
• Condor Website: www.cs.wisc.edu/condor
• QEMU Website: bellard.org/qemu
• VirtualBox Website: www.virtualbox.org
• Local Website: hpc.ucalgary.ca/EcoGrid