
Ian C. Smith

Date post: 06-Jan-2016
Page 1: Ian C. Smith

Ian C. Smith

Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs

Page 2: Ian C. Smith

Overview

Quick description of the University of Liverpool Condor Pool

Power saving at Liverpool

A home-grown approach to dealing with power-saving PCs

Power management using Condor 7.4.X

Implementing Condor power management

Results

Future directions

Page 3: Ian C. Smith

University of Liverpool Condor Pool

Contains around 300 machines running the University’s Managed Windows Service (XP, soon Windows 7).

Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine.

Single combined submit host / central manager running on Sun V445 SMP server.

Currently running Condor 7.0.2 on execute hosts (moving to 7.2.x soon).

Policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours

Jobs are killed rather than suspended

Page 4: Ian C. Smith

Power saving at Liverpool

We have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations.

Original power-saving policy was to “power-off” machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity

Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

Makes extensive use of the PowerMAN system from Data Synergy, comprising:

a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform

the Management Reporting Platform: a central server from which usage stats can be retrieved and viewed via a web browser
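
The savings figures quoted above can be cross-checked with a little arithmetic. The sketch below is not from the talk: the function names are made up for illustration, and the unit-cost estimate assumes the weekly saving holds for all 52 weeks of the year. Dividing 20-25 MWh by 200 000 – 250 000 machine-hours gives about 100 W per idle PC.

```python
def implied_watts(machine_hours, megawatt_hours):
    """Average power per idle PC implied by an energy saving over some machine-hours."""
    return megawatt_hours * 1_000_000 / machine_hours  # MWh -> Wh, then divide by hours


def implied_pence_per_kwh(annual_saving_gbp, weekly_mwh, weeks_per_year=52):
    """Unit cost implied by the annual saving (assumes the weekly figure holds all year)."""
    annual_kwh = weekly_mwh * 1_000 * weeks_per_year
    return annual_saving_gbp * 100 / annual_kwh


# Lower and upper ends of the ranges quoted on the slide.
for hours, mwh in [(200_000, 20), (250_000, 25)]:
    watts = implied_watts(hours, mwh)
    pence = implied_pence_per_kwh(125_000, mwh)
    print(f"{hours} h/week, {mwh} MWh/week: {watts:.0f} W per PC, {pence:.1f} p/kWh")
```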

Page 5: Ian C. Smith

Typical monthly Condor activity

Page 6: Ian C. Smith

A home-grown approach to power management

Two main problems to deal with:

how to ensure Condor jobs are not evicted by hibernating PCs

how to wake up dormant PCs to run Condor jobs on demand

PowerMAN service prevents job eviction:

can provide PowerMAN with a list of “protected programs” which ensures that the machine remains active while any of them is running

include the condor_starter process as a protected program (only present while a Condor job is running).

Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:

NICs must remain powered up during hibernation

NICs must be capable of waking machines on receipt of a “magic packet”

network must be able to route “magic packets”, not a problem for us but YMMV
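
For readers unfamiliar with the “magic packet” mentioned above, here is a minimal sketch of how one can be built and sent. The payload layout (six 0xFF bytes followed by the target MAC address repeated 16 times) is the standard Wake-on-LAN format; the function names, the broadcast address and UDP port 9 are illustrative assumptions, not details from the talk.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronisation stream of six 0xFF bytes, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Whether such a broadcast reaches a machine in another subnet depends on the routing caveat noted above.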

Page 7: Ian C. Smith

Adapting Condor for use with power-saving PCs

a cron job runs on the submit host and periodically examines the state of the queue (condor_status -schedd) and the pool (condor_status)

if there are more idle jobs in the queue than Unclaimed machines, then hibernating machines need to be woken up

find out the number of powered-up machines in each “teaching centre” (classroom)

estimate the number of hibernating machines in each teaching centre from the total number of machines in each

sort centres from highest number of available machines to lowest

wake up centres in turn until sufficient machines have been woken to meet the demand (or all centres have been woken up)

MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
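
The selection logic described above can be sketched as follows. This is an illustration rather than the actual Liverpool script: the `Centre` class and `centres_to_wake` function are hypothetical names, and “available machines” is read here as the estimated number of hibernating machines in a centre.

```python
from dataclasses import dataclass


@dataclass
class Centre:
    name: str
    total_machines: int  # machines known to be installed in the teaching centre
    powered_up: int      # machines currently visible in condor_status

    @property
    def estimated_hibernating(self) -> int:
        # Only an estimate: some machines may be switched off or broken.
        return max(self.total_machines - self.powered_up, 0)


def centres_to_wake(centres, idle_jobs, unclaimed_machines):
    """Return the names of centres to wake, largest estimated yield first."""
    shortfall = idle_jobs - unclaimed_machines
    if shortfall <= 0:
        return []  # enough Unclaimed machines already; nothing to wake
    ranked = sorted(centres, key=lambda c: c.estimated_hibernating, reverse=True)
    chosen, expected_woken = [], 0
    for centre in ranked:
        if expected_woken >= shortfall:
            break  # demand met
        if centre.estimated_hibernating == 0:
            continue  # nothing to gain from this centre
        chosen.append(centre.name)
        expected_woken += centre.estimated_hibernating
    return chosen
```

Because the ranking is deterministic, the same centres are always tried first, which is one of the drawbacks of this approach.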

Page 8: Ian C. Smith

Problems with the home-grown approach

Assumes that any job can run on any machine:

users cannot choose particular teaching centres or machines in their job Requirements

ideally, pool needs to be homogeneous

errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate again)

cron includes a “sanity check” for this

Can only estimate number of hibernating machines in each centre

Same machines get woken up first

Page 9: Ian C. Smith

Power management in Condor 7.4.X

Condor daemons can now place an execute host in a low-power state according to a given policy

Execute host signals to the Condor central manager that it is about to enter a low-power state

Central manager records persistent offline ClassAds for hibernating machines

Negotiator can perform matchmaking with offline ClassAds

Matches are passed to condor_rooster

condor_rooster pipes information to condor_power which wakes up machines using WoL

Page 10: Ian C. Smith

Implementing Condor power management

Still use PowerMAN to power down inactive PCs rather than using Condor

Need a way of advertising available offline machines to the condor_collector

If we know which machines are currently active (A) and which machines make up the pool in total (P), then the offline machines form the subset O = P - A

cron periodically advertises the offline machines and updates the timestamps (ClockMin / ClockDay)

Finding P (the total set of machines which are out there) turns out to be a very difficult problem
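
The relation O = P - A above is an ordinary set difference. A minimal sketch, assuming both sets are available as collections of hostnames (the function name and hostnames are illustrative):

```python
def offline_machines(pool, active):
    """Return the hostnames in the whole pool (P) that are not currently active (A)."""
    return sorted(set(pool) - set(active))


# Hypothetical hostnames, for illustration only.
pool = ["pc-01", "pc-02", "pc-03", "pc-04"]
active = ["pc-02", "pc-04"]
print(offline_machines(pool, active))
```

The computation is trivial once P is known; the hard part, as noted above, is obtaining a trustworthy P.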

Page 11: Ian C. Smith

How do we determine which machines are available to Condor?

Try waking them up!

Wake up all machines in each teaching centre once a week using WoL

After wakeup call, wait a few minutes and test each machine in turn with:

condor_status -direct <hostname>

Sanity check similar to UNIX ping

Record which machines respond and publish ClassAds for them
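
The weekly test described above might be scripted along these lines. This is a sketch, not the script used at Liverpool: the function names are hypothetical, and treating a zero exit status with non-empty output as “the machine responded” is an assumption about how condor_status behaves when a host cannot be reached.

```python
import subprocess


def responds_to_condor(hostname, timeout=60, run=subprocess.run):
    """Return True if `condor_status -direct <hostname>` appears to get an answer."""
    try:
        result = run(
            ["condor_status", "-direct", hostname],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # condor_status missing, or the query hung on a dead host.
        return False
    # Assumption: success means exit status 0 and at least one line of output.
    return result.returncode == 0 and bool(result.stdout.strip())


def probe_centre(hostnames, probe=responds_to_condor):
    """Return the hosts that answered, in the order they were tested."""
    return [host for host in hostnames if probe(host)]
```

The `run` and `probe` parameters exist only so that the logic can be exercised without a live pool.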

Page 12: Ian C. Smith

Unforeseen problems

Not all woken up machines begin to run jobs

number of wakeups is limited by our “roll-your-own” version of condor_power

condor_rooster originally attempted to wake up all offline machines which matched job requirements

Included another limit in our condor_power script (number of wakeups must be < number of idle jobs)

Condor 7.4.3 should fix this; 7.5.3 adds the ROOSTER_MAX_UNHIBERNATE configuration option

Wanted to wake up machines in random order so same machines not used repeatedly

Found that condor_negotiator ignored Rank values

Used condor_power script to implement this (“shuffles the deck”)

Should be fixed in 7.5.3 using ROOSTER_UNHIBERNATE_RANK config option
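
The two workarounds above (capping the number of wakeups and shuffling the order) could be combined roughly as follows. This is a sketch under assumptions: `select_wakeups` is a hypothetical name, and the limit is read as “wake another machine only while the number woken so far is below the number of idle jobs”.

```python
import random


def select_wakeups(candidates, idle_jobs, rng=random):
    """Pick machines to wake: random order, capped by the number of idle jobs."""
    shuffled = list(candidates)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)        # "shuffle the deck"
    selected = []
    for machine in shuffled:
        # Stop once the wakeups so far are no longer below the idle job count.
        if len(selected) >= idle_jobs:
            break
        selected.append(machine)
    return selected
```

Passing a seeded `random.Random` instance as `rng` makes the order reproducible for testing.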


Page 13: Ian C. Smith

Unforeseen problems / cont’d

Condor continued to wake up machines after jobs were removed (or completed)

Use

Unhibernate = CurrentTime - MachineLastMatchTime < 300

not

Unhibernate =!= Undefined

Difficult to distinguish Unclaimed offline machines from online ones in condor_status:

Also difficult to distinguish in Condor View graphs

to see all offline machines:

$ condor_status -constraint Offline==True

to see all powered-up machines:

$ condor_status -constraint Offline=!=True
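
The effect of the time-window form of Unhibernate recommended above can be illustrated in Python. This is only a model of the expression, not Condor code: `None` stands in for the ClassAd value Undefined, and the 300-second window is the figure from the slide.

```python
def unhibernate(current_time, machine_last_match_time, window=300):
    """Model of: Unhibernate = CurrentTime - MachineLastMatchTime < 300."""
    if machine_last_match_time is None:
        # In a ClassAd the comparison would evaluate to Undefined, which is not True.
        return False
    return current_time - machine_last_match_time < window


# A machine matched 100 s ago is still eligible for wakeup; after 400 s it is not.
print(unhibernate(1_000, 900), unhibernate(1_000, 600))
```

With a window, an old match stops triggering wakeups once it ages out, which is the behaviour wanted after jobs have been removed or have completed.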

Page 14: Ian C. Smith

Results – wakeup test

Page 15: Ian C. Smith

Future Directions

Condor power management will allow us to expand the pool to include even low-spec machines

If machines are not needed or are unsuitable they need not be woken up

Rank can be used so that newer (more energy-efficient) machines are used first

We would like a more accurate way of determining which machines are available. One possible method:

Record the amount of time since each machine last appeared in the pool and/or ran a job

Confidence in waking a PC can be described by a monotonically decreasing function of this

May still need to wake machines for testing occasionally

Encourage users to incorporate their own checkpointing code to reduce “badput” and energy wastage (see Liverpool Condor website for details).
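
One way to realise the “monotonically decreasing function” suggested above is exponential decay. The talk does not specify a functional form, so the choice of decay, the one-week half-life and the function name below are all assumptions made for illustration.

```python
def wake_confidence(hours_since_seen, half_life_hours=168.0):
    """Confidence (0 to 1) that a PC will respond to a wakeup, halving every half-life."""
    if hours_since_seen < 0:
        raise ValueError("hours_since_seen must be non-negative")
    return 0.5 ** (hours_since_seen / half_life_hours)


# Hypothetical machines: hours since each last appeared in the pool.
last_seen = {"pc-01": 2, "pc-02": 168, "pc-03": 1000}
for host, hours in sorted(last_seen.items(), key=lambda item: item[1]):
    print(host, round(wake_confidence(hours), 3))
```

Machines whose confidence falls below some threshold could be scheduled for an occasional test wakeup rather than being advertised as offline.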

Page 16: Ian C. Smith

Further Information

http://www.liv.ac.uk/e-science/condor

[email protected]

