Resource Management and Accounting Working Group

Page 1: Resource Management and Accounting Working Group

Scalable Systems Software Center

Resource Management and Accounting Working Group

Face-to-Face Meeting
Aug 26-27, 2004

Argonne, IL

Page 2: Resource Management and Accounting Working Group

Resource Management and Accounting Working Group

• Working group scope

• Progress since last face-to-face

• Future Work

• Other issues

Page 3: Resource Management and Accounting Working Group

Working Group Scope

The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting.

This working group will focus on the following software components:
• Queue Manager
• Scheduler
• Accounting and Allocation Manager
• Meta Scheduler

Other critical resource management components are being developed in the Process Management and Monitoring Working Group:

• Process Manager
• Cluster Monitor

Page 4: Resource Management and Accounting Working Group

Resource Management Component Architecture

[Figure: component architecture diagram. Components shown: Queue Manager, Allocation Manager, Node Monitor, Grid Scheduler, Cluster Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service, Event Manager. A color key assigns each component to a working group: Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services.]

Page 5: Resource Management and Accounting Working Group

Resource Management Prototype Demonstration

[Figure: prototype demonstration diagram. Components shown: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Cluster Scheduler, Process Manager, Discovery Service. A color key assigns each component to a working group: Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure.]

Numbered messages in the diagram:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation

This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.

Page 6: Resource Management and Accounting Working Group

General Progress

• Updated and implemented SSSRMAP v3 specifications
  – SSSRMAP Wire Protocol v3.0.3
    • Uses chunked HTTP transfer encoding
  – SSSRMAP Message Format v3.0.3
    • Moved condition, assignment and option values into body of Element (instead of in value attribute)
  – SSS Job Object v3.0.3
    • Added job properties in support of input/output, interactive jobs, dynamic jobs, suspend/resume, checkpoint/restart, resource limit enforcement, partitions, charges
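The wire protocol item above mentions chunked HTTP transfer encoding. As a rough illustration of what that framing looks like on the wire, the sketch below encodes and decodes a payload using the standard HTTP/1.1 chunked format (hexadecimal length, CRLF, data, CRLF, terminated by a zero-length chunk). The function names and the sample payload are assumptions made for this example; they are not taken from the SSSRMAP specification, and the sketch ignores trailers.

```python
def encode_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    """Frame a payload using HTTP/1.1 chunked transfer coding."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out = bytearray()
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        out += b"%X\r\n" % len(chunk)   # chunk length in hexadecimal
        out += chunk + b"\r\n"          # chunk data, then CRLF
    out += b"0\r\n\r\n"                 # zero-length chunk ends the body
    return bytes(out)


def decode_chunked(framed: bytes) -> bytes:
    """Reassemble a chunked body (chunk extensions are skipped, trailers ignored)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = framed.index(b"\r\n", pos)
        size = int(framed[pos:line_end].split(b";")[0], 16)
        pos = line_end + 2
        if size == 0:
            return bytes(body)
        body += framed[pos:pos + size]
        pos += size + 2                 # skip the data and its trailing CRLF


if __name__ == "__main__":
    # Placeholder payload, not a real SSSRMAP message.
    message = b"<Envelope>example payload</Envelope>"
    framed = encode_chunked(message, chunk_size=8)
    print(framed)
    print(decode_chunked(framed) == message)
```

Chunking lets a sender start transmitting a response before its total length is known, which is presumably why it suits large query results.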

Page 7: Resource Management and Accounting Working Group

General Progress

• Completed system testing for Second Alpha Release
  – on xtorc-sss, a RedHat 9.0 System
  – Included Maui, Bamboo, Warehouse, Gold, Process Manager, etc.
• Released second alpha versions of RMWG components
  – Fully implements version 3 of the SSSRMAP specification
    • Bamboo Queue Manager v0.9.6
    • Maui Scheduler v3.2.6p9 (production version)
    • Gold Accounting and Allocation Manager v1.0.a2.1
    • Warehouse System Monitor v0.7.0
• RMWG Webpage updated with Second Alpha release
  – Updated info, docs, downloads, etc.
  – Added an interactive FAQ engine (FAQOMATIC)

Page 8: Resource Management and Accounting Working Group

Cluster Scheduler Progress

• Completed merger of Maui 3.2 and Maui SSS
• Further added intrinsic support for SSS messages
  – client-server, allocation manager, queue manager, resource manager interfaces, callbacks
  – Status object
  – Error codes
• Enhanced support for SSS node and job objects
  – allocation manager, queue manager, resource manager interfaces
  – extended MCom library to support additional node and job object attributes
• Improved socket and XML call reliability and security (added buffer checking and detailed failure reporting)
• Built the SSS integration guide and updated Maui documentation

Page 9: Resource Management and Accounting Working Group

Queue Manager Progress

• Third release of Bamboo made available
• Supports basic SSSRMAP v3 message format
• Interactive job support finished and tested
• New submission client to handle LoadLeveler job scripts
• Packaging updated to separate out components required on the execution nodes
• Added support for job dependencies (i.e., chained jobs are now supported)
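To make the job dependency item above concrete, here is a minimal sketch of how chained jobs could be ordered and released. It is not Bamboo's implementation: the dictionary layout (job name mapped to the set of jobs it must wait for) and both function names are assumptions for illustration. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order in which every job follows its prerequisites."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError("circular job dependency") from exc


def ready_jobs(dependencies: dict[str, set[str]], completed: set[str]) -> list[str]:
    """Return jobs that have not run yet and whose prerequisites have all completed."""
    return sorted(
        job
        for job, prerequisites in dependencies.items()
        if job not in completed and prerequisites <= completed
    )


if __name__ == "__main__":
    # Hypothetical chain: preprocess -> simulate -> postprocess
    chain = {
        "preprocess": set(),
        "simulate": {"preprocess"},
        "postprocess": {"simulate"},
    }
    print(run_order(chain))
    print(ready_jobs(chain, completed={"preprocess"}))
```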

Page 10: Resource Management and Accounting Working Group

Queue Manager Progress

• PM interface updated to use scoping of signal
  – Job termination code changed to implement a “soft” kill (i.e., SIGTERM followed later by a SIGKILL, if needed)
• SSS suite was updated on cluster in Ames in July
  – Appears to resolve most known problems.
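The “soft” kill described above (SIGTERM first, SIGKILL only if the process does not exit) can be sketched with the Python standard library as follows. This is an illustration of the general pattern on a POSIX system, not the Bamboo or Process Manager code; the function name and the grace period are assumptions.

```python
import signal
import subprocess
import sys


def soft_kill(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a process to exit with SIGTERM, escalating to SIGKILL after a grace period.

    Returns the process return code (negative signal number on POSIX).
    """
    proc.terminate()                      # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                       # sends SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a job that has run past its wallclock limit.
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    code = soft_kill(job, grace_seconds=2.0)
    print("terminated by SIGTERM" if code == -signal.SIGTERM else f"return code {code}")
```

The grace period gives a well-behaved job the chance to flush output or write a checkpoint before it is forcibly removed.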

Page 11: Resource Management and Accounting Working Group

Accounting and Allocation Manager Progress

• Completed rewrite of Gold server and all business logic in Perl

• Significantly improved account/allocation design
• Created an account statement report
• Implemented hierarchical account nesting and tested trickle down deposits and trickle up charges
• Implemented and tested credit accounts
• Added support for auto-creation of users, projects and machines
• Implemented automatic recursive association deletion/undeletion
• Added support for query row limit, object aliases
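The slide does not define "trickle down deposits" or "trickle up charges", so the following is only one plausible reading, written as a toy model rather than a description of Gold: a deposit to a parent account passes a fixed share to each child and the parent keeps the remainder, while a charge against a child that exceeds its balance draws the shortfall from its ancestors. The class, the function names, and the share mechanism are all assumptions made for this sketch.

```python
class Account:
    """Toy account node; not Gold's data model."""

    def __init__(self, name: str):
        self.name = name
        self.balance = 0.0
        self.parent = None
        self.children = []            # list of (child account, deposit share)

    def add_child(self, child: "Account", share: float) -> None:
        child.parent = self
        self.children.append((child, share))


def deposit(account: Account, amount: float) -> None:
    """Trickle down: pass each child its share, keep the remainder here."""
    kept = amount
    for child, share in account.children:
        portion = amount * share
        deposit(child, portion)
        kept -= portion
    account.balance += kept


def charge(account: Account, amount: float) -> None:
    """Trickle up: draw from this account first, then from its ancestors."""
    chain = []
    node = account
    while node is not None:
        chain.append(node)
        node = node.parent
    if sum(n.balance for n in chain) < amount:
        raise ValueError("insufficient funds along the account hierarchy")
    remaining = amount
    for n in chain:
        taken = min(n.balance, remaining)
        n.balance -= taken
        remaining -= taken


if __name__ == "__main__":
    project, team_a, team_b = Account("project"), Account("team-a"), Account("team-b")
    project.add_child(team_a, 0.25)
    project.add_child(team_b, 0.25)
    deposit(project, 100.0)           # team-a 25, team-b 25, project keeps 50
    charge(team_a, 40.0)              # 25 from team-a, 15 from project
    print(project.balance, team_a.balance, team_b.balance)
```

Checking the whole ancestor chain before debiting keeps a failed charge from leaving balances partially drawn down.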

Page 12: Resource Management and Accounting Working Group

Accounting and Allocation Manager Progress

• Made compliant with SSSRMAP v3 specification
• Fully implemented response chunking
• Updated clients and Gold User’s Guide
• Completed Allocation, Reservation, Quotation, and ChargeRates portions of GUI
• Further simplified dependent module installation
• Updated Component and Application Binding docs (v3.0.3)
• Released Second Alpha release of Gold
  – Regression and system tested on RedHat 9.0 (xtorc-sss)
• Upgraded Gold on PNNL SGI cluster to the latest second alpha version

Page 13: Resource Management and Accounting Working Group

Grid Scheduler Progress

• Migrated grid scheduler interface to use SSS message format for all scheduler-grid scheduler interface calls
• Migrated Silver client commands to utilize SSS MCom XML library
• Enhanced global queue management
• Added diagnostic clients
• Verified new job management state machine

Page 14: Resource Management and Accounting Working Group

Grid Scheduler Progress

• Introduced three new SSS objects
  – developed new SSS time range object
  – defined and implemented support for cluster to grid scheduler interface reservation object
  – proposed new cluster/machine object for exchanging high level policy and resource availability information

Page 15: Resource Management and Accounting Working Group

Future Work

• Beta release of all components
  – Including new Silver Meta-scheduler
• Portability testing for new components
  – Tier 1: Linux::RedHat (9.0)
  – Tier 2: Linux::SuSE, AIX, Tru-64
  – Tier 3: OS-X, Unicos
  – Tier 4: HP-UX, IRIX, Solaris
• Fault Tolerance supporting 25% cluster loss
• Complete Design Specification documents for new components

Page 16: Resource Management and Accounting Working Group

Future Work

Cluster Scheduler

• Convert to using SSS job object for job submission and resource queries

• Integrate/test Checkpoint-Restart support

• Extend and mature the resource manager and grid scheduler interfaces

Page 17: Resource Management and Accounting Working Group

Future Work

Queue manager

• Add job group support (mainly for submission)

• Add Task Group support (in progress)

• Add Job Submission filter

Page 18: Resource Management and Accounting Working Group

Future Work

Accounting and Allocation manager

• Complete and test design for distributed accounting and multi-organizational involvement in job startup
• Add support for multi-site authentication/authorization (each site having its own symmetric key)
• Complete alpha version of GUI (fully featured)
• Beta release of Gold (fully functional multi-site version with GUI)
• Production deployment of Gold on 11.8TF Linux cluster (as primary allocation system) and several other sites as beta testers
• Documentation to include roles and custom objects
• Port Gold to other OS’s (Tiers 1 and 2)
• Create regression test suite (w/ APITest when ready)
• Performance and scalability testing

Page 19: Resource Management and Accounting Working Group

Future Work

Grid Scheduler

• First SSS release of Silver Grid Scheduler
• Add additional statistics clients (global information gathering and global policies)
• Fault tolerance improvements
• Add improved cluster level job start time estimations
• Initiate evaluation of peer-to-peer grid scheduling model
• Test support for Globus 3.x

Page 20: Resource Management and Accounting Working Group

Resource Limit Enforcement

• Bamboo: PBS JDL Specification, add support to PM
• Maui: Scheduler policies
• PM: Specification language and setting OS limits at job launch (Thanks!)
• Warehouse: Measure the metrics by session and job
• PM: Need session id/process id mapping
• Maui-Bamboo: Initialization Phase
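For the "setting OS limits at job launch" item above, the sketch below shows one generic POSIX approach: apply `setrlimit` in the child process just before it executes the job. It is not the Process Manager's mechanism; the function name and the dictionary of soft limits are assumptions for illustration. Note that Python's `preexec_fn` is not safe in multi-threaded programs, so this is suitable only as a demonstration.

```python
import resource
import subprocess
import sys


def launch_with_limits(argv: list[str], soft_limits: dict[int, int]) -> subprocess.Popen:
    """Start a process with the given soft resource limits, keeping the hard limits."""

    def apply_limits() -> None:
        # Runs in the child between fork and exec.
        for which, soft in soft_limits.items():
            _, hard = resource.getrlimit(which)
            resource.setrlimit(which, (soft, hard))

    return subprocess.Popen(
        argv, preexec_fn=apply_limits, stdout=subprocess.PIPE, text=True
    )


if __name__ == "__main__":
    # Hypothetical job limits: 60 s of CPU time and no core dumps.
    limits = {resource.RLIMIT_CPU: 60, resource.RLIMIT_CORE: 0}
    report = "import resource; print(resource.getrlimit(resource.RLIMIT_CPU)[0])"
    job = launch_with_limits([sys.executable, "-c", report], limits)
    output, _ = job.communicate()
    print("CPU soft limit seen by the job:", output.strip())
```

Because the limits are set inside the child, the kernel enforces them for the job and anything it spawns, without affecting the launcher.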

Page 21: Resource Management and Accounting Working Group

Dynamic Jobs

Malleable Jobs – Ability to change size and duration up until start
Dynamically Modifiable Jobs – Change attributes while idle or running
Dynamic Jobs – Job changes its size and duration itself while running

• Bamboo: Needs to add support for opaque extension attributes and QOS as well as dynamically modifiable jobs
• Maui: Policy support (growth bounds, QOS/queue support)
• PM: For dynamic jobs, MPI needs to handle growth/shrinkage and have that information reported to QM
• Warehouse: Aggregate statistics by session id, job id and process id
• (We need to know the model for dynamic job support with MPI)

Page 22: Resource Management and Accounting Working Group

Checkpoint/Restart

{Suspend/Resume, Preempt/Restart, Checkpoint/Continue}?
{System Initiated, User Initiated}?

• Bamboo: How to specify in JDL that a job is checkpointable (also maybe specify other parameters like filesystem, etc.)
• Bamboo-Maui: Needs to be able to keep track of how much walltime was used up before checkpoint and not count checkpoint idle time
• Maui: Policy handling
  – needs to know which resources are released when suspended
• Checkpoint Manager: Status from Berkeley? Can we reattempt checkpoint/restart test Thursday evening?
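The Bamboo-Maui item above asks for walltime bookkeeping that counts only the time a job actually ran and ignores the idle time between a checkpoint and the following restart. A minimal sketch of that arithmetic is shown here. The event names and the tuple layout are assumptions for this example and do not come from the SSS job object; only closed run intervals are counted, and events are expected in chronological order.

```python
def charged_walltime(events: list[tuple[float, str]]) -> float:
    """Sum the seconds spent running, skipping idle time after each checkpoint.

    events: chronological (timestamp_seconds, kind) pairs, where kind is one of
    "start", "checkpoint", "restart" or "end" (hypothetical event names).
    """
    total = 0.0
    running_since = None
    for timestamp, kind in events:
        if kind in ("start", "restart"):
            if running_since is not None:
                raise ValueError(f"{kind} while already running")
            running_since = timestamp
        elif kind in ("checkpoint", "end"):
            if running_since is None:
                raise ValueError(f"{kind} while not running")
            total += timestamp - running_since
            running_since = None
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return total


def remaining_walltime(limit_seconds: float, events: list[tuple[float, str]]) -> float:
    """Walltime still available to the job after the recorded run intervals."""
    return max(0.0, limit_seconds - charged_walltime(events))


if __name__ == "__main__":
    history = [(0, "start"), (600, "checkpoint"), (1500, "restart"), (1800, "end")]
    print(charged_walltime(history))          # 600 s + 300 s of actual run time
    print(remaining_walltime(3600, history))
```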

Page 23: Resource Management and Accounting Working Group

Other Issues

• Supercomputing demos

