GlideinWMS - indico.fnal.gov… · • Packaged as RPMs for RHEL6 and RHEL7 • Regular releases,...

Date post: 04-Oct-2020
Marco Mambelli DUNE Computing Model Workshop 9 September 2019 GlideinWMS

GlideinWMS

GlideinWMS is a pilot-based resource provisioning tool for distributed High Throughput Computing
• Provides reliable and uniform virtual clusters
• Submits Glideins to unreliable heterogeneous resources
• Leverages HTCondor
  - Provides HTCondor pools
  - Uses HTCondor capabilities

9/9/19 Marco Mambelli | GlideinWMS2

[Figure: GlideinWMS architecture diagram. Labels: Job Queue, Frontend, Factory, Glidein, Worker, Cluster, CE, AWS]

Glidein: node testing and customization

• Scouts for resources and validates the Worker node
  – Cores, memory, disk, GPU, …
  – OS, software installed
  – CVMFS
  – VO specific tests
• Customizes the Worker node
  – Environment, GPU libraries, …
  – Starting containers (Singularity, …)
  – VO specific setup
• Provides a reliable and customized execute node to HTCondor
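
The node validation step can be pictured with a minimal Python sketch. It is not GlideinWMS code: the `NodeRequirements` class, the `validate_node` function, the thresholds, and the `/cvmfs/example.org` path are all hypothetical names chosen for illustration. It only shows the idea of checking cores, disk, and required directories (such as a CVMFS mount) before offering the node to HTCondor.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class NodeRequirements:
    """Hypothetical minimum resources a VO asks for (not a GlideinWMS type)."""
    min_cores: int = 1
    min_free_disk_gb: float = 1.0
    required_dirs: tuple = ()  # e.g. a CVMFS repository mount point


def validate_node(req: NodeRequirements, scratch_dir: str = ".") -> list:
    """Return a list of failed checks; an empty list means the node passed."""
    failures = []

    # Core count as seen by the operating system.
    cores = os.cpu_count() or 0
    if cores < req.min_cores:
        failures.append(f"cores: found {cores}, need {req.min_cores}")

    # Free space on the filesystem that holds the scratch directory.
    free_gb = shutil.disk_usage(scratch_dir).free / 1e9
    if free_gb < req.min_free_disk_gb:
        failures.append(f"disk: {free_gb:.1f} GB free, need {req.min_free_disk_gb}")

    # Required software areas, checked only for presence.
    for path in req.required_dirs:
        if not os.path.isdir(path):
            failures.append(f"missing directory: {path}")

    return failures


if __name__ == "__main__":
    req = NodeRequirements(min_cores=1, min_free_disk_gb=0.001,
                           required_dirs=("/cvmfs/example.org",))  # hypothetical repository
    problems = validate_node(req)
    print("node OK" if not problems else "node rejected: " + "; ".join(problems))
```

Returning the list of failures rather than a single boolean keeps the reason for a rejection available for monitoring, which matters when the same test runs on many heterogeneous sites.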

Factory

• A Glidein Factory knows how to submit to sites
  – Sites are described in a local configuration
  – Only trusted and tested sites are included
• Each site entry in the configuration contains
  – Contact info (hostname, resource type, queue name)
  – Site configuration (startup dir, OS type, …)
  – VOs authorized/supported
  – Other attributes (Site name, core count, max memory, ...)
  – Glideins can also auto-detect resources
• Configuration can be auto-generated (e.g. from CRIC), admin curated, stored in VCS (e.g. GitHub)
• Condor does the heavy lifting of submissions.
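
As an illustration of what a site entry carries, the sketch below models one entry as a Python dataclass. This is an assumption made for readability, not the Factory's actual configuration format: the `SiteEntry` class, its field names, the host names, and the `entries_for_vo` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SiteEntry:
    """Hypothetical in-memory view of one Factory entry (field names are illustrative)."""
    name: str
    hostname: str          # contact info
    resource_type: str     # contact info
    queue: str             # contact info
    supported_vos: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)  # core count, max memory, ...


def entries_for_vo(entries, vo):
    """Select the entries that list the given VO as authorized/supported."""
    return [e for e in entries if vo in e.supported_vos]


entries = [
    SiteEntry("SiteA", "ce.site-a.example", "condor", "default",
              supported_vos={"dune", "cms"},
              attributes={"cores": 8, "max_memory_mb": 16000}),
    SiteEntry("SiteB", "ce.site-b.example", "arc", "long",
              supported_vos={"cms"},
              attributes={"cores": 1, "max_memory_mb": 2500}),
]

print([e.name for e in entries_for_vo(entries, "dune")])  # ['SiteA']
```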

Factory: Supported resources

• Remote or local clusters:
  – Can have batch systems other than HTCondor: PBS, SGE, Slurm, all supported.
• Grid sites (CREAM, ARC, HTCondor-CE)
• Hosted CEs
• Commercial cloud (AWS, Google)
• Open Source Cloud (OpenStack, OpenNebula)
• HPC sites
  – Uses an ssh-based system to ssh into HPC sites and submit directly from their login nodes.

Frontend

• Monitors jobs to see how many Glideins are needed
• Compares what entries (sites) are available
• Requests Glideins from the Factory
• Requests Factory to kill Glideins if there are too many
• Pressure-based system
  – Works by keeping a certain number of Glideins running or idle at the sites
  – Glidein requests are gradual to avoid spikes and overloads
• Manages credentials and delegates them to the Factory.
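
The pressure-based behaviour can be sketched as a small decision function. This is a simplified model under stated assumptions, not the Frontend's actual algorithm: the function name `glideins_to_request` and the `max_running`, `max_idle`, and `step` limits are hypothetical knobs. It shows the two ideas from the slide: keep a bounded number of Glideins waiting at a site, and change the request gradually.

```python
def glideins_to_request(idle_jobs, idle_glideins, running_glideins,
                        max_running=100, max_idle=10, step=5):
    """Return how many more Glideins to request this cycle (negative: remove some).

    All limits are hypothetical knobs for this sketch.
    """
    # Pressure: idle jobs not already covered by Glideins waiting at the site.
    wanted = idle_jobs - idle_glideins
    if wanted > 0:
        # Keep at most max_idle Glideins queued at the site.
        room_idle = max(0, max_idle - idle_glideins)
        # Never exceed the overall cap on Glideins at the site.
        room_total = max(0, max_running - running_glideins - idle_glideins)
        # Grow by at most `step` per cycle to avoid spikes.
        return min(wanted, step, room_idle, room_total)
    if idle_jobs == 0 and idle_glideins > 0:
        # No demand: ask to remove waiting Glideins, also gradually.
        return -min(idle_glideins, step)
    return 0


print(glideins_to_request(idle_jobs=50, idle_glideins=0, running_glideins=0))   # 5
print(glideins_to_request(idle_jobs=50, idle_glideins=10, running_glideins=0))  # 0
print(glideins_to_request(idle_jobs=0, idle_glideins=8, running_glideins=20))   # -5
```

With 50 idle jobs the request grows by only 5 per cycle, and it stops growing once 10 Glideins are already waiting, which is the sense in which the request is driven by pressure rather than by the raw job count.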

Distributed

• N-to-M relationship
  – Each Frontend can talk to many Factories
  – Each Factory may serve many Frontends
• Multiple User Pools
• High Availability replicas

S. Timm - FNAL-UK Planning Meeting GlideinWMS

Major GlideinWMS Deployments

• Beta version was called “GlideCAF” in CDF
  – Began testing in 2005
• CMS Global Pool: regularly 200000+ cores
  – Redundant master nodes at CERN and Fermilab
  – Combines production and analysis jobs
• FIFEBATCH / FermiGrid
  – Integrates 18000 on-site cores of FermiGrid with up to 12000 offsite cores.
  – This is what DUNE is using now for standard production and analysis
  – Demonstrated a pool with 2.01 million cores (NOVA 2018)
• Open Science Grid
  – Multi-VO structure shares the same Factory at UCSD
• HEPCloud
  – More in next talk

How is it used?

• Can be used directly
  - HTCondor
• Integrates well in hybrid systems
  - OSG-Connect
  - FermiGrid
• Used by workload/workflow managers:
  - JobSub
  - ProdAgent, CRAB
  - POMS
  - Pegasus
  - HEPCloud

Development and collaborations

• Add support for tokens
  – Use HTCondor token-auth for Glidein authentication
  – Support submission of Glideins to sites using sci-token
• Support of both Python 3 and Python 2
• Support more Frontend-like services
• Better and wider HPC support (e.g. multi-node jobs, LCF)
• Improved monitoring
• Strong collaboration with HTCondor
  – Blackhole prevention
  – Singularity invocation via HTCondor and condor_ssh_to_job
  – Use of tokens (security without x509 certificates)

Shared operations

• Factories are shared
  – Common resource on-boarding
  – Operators know the sites
  – Can solve site-dependent problems (network, middleware)
• Glideins are a common framework
  – Improvements for/from one VO are passed on to others
    • Adapting to new architectures or technologies
    • New types of containers
    • Black-hole prevention
    • Use of GPUs
  – The same Glidein can run multiple Jobs

GlideinWMS software

• Source on GitHub: https://github.com/glideinWMS/glideinwms
• Packaged as RPMs for RHEL6 and RHEL7
• Regular releases, compatibility and support statements
• OSG product, distributed via Open Science Grid yum repos
• Multiple project stakeholders (incl. OSG, Fermilab, CMS)
• Development team at Fermilab and UCSD/CERN
• Several groups have expertise in operating GlideinWMS infrastructure (OSG, Fermilab, CMS)
• Contact info:
  • Email: [email protected]
  • Doc: http://glideinwms.fnal.gov/

Thank you

• Acknowledgements
  – Some of the material was contributed by Steve Timm (GlideinWMS presentation for the FNAL-UK Planning Meeting)
• Additional slides follow
  – HTCondor intro
  – Technical description

Refresher - HTCondor

• An HTCondor pool is composed of 3 pieces

S. Timm - FNAL-UK Planning Meeting GlideinWMS

What is a Glidein?

• A Glidein is a properly configured execution node submitted as a Grid* job

S. Timm - FNAL-UK Planning Meeting GlideinWMS

What is GlideinWMS?

S. Timm - FNAL-UK Planning Meeting GlideinWMS

• GlideinWMS is an automated tool for submitting Glideins on demand

• Can increase or shrink the number of Glideins based on demand.

GlideinWMS Components

• Glidein – Configures and starts HTCondor execution daemons at sites
• Factory – Knows about the sites and does the submission
• Frontend – Knows about user jobs and the user requests to increase or decrease Glideins.

Runtime environment discovery and validation
Grid knowledge and troubleshooting
Site selection logic and job monitoring

HTCondor building blocks in GlideinWMS

• The Factory works with an HTCondor pool, the WMS pool, to submit Glideins to different resources
• The HTCondor Glideins are pilots that launch a startd that registers on a second HTCondor pool, the User pool
• User jobs are matched and execute on the resources
• The Frontend monitors the user schedds and notifies the Factory about the need for more Glideins
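
The interaction between the two pools can be followed in a toy simulation. This is only a sketch of the flow described above and does not use any HTCondor or GlideinWMS API: the `run_cycle` function, the `per_cycle` limit, and the `glidein-N` names are hypothetical, and Glideins are assumed to start instantly, which real pilots do not.

```python
from collections import deque


def run_cycle(idle_jobs: deque, free_slots: list, running: dict, per_cycle: int = 2):
    """One toy cycle of the four steps; returns the number of Glideins requested."""
    # Frontend: look at the user queue and tell the Factory how many Glideins are needed.
    requested = min(per_cycle, max(0, len(idle_jobs) - len(free_slots)))
    # Factory: submit that many Glideins; each one starts a startd that
    # registers as a free slot in the User pool.
    first_id = len(free_slots) + len(running)
    free_slots.extend(f"glidein-{first_id + i}" for i in range(requested))
    # User pool: match idle jobs to free slots.
    while idle_jobs and free_slots:
        running[free_slots.pop(0)] = idle_jobs.popleft()
    return requested


jobs = deque(["job-a", "job-b", "job-c"])
slots, running = [], {}
print(run_cycle(jobs, slots, running))  # 2
print(run_cycle(jobs, slots, running))  # 1
print(run_cycle(jobs, slots, running))  # 0
print(running)  # {'glidein-0': 'job-a', 'glidein-1': 'job-b', 'glidein-2': 'job-c'}
```

The user jobs never talk to the sites directly: they only ever see slots in the User pool, which is what makes the provisioned resources look like one uniform cluster.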

Partitioning in an overlay system

• Dimensions: Cores, Memory, Disk, Lifetime
• The resource partitions its Execute nodes
• GlideinWMS further partitions the resources it receives
• E.g. 64-core machine, 16- or 32-core cluster slots, 16- or 12-core Glideins in 4- or 2-core slots, 2- or 1-core jobs
• Issues where Glidein can help
  – Fragmentation (unused resources)
  – Flexibility (vs Complexity)
  – Under or over provisioning (overbooking or be prudent)
  – Scaling (big slots, fewer slices)
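
The fragmentation effect in the 64-core example can be worked out with a few lines of Python. The sketch considers cores only and assumes that each level is filled with as many whole pieces of the next level as fit; the `usable_cores` function is a hypothetical helper, not part of GlideinWMS.

```python
def usable_cores(machine_cores, layer_sizes):
    """Cores still usable after partitioning through each layer in turn.

    layer_sizes lists the piece size at each level, outermost first,
    e.g. [cluster slot, Glidein, Glidein slot, job].
    """
    pieces, size = 1, machine_cores
    for piece in layer_sizes:
        per_parent = size // piece  # whole pieces that fit in one parent
        pieces *= per_parent
        size = piece
    return pieces * size


# 64-core machine, 16-core cluster slots, 12-core Glideins, 4-core slots, 2-core jobs
print(usable_cores(64, [16, 12, 4, 2]))  # 48
# 64-core machine, 32-core cluster slots, 16-core Glideins, 4-core slots, 1-core jobs
print(usable_cores(64, [32, 16, 4, 1]))  # 64
```

In the first case a 12-core Glidein leaves 4 cores idle in every 16-core cluster slot, so only 48 of the 64 cores (75%) can run jobs; in the second case every level divides evenly and nothing is lost.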
