Farms User Meeting April 27 2005--Steven Timm

Farms Users Meeting 4/27/2005

http://www-oss.fnal.gov/scs/farms/farm_users

Agenda

• Events on farm in past two weeks

• Scheduled downtimes

• New users
  – M. Kostin, Accel. Division
  – A. Lebedev, E907/MIPP

• Existing user reports

• Special presentation: Upcoming transition of General Purpose Farms to Condor and Grid

Issues in last 2 weeks

• Thermal problems in LCC over weekend, no nodes went down.

• Down nodes on CDF farm: 1 out of 98 FBSNG, 1 out of 72 condor/CAF

• Down nodes on D0 farm: 12 out of 444 nodes

• Down nodes on GP farm: 0 out of 102

• GP Farms networking was upgraded to gigabit on all nodes that are capable

Downtimes

• GP Farms: none scheduled

• D0 farms: moving 3 racks of worker nodes to GCC, to be scheduled

• CDF Farms: upgrade of condor/CAF nodes to SLF304 in progress

General Purpose Farms Allocations

Queue        Process type  Share  QPrio  Time (GHz-hr)  Quota (1 CPU = 100)
Accel        Accel_Worker  3      0      -              5200
Auger        Auger_Worker  2.5    0      -              6400
Dark Energy  DES           2.5    0      -              4000
E898         E898_Worker   3.0    500    -              10000
E898 Short   E898_Short    3.0    1000   12             6400
E907         e907          2.5    0      -              1000
KTeVFast     KTeV Fast     (inf)  9000   -              n/a
KTeVLong     KTeV_Long     1.0    0      -              10000
KTeV         KTeV_Medium   3.0    1000   6              6400
Minos        Minos         3.0    500    -              10000
MinosShort   Minos_Short   3.0    1000   12             6400
Run2MC       Run2MC        1.5    0      -              2000
SDSS         Image         3.0    1000   12             10000
SDSS         Spectro       3.0    1000   -              2400
Theory       Theory        1.5    0      -              5000
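
To make the allocation numbers easier to read, here is a minimal Python sketch that converts each quota into a CPU count using the "1CPU=100" convention from the table header, and normalizes the Share column into fractions. Treating shares as relative weights that sum to one is an assumption made for illustration, not a statement of how FBSNG schedules; the dictionary keys are just labels, KTeVFast is left out because its share is "(inf)" and its quota "n/a", and the 2400 for SDSS Spectro is read here as a quota.

```python
# Label -> (share, quota), copied from the allocations table.
# KTeVFast is omitted: share "(inf)", quota "n/a".
# ASSUMPTION: the 2400 on the SDSS Spectro row is a quota, not a time limit.
ALLOCATIONS = {
    "Accel": (3.0, 5200),
    "Auger": (2.5, 6400),
    "Dark Energy": (2.5, 4000),
    "E898": (3.0, 10000),
    "E898 Short": (3.0, 6400),
    "E907": (2.5, 1000),
    "KTeVLong": (1.0, 10000),
    "KTeV": (3.0, 6400),
    "Minos": (3.0, 10000),
    "MinosShort": (3.0, 6400),
    "Run2MC": (1.5, 2000),
    "SDSS Image": (3.0, 10000),
    "SDSS Spectro": (3.0, 2400),
    "Theory": (1.5, 5000),
}

QUOTA_UNITS_PER_CPU = 100  # from the table header: 1CPU=100


def quota_in_cpus(quota):
    """Convert a quota value into the number of CPUs it represents."""
    return quota / QUOTA_UNITS_PER_CPU


def normalized_shares(allocations):
    """ASSUMPTION: shares are relative weights; return each as a fraction of the total."""
    total = sum(share for share, _quota in allocations.values())
    return {name: share / total for name, (share, _quota) in allocations.items()}


if __name__ == "__main__":
    fractions = normalized_shares(ALLOCATIONS)
    for name, (share, quota) in ALLOCATIONS.items():
        print(f"{name:13s} share={fractions[name]:.3f} quota={quota_in_cpus(quota):.0f} CPUs")
```

Under these assumptions a quota of 5200 corresponds to 52 CPUs, and a queue with share 3.0 holds 3.0/35.5 of the total finite share.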

GRID on General Purpose Farms

• Executive Summary:
  – A 14-node test cluster is available for testing Condor and grid jobs now
  – Plan tentatively to add new nodes to Condor/grid cluster this summer
  – Hope to complete transition to Condor batch system by end of calendar year 2005
  – Local and grid submissions will still be allowed on General Purpose Farms
  – Existing GP Farms users will have same priority whether submitting via grid or locally
  – We will make sure appropriate training, documentation and support are available to help users with the transition.
  – Testing currently ongoing with first grid-enabled user SDSS/DES

Outline:

• Why use the Grid?

• Why use Condor?

• Virtual Organizations

• The Open Science Grid

• GP Farms on the Open Science Grid

• Fermigrid

• Access to mass storage

Why the Grid?

• General Purpose Farms have limited resources and equipment budget

• All Fermilab CD resources have mandate from division to interoperate

• Adding a grid interface to the farms enables us to interoperate with the larger clusters at Fermilab (specifically CMS, CDF) and make use of extra resources.

• Negotiation to use resources of the Open Science Grid off-site is in progress as well.

Why Condor?

• Free software (but you can buy support).

• Supported by large team at U. of Wisconsin (and not by Fermilab programmers)

• Widely deployed in multi-hundred node clusters at Fermilab (CDF, CMS).

• New versions of Condor allow Kerberos 5 and x509 authentication

• Comes with Condor-G which simplifies submission of grid jobs

• Condor-C components allow for interoperation of independent Condor pools

• Some of our grid-enabled users take advantage of the extended Condor features, so it is the fastest way to get our users on the grid.

Virtual Organizations

• Each experiment is a Virtual Organization

• Membership is managed by VOMS software (Virtual Organization Membership Service) and VOMRS software (Virtual Organization Membership Registration Service)

• Virtual Organizations have already been created for all major user groups on the General Purpose Farms as part of Fermigrid project.

• We need at least one responsible person from each user group that is using the farms to say who should be members of their virtual organization.

• Groups we have identified:
  – sdss, ktev, miniboone, hypercp, minos, numi, accelerator, ppd_astro, ppd_theory, patriot (run2mc), auger

Open Science Grid

• Continuation of efforts that were begun in Grid3.

• Integration testing has been ongoing since February

• Provisioning and deployment is occurring as we speak.

• General Purpose Farms and CMS will both be Fermilab presences on the Open Science Grid

• 10 Virtual Organizations so far, mostly US-based:
  – USATLAS
  – USCMS
  – SDSS
  – fMRI (functional Magnetic Resonance Imaging, based at Dartmouth)
  – GADU (Applied Genomics, based at Argonne)
  – GRASE (Engineering applications, based at SUNY Buffalo)
  – LIGO
  – CDF
  – STAR
  – iVDGL

• http://www.opensciencegrid.org

Current Fermi GP Farms OSG presence

• Node fngp-osg as gatekeeper and condor master
  – (Dell dual Xeon 3.6 GHz)

• Software comes from the Virtual Data Toolkit
  – http://www.cs.wisc.edu/vdt

• 14 worker nodes as condor pool (fnpc201-214)

• Can successfully run batch jobs submitted locally via Condor and across the grid via Condor-G

• Has passed all validation tests of the Open Science Grid

• Using the extended privilege authorization from the VO Privilege Project
  – Each group can define different roles for their users.
  – We can map whole group to one userid, several userids, or a pool of userids.
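
The three mapping options in the last bullet can be sketched in Python as follows. This is only an illustration of the idea, not the VO Privilege Project or GUMS interface: the class names, the `local_userid` method, the role name, and the userids in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GroupAccount:
    """Map every member of a VO to one shared local userid."""
    userid: str

    def local_userid(self, dn, role=None):
        return self.userid


@dataclass
class RoleAccounts:
    """Map each VO role to its own local userid (several userids per VO)."""
    by_role: dict
    default: str

    def local_userid(self, dn, role=None):
        return self.by_role.get(role, self.default)


@dataclass
class PoolAccounts:
    """Hand out userids from a fixed pool; a given certificate DN keeps its userid."""
    pool: list
    leases: dict = field(default_factory=dict)

    def local_userid(self, dn, role=None):
        if dn not in self.leases:
            if len(self.leases) >= len(self.pool):
                raise RuntimeError("no free userid left in the pool")
            # The next unused pool entry is at index len(self.leases).
            self.leases[dn] = self.pool[len(self.leases)]
        return self.leases[dn]
```

For example, `PoolAccounts(["sdss001", "sdss002"])` would give the first two distinct certificate DNs `sdss001` and `sdss002`, and return the same userid when either DN comes back.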

Current Architecture

• All home directories and staging areas are served off of FNSFO and will be accessible as before

• All OSG sites have $app and $data directories for applications and data transfer; here they are served off of fngp-osg by NFS

• All VDT-related software (globus, condor, etc) served off of fngp-osg

• Grid jobs come in directly to fngp-osg and are farmed out to the 14 condor nodes.

Goals for GP Farms Grid Deployment

• GP Farms is very busy (> 90%)

• Two big productions about to start

• Need to preserve lion's share of CPU cycles for existing users

• Jobs from groups that are not GP Farms users will have only opportunistic use of the farms.
  – Run at lowest priority (10^-6 of regular priority)
  – Limited in how many jobs they can start at once.

• At the moment OSG jobs are confined to a condor pool of 14 slow nodes that weren't otherwise getting used at all.

• GP Farms users will be able to access allocated share of resources whether they come in via grid or not.
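
The opportunistic-use policy above can be illustrated with a toy slot allocator in Python. It is a deliberately simplified sketch, not a model of how Condor negotiates: existing GP Farms groups are served first, and outside groups get only leftover slots, up to a cap. The weight of 10^-6 comes from the slide; the cap value, the function name, and the group names in the usage note are hypothetical.

```python
REGULAR_WEIGHT = 1.0
OPPORTUNISTIC_WEIGHT = 1e-6      # "10^-6 of regular priority" from the slide
MAX_OPPORTUNISTIC_RUNNING = 5    # HYPOTHETICAL cap on jobs started at once


def allocate_slots(free_slots, requests):
    """Grant slots to (group, wanted, is_gp_farms_user) requests.

    Higher-weight requests are served first; non-members are also capped.
    """
    def weight(request):
        _group, _wanted, is_member = request
        return REGULAR_WEIGHT if is_member else OPPORTUNISTIC_WEIGHT

    grants = {}
    # sorted() is stable, so requests of equal weight keep their input order.
    for group, wanted, is_member in sorted(requests, key=lambda r: -weight(r)):
        limit = wanted if is_member else min(wanted, MAX_OPPORTUNISTIC_RUNNING)
        granted = min(limit, free_slots)
        grants[group] = granted
        free_slots -= granted
    return grants
```

With 14 free slots and requests from a guest group (10 slots) and two member groups (8 and 4 slots), the members would be filled first and the guest group would receive the 2 slots left over.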

Current Farms Configuration

[Diagram. Labels shown: FNSFO (FBSNG head node); ENSTORE; GP Farms FBSNG worker nodes (102 currently); ENCP; FBS Submit; NFS; RAID]

Configuration with Grid

[Diagram. Labels shown: FNGP-OSG (gatekeeper); FNPCSRV1 (FBSNG head node); GP Farms FBSNG worker nodes (102 currently); ENSTORE; Condor WN (14 currently); New Condor WN (40, coming this summer); NFS; RAID; FBS Submit; Fermigrid1 (site gatekeeper); Condor submit; Job from OSG; Job from Fermilab]

Fermigrid Interface

• Fermigrid is providing common site services for virtual organization management (VOMS) and user mapping (GUMS)

• These services expected to be online in next month or two.

• All non-Fermi jobs will eventually go through site Fermigrid gatekeeper and be farmed out to the other clusters.

Access to mass storage

• Study currently under way.

• Encp access to Enstore will remain available from the head node.

• Want to open dccp, gridftp, srmcp interfaces to dCache

• Before this is done, more study needed on
  – Authentication mechanisms: can we access mass storage from the worker nodes?
  – Resource load: public dCache would need to expand its disk pool if the demand increases significantly.

Support and Documentation

• http://grid.fnal.gov/fermigrid

• http://www-oss.fnal.gov/scs/public/farms/grid/

• http://www.ivdgl.org/osg-int/

• http://plone.opensciencegrid.org/

• http://www.opensciencegrid.org/

• http://www.cs.wisc.edu/vdt

• http://www.cs.wisc.edu/condor

Things to watch and try

• http://www-oss.fnal.gov/scs/public/farms/grid/ being continuously updated as we know more about what works.

• Hope to add sample Condor jobs shortly

• Those familiar with Condor can log into fngp-osg and try to submit local test jobs now.
  – Source /export/osg/grid/setup.csh to get all the software set up

• Grid job submission won’t work until we get the virtual organizations populated (except for SDSS).

• More presentations coming at these meetings in weeks to come

• Hope to organize a workshop this summer.
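
Until the promised sample jobs are posted, a local test job might look roughly like the output of the Python sketch below, which only builds the text of a Condor submit description file. The keywords used (`universe`, `executable`, `output`, `error`, `log`, `queue`) and the `$(Cluster)`/`$(Process)` macros are standard Condor submit syntax; the helper name, the job name, and the choice of `/bin/hostname` as the test executable are assumptions for illustration, not a recipe from the farms documentation.

```python
def render_submit_description(executable, job_name, count=1):
    """Return the text of a minimal vanilla-universe Condor submit description."""
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        # $(Cluster) and $(Process) are expanded by Condor at submit time,
        # giving each queued job its own output and error files.
        f"output     = {job_name}.$(Cluster).$(Process).out",
        f"error      = {job_name}.$(Cluster).$(Process).err",
        f"log        = {job_name}.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # HYPOTHETICAL test job: report which worker node ran it.
    print(render_submit_description("/bin/hostname", "hello_farm"), end="")
```

The printed text would be saved to a file and passed to `condor_submit` on fngp-osg after sourcing the setup script mentioned above.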
