LCG Operations

Page 1: LCG Operations

Ian Bird

LCG Deployment Area Manager & EGEE Operations Manager

IT Department, CERN

Presentation to HEPiX, 22nd October 2004

LCG Operations

Page 2: LCG Operations

22 October 2004

Grid Operations: Scope of Responsibilities

• Certification activities
– Certification of middleware as a coherent set of services
– Preparing that package for deployment

• Operational and support activities
– Coordinating and supporting the deployment to collaborating computer centres
– Coordinating Grid Operations activities
– Providing operational support
– Providing operational security support
– Providing user support
– CA management
– VO registration and management

• Policy
– CA and user registration policies
– Operational policy
– Security policies
– Resource usage and access policies

Page 3: LCG Operations

LHC Computing Model (simplified!!)

• Tier-0 – the accelerator centre
– Filter → raw data; reconstruction → event summary data (ESD)
– Record the master copy of raw and ESD

• Tier-1
– Managed mass storage – permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases → grid-enabled data service
– Data-heavy (ESD-based) analysis
– Re-processing of raw data
– National, regional support
– “Online” to the data acquisition process → high availability, long-term commitment

• Tier-2
– Well-managed, grid-enabled disk storage
– End-user analysis – batch and interactive
– Simulation

[Diagram: Tier-1 centres, Tier-2 centres, small centres, desktops, portables. Site labels: RAL, IN2P3, FNAL, USC, Krakow, CIEMAT, Rome, Taipei, LIP, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC, NIKHEF, TRIUMF, CNAF, FZK, BNL, PIC, ICEPP, Nordic, ….]

Page 4: LCG Operations

LCG-2

25 universities, 4 national labs, 2800 CPUs

Grid3

30 sites, 3200 CPUs

Total: 78 sites, ~9000 CPUs, 6.5 PByte

Page 5: LCG Operations

Operations services for LCG

• Operational support
– Hierarchical model
  • CERN acts as 1st level support for the Tier 1 centres
  • Tier 1 centres provide 1st level support for associated Tier 2s
    – Tier 1 “Primary sites”
– Grid Operations Centres (GOC)
  • Provide operational monitoring, troubleshooting, coordination of incident response, etc.
  • RAL (UK) led sub-project to prototype a GOC
  • 2nd GOC in Taipei now in prototype

• User support
– Central model
  • FZK provides user support portal
    – Problem tracking system web-based and available to all LCG participants
  • Experiments provide triage of problems
– CERN team provide in-depth support and support for integration of experiment sw with grid middleware
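The hierarchical operational support model on this slide can be sketched as a parent lookup: each Tier 2 is attached to a Tier 1, and each Tier 1 is attached to CERN. This is only an illustration. The site names (`tier1-x`, `tier2-a`, and so on) and the function names are hypothetical and do not describe the real LCG site associations.

```python
from __future__ import annotations

# Hypothetical sites: only the tier structure is taken from the slide.
# Each entry maps a site to the centre that gives it 1st level support.
SUPPORT_PARENT = {
    "tier2-a": "tier1-x",
    "tier2-b": "tier1-x",
    "tier2-c": "tier1-y",
    "tier1-x": "CERN",
    "tier1-y": "CERN",
}


def escalation_path(site: str) -> list[str]:
    """Return the chain of support providers for a site, nearest first."""
    path = []
    current = site
    while current in SUPPORT_PARENT:
        current = SUPPORT_PARENT[current]
        path.append(current)
    return path


def first_level_support(site: str) -> str | None:
    """Return the centre giving 1st level support, or None at the top."""
    path = escalation_path(site)
    return path[0] if path else None
```

With this table, a problem at `tier2-a` goes first to `tier1-x` and only then to CERN, while a problem at a Tier 1 goes straight to CERN.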

Page 6: LCG Operations

Support Teams within LCG

• Global Grid User Support (GGUS): single point of contact, coordination of user support
• CERN Deployment Support (CDS): middleware problems
• Grid Operations Center (GOC): operations problems
• Resource Centers (RC): hardware problems
• Experiment Specific User Support (ESUS): software problems
• User communities:
– 4 LHC experiments (ALICE, ATLAS, CMS, LHCb)
– 4 non-LHC experiments (BaBar, CDF, COMPASS, D0)
– Other communities (VOs)
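The division of labour in this diagram amounts to a dispatch table: GGUS is the single point of contact and hands each problem to the team responsible for that problem type. The sketch below assumes a ticket has already been classified. The `Problem` enum and `route_ticket` function are hypothetical names chosen for illustration, not part of any GGUS interface.

```python
from enum import Enum


class Problem(Enum):
    """Problem categories named on the slide."""
    MIDDLEWARE = "middleware"
    OPERATIONS = "operations"
    HARDWARE = "hardware"
    SOFTWARE = "software"


# Team responsible for each category, as paired in the diagram.
ROUTING = {
    Problem.MIDDLEWARE: "CDS",   # CERN Deployment Support
    Problem.OPERATIONS: "GOC",   # Grid Operations Center
    Problem.HARDWARE: "RC",      # Resource Centers
    Problem.SOFTWARE: "ESUS",    # Experiment Specific User Support
}


def route_ticket(problem: Problem) -> str:
    """Return the support team a classified ticket is handed to."""
    return ROUTING[problem]
```

A table like this keeps the routing policy in one place, so changing which team handles a category does not touch the dispatch logic.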

Page 7: LCG Operations

Experiences in deployment

• LCG covers many sites (>70) now – both large and small
– Large sites – existing infrastructures – need to add on grid interfaces etc.
– Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
– Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures
  • A lot of effort had to be invested in this area

• There are many problems – but in the end we are quite successful
– System is stable and reliable
– System is used in production
– System is reasonably easy to install now – 60 sites
– Now have a basis on which to incrementally build essential functionality

• This infrastructure forms the basis of the initial EGEE production service

Page 8: LCG Operations

• LCG Operations → EGEE Operations

Page 9: LCG Operations

What is EGEE? (I)

• EGEE (Enabling Grids for E-science in Europe) is a seamless Grid infrastructure for the support of scientific research, which:
– Integrates current national, regional and thematic Grid efforts
– Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location

[Diagram labels: Applications, Geant network, Grid infrastructure]

Page 10: LCG Operations

What is EGEE? (II)

• 70 leading institutions in 28 countries, federated in regional Grids

• 32 M Euros EU funding (2004-5), O(100 M) total budget

• Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled)

• ~ 300 persons

Page 11: LCG Operations

EGEE Activities

• Emphasis on operating a production grid and supporting the end-users

• 48 % service activities (Grid Operations, Support and Management, Network Resource Provision)

• 24 % middleware re-engineering (Quality Assurance, Security, Network Services Development)

• 28 % networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)

Page 12: LCG Operations

LCG and EGEE Operations

• EGEE is funded to operate and support a research grid infrastructure in Europe

• The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of LCG service
– LCG includes US and Asia-Pacific, EGEE includes other sciences
– Substantial part of infrastructure common to both

• LCG Deployment Manager is the EGEE Operations Manager
– CERN team (Operations Management Centre) provides coordination, management, and 2nd level support

• Support activities are expanded with the provision of
– Core Infrastructure Centres (CIC) (4)
– Regional Operations Centres (ROC) (9)
– ROCs are coordinated by Italy, outside of CERN (which has no ROC)

Page 13: LCG Operations

LCG → EGEE in Europe

• User support:
– Becomes hierarchical
– Through the Regional Operations Centres (ROC)
  • Act as front-line support for user and operations issues
  • Provide local knowledge and adaptations

• Coordination:
– At CERN (Operations Management Centre) and CIC for HEP

• Operational support:
– The LCG GOC is the model for the EGEE CICs
  • CICs replace the European GOC at RAL
  • Also run essential infrastructure services
  • Provide support for other (non-LHC) applications
  • Provide 2nd level support to ROCs

Page 14: LCG Operations

Summary

• Data challenges – demonstrated:
– Many middleware functional and performance issues (documented)
– Main problem is service stability
  • Site fabric management, configuration, change control
  • Etc.
– Grid3 reports similar problems …
– User support process needs improvement

• Now moving into continuous production + service & data challenges

Page 15: LCG Operations

How to move forward – 1

• Build an agreed operations model for the next year
– Should be able to evolve

• Operations/Fabric workshop Nov 2 – 4
– HEPiX ½ day – input from some sites and Grid3/OSG on their plans
– Documenting use-cases (based on experience), propose support mechanisms for each
– EGEE SA1 infrastructure
– 5 working groups:
  • Operations support
  • User support
  • Operational security
  • Fabric management issues
  • SW needs and tools requirements from operations

• Need fabric management training for many sites

Page 16: LCG Operations

Some issues

• Resource Centres:
– Large sites – have operations staff and/or on-call support
– Small sites – have no on-call and often little support at all

• Regional Operations Centres:
– Probably do not provide after-hours or on-call support. If they did, the support model could rely more on the ROCs. However, it is clear that most ROCs will not have this level of support.

• Core Infrastructure Centres:
– Must have on-call support after-hours
  • To be rotated through the 4 or 5 active CICs

Thus, a basic question to answer is how much power or control can the CICs have in order to deal with problems when staff at RCs and ROCs are not available?
– Either CICs have rights to manage critical services on sites where there is no support, or
– Have the right to remove “broken” sites and services from the infrastructure.
  • Likely that we have all combinations of these …

Page 17: LCG Operations

Immediate actions

• Weekly operations meeting (Monday afternoon)
– Weekly reports from ROCs, CICs, other Tier 1s etc.

• Operations Manager
– Role rotates through 4 EGEE CICs – manage problem reporting and follow up
– Hand over responsibility in weekly meeting

• Operational security team
– Being set up – led by Ian Neilson, strong collaboration between US and Europe on these issues.
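The weekly hand-over of the Operations Manager role can be illustrated with a simple round-robin over the four CICs, advancing once per week. This is a sketch under stated assumptions: the CIC labels and the rotation start date are hypothetical, and the slides do not say which CIC takes which week.

```python
from datetime import date

# Hypothetical labels: the slide only says the role rotates through 4 CICs.
CICS = ["CIC-A", "CIC-B", "CIC-C", "CIC-D"]

# Hypothetical start of the rotation, chosen to fall on a Monday
# because the hand-over happens in the Monday operations meeting.
ROTATION_START = date(2004, 11, 1)


def on_duty(day: date) -> str:
    """Return the CIC holding the Operations Manager role on a given day."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return CICS[weeks_elapsed % len(CICS)]
```

Each CIC holds the role from one Monday to the next, and the cycle repeats every four weeks.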

