Critical services in the LHC Computing

Andrea Sciabà
CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it

CHEP 2009, Prague, Czech Republic
21-27 March 2009
Content

• Introduction
• Computing services
• Service readiness metrics
  – Software, service, site readiness
• Critical services by experiment
• Readiness status evaluation
  – Criteria defined by parameters
• Questions addressed to the service managers
• Conclusions
  – Are we ready for data taking?
Introduction

• All LHC experiments work on the WLCG infrastructure
  – Resources distributed over ~140 computing centres
  – Variety of middleware solutions from different Grid projects and providers
• WLCG
  – Coordinates the interaction among experiments, resource providers and middleware developers
  – A certain level of service availability and reliability is agreed in a Memorandum of Understanding (MoU)
Services

• Computing systems are built on a large number of services
  – Developed by the experiments
  – Developed (or supported) by WLCG
  – Common infrastructure site services
• WLCG needs to
  – Periodically evaluate the readiness of services
  – Have monitoring data to assess service availability and reliability (see the sketch below)
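To make the two metrics concrete: a common convention in WLCG reporting computes availability over all time and reliability over the time not covered by scheduled downtime. A minimal Python sketch under that assumption (the function names and example numbers are illustrative, not taken from the talk):

    # Availability vs. reliability from monitoring data.
    # Assumed convention: availability counts all time; reliability
    # excludes scheduled downtime from the denominator.

    def availability(up_hours, total_hours):
        return up_hours / total_hours

    def reliability(up_hours, total_hours, scheduled_down_hours):
        return up_hours / (total_hours - scheduled_down_hours)

    # Example: a 720-hour month with 700 h up and 12 h of scheduled maintenance.
    print(f"availability: {availability(700, 720):.1%}")     # 97.2%
    print(f"reliability:  {reliability(700, 720, 12):.1%}")  # 98.9%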
WLCG services overview: Tier-0

[Layered diagram of the Tier-0 service stack: experiment layer (VO data management, VO workload management, VO central catalogue); middleware layer (FTS, LFC, WMS, SRM, xrootd, VOMS, CE, MyProxy, Dashboard); infrastructure layer (AFS, web servers, e-mail, Savannah, Twiki, CVS); fabric layer (Oracle, CASTOR, batch). The picture does not represent an actual experiment!]
WLCG services overview: Tier-1

[Layered diagram of the Tier-1 service stack: experiment layer (site services, VO local catalogue, VOBOX); middleware layer (FTS, LFC, SRM, xrootd, CE); fabric layer (database, CASTOR/dCache, batch). The picture does not represent an actual experiment!]
Definition of “readiness”

Software readiness
• High-level description of the service available?
• Middleware dependencies and versions defined?
• Code released and packaged correctly?
• Certification process exists?
• Admin guides available?

Service readiness
• Disk, CPU, database, network requirements defined?
• Monitoring criteria described?
• Problem determination procedure documented?
• Support chain defined (2nd/3rd level)?
• Backup/restore procedure defined?

Site readiness
• Suitable hardware used?
• Monitoring implemented?
• Test environment exists?
• Problem determination procedure implemented?
• Automatic configuration implemented?
• Backup procedures implemented and tested?

• Service readiness has been defined as a set of criteria covering three aspects: software, service and site
• Service reliability is not included; it is rather measured a posteriori
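The readiness percentages quoted in the following slides can be read as the fraction of such criteria that a service meets. A minimal Python sketch of that reading (the checklist mirrors the slide, but the scoring itself is an illustrative assumption, not the official WLCG procedure):

    # Readiness as the fraction of checklist criteria satisfied.
    SOFTWARE_CRITERIA = [
        "high-level description available",
        "middleware dependencies and versions defined",
        "code released and packaged correctly",
        "certification process exists",
        "admin guides available",
    ]

    def readiness(answers):
        """Fraction of criteria met; 1.0 corresponds to '100% ready'."""
        return sum(answers.values()) / len(answers)

    # Example: a service meeting 4 of the 5 software criteria is 80% ready.
    answers = dict.fromkeys(SOFTWARE_CRITERIA, True)
    answers["certification process exists"] = False
    print(f"software readiness: {readiness(answers):.0%}")  # 80%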
Critical services

• Each experiment has provided a list of “critical” services
  – Rated from 1 to 10
• A survey on the critical services has been conducted to rate their readiness
  – To see where more effort is required
ALICE critical services

Critical service                           Rank  Comment
AliEn                                      10    ALICE computing framework
Site VO boxes                              10    Site becomes unusable if down
CASTOR and xrootd at Tier-0                10    Stops 1st pass reco (24-hour buffer)
Mass storage at Tier-1                     5     Downtime does not prevent data access
File Transfer Service at Tier-0            7     Stops 2nd pass reco
gLite workload management                  5     Redundant
PROOF at Tier-0 (CERN Analysis Facility)   5     User analysis stops

Rank 10: critical, max downtime 2 hours
Rank 7: serious disruption, max downtime 5 hours
Rank 5: reduced efficiency, max downtime 12 hours
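The rank-to-downtime mapping amounts to a simple escalation rule. A minimal Python sketch (only the three rank/downtime pairs come from the slide; the function itself is an illustrative assumption):

    from datetime import timedelta

    # Maximum tolerable downtime per ALICE criticality rank (from the slide).
    MAX_DOWNTIME = {
        10: timedelta(hours=2),   # critical
        7:  timedelta(hours=5),   # serious disruption
        5:  timedelta(hours=12),  # reduced efficiency
    }

    def exceeds_tolerance(rank, downtime):
        """True if an outage lasts longer than the tolerance for this rank."""
        return downtime > MAX_DOWNTIME[rank]

    # Example: a 3-hour outage of CASTOR/xrootd at Tier-0 (rank 10).
    print(exceeds_tolerance(10, timedelta(hours=3)))  # True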
ATLAS criticality

• 10 (very high): an interruption of these services affects online data-taking operations or stops any offline operations
• 7 (high): an interruption of these services seriously perturbs offline computing operations
• 4 (moderate): an interruption of these services perturbs software development and part of computing operations
ATLAS critical services

Services at Tier-0
• Very high: Oracle (online), DDM central catalogues, LFC
• High: Cavern→T0 transfers, online-offline DB connectivity, CASTOR internal data movement, T0 processing farm, Oracle (offline), FTS, VOMS, Dashboard, Panda/Bamboo, DDM site services
• Moderate: 3D streaming, WMS, SRM/SE, CAF, CVS, Subversion, AFS, build system

Services at Tier-1
• High: LFC, FTS, Oracle
• Moderate: 3D streaming, SRM/SE, CE

Services at Tier-2
• Moderate: SRM/SE, CE

Services elsewhere
• High: AMI database
CMS criticality

• Rank 10: 24x7 on call
• Ranks 8-9: expert call-out
• CMS gives a special meaning to all rank values
CMS critical services

Rank  Services
10    Oracle, CERN SRM, CASTOR, DBS, batch, Kerberos, Cavern→T0 transfer and processing, web “back-ends”
9     CERN FTS, PhEDEx, FroNTier launchpad, AFS, CAF
8     gLite WMS, VOMS, MyProxy, BDII, WAN, ProdManager
7     APT servers, build machines, Tag collector, testbed machines, CMS web server, Twiki
6     SAM, Dashboard, PhEDEx monitoring, Lemon
5     WebTools, e-mail, HyperNews, Savannah, CVS server
4     Linux repository, phone conferencing, valgrind machines
3     Benchmarking machines, Indico
LHCb critical services

Rank  Services
10    CERN VO boxes (DIRAC3 central services), Tier-0 LFC, VOMS
7     Tier-0 SE, Tier-1 VO boxes, SE access from WN, FTS, WN misconfiguration, CE, Conditions DB, LHCb bookkeeping service, Oracle streaming, SAM
5     RB/WMS
3     Tier-1 LFC, Dashboard
CERN WLCG services readiness

Data services
• CASTOR, SRM, FTS, LFC: 100%
• Oracle: no piquet (on-call) service

Computing services
• CE, batch services: no expert piquet service
• gLite WMS + LB: insufficient monitoring and problem detection, no expert piquet service, no backup, doubts on needed resources

Other Grid services
• MyProxy: procedures not fully documented, no expert piquet service
• VOMS: support chain not fully defined (no expert piquet service), problem determination procedure does not cover everything
• BDII: some documentation slightly outdated, no expert piquet service
• VOBOX: no expert piquet service, only sysadmin piquet, but OK for VO boxes
• Dashboard: no certification process, no automatic configuration
• SAM: 100%

Other non-Grid services
• Lemon: ?
• AFS, Kerberos: no certification process at CERN, problem determination procedure not documented, no test environment, no automatic configuration
• Twiki: relies heavily on AFS (service, backend data, backups)
ALICE services readiness

• AliEn: 100%
• VO box for ALICE: 100%
• xrootd: 100%
• PROOF: support via mailing list only; no automatic configuration; no backup (but configuration files are on AFS)
ATLAS services readiness

• DDM central catalogues: problem determination by shifters, solving by experts; configuration must be done by an expert
• Panda/Bamboo: problem determination mostly by experts; no automatic configuration
• DDM site services: certification process in place, but the “preproduction” instance is a subset of the production system; hardware requirements OK for central activities, unknown for analysis; backup via Oracle; monitoring via SLS
• ATLAS Metadata Interface (AMI): lacks a rigid certification process, but has a test infrastructure; no admin guide; backups via Oracle Streams; no special procedure for problem determination; support chain and monitoring being improved
CMS services readiness

• DBS: 100%
• Tier-0: no admin guide, no certification process, constantly evolving
• PhEDEx: problem determination procedure not complete, but improving
• FroNTier: 100%
• Production tools: problem determination procedure not complete, but improving
• WebTools: monitoring should be improved (via Lemon, SLS)
• VOBox: 100%

• All DM/WM services have been in production for more than one year (some for more than four)
• Most services have documented procedures for installation, configuration, start-up and testing; all have contacts
• Backups: managed by IT where needed; everything else is transitory
LHCb services readiness

• DIRAC3 central services: no high-level description; no documented procedures for problem solving
• Tier-1 VOBOX services: no test environment; no documented procedures for problem solving
• Bookkeeping: no high-level description, no middleware dependencies defined; no monitoring, no automatic configuration

Note that DIRAC3 is a relatively new system!
Support at Tier-0/1

• 24x7 support in place everywhere
• VO box support defined by Service Level Agreements in the majority of cases

[Figure: SLA status per site, with entries marked “Final” or “Almost final”]
Observations

• Almost all CERN central services are fully ready
  – A few concerns about WMS monitoring
  – Several services covered only on a “best effort” basis outside working hours (no expert piquet service, but sysadmin piquet)
  – Documentation of some Grid services not fully satisfactory
• ALICE is OK!
• ATLAS is OK, but relies heavily on experts for problem solving and configuration; the impact on analysis is largely unknown
• CMS services basically ready
  – Tier-0 services (data processing) still too “fluid”, being debugged in production during cosmic runs
• LHCb has shortcomings in procedures and documentation
  – But note that DIRAC3 was put in production very recently
Conclusions

• Service reliability is not taken into consideration here
  – Services that are “fully ready” might actually be rather fragile; it depends on the deployment strategies
• Overall, all critical services are in good shape
  – No showstoppers of any kind identified
  – The goal is clearly to fix the few remaining issues by the restart of data taking
• The computing systems of the LHC experiments rely on services that are mature from the point of view of documentation, deployment and support
Backup slides
Detailed answers
WLCG (I)

• BDII (R. Dominques da Silva; see the query sketch after this list)
  – Software stable, some install guides outdated
  – HW requirements understood, monitoring criteria defined, ops/sysadmin procedures in place, support chain well defined
  – Fully quattorized installation, HW adequate
  – Expert coverage outside working hours is best effort
• CE (U. Schwickerath)
  – Reinstallation procedure defined, log files backed up
  – Reliable nodes, test environment in PPS, automatic configuration via CDB, remote syslog and TSM
  – Expert coverage outside working hours is best effort
• LXBATCH (U. Schwickerath)
  – Service description pages need updating, preproduction system for Linux updates
  – Batch system status held in the LSF master; scalability issues to be addressed when >50k jobs
  – Separate LSF test instance
  – Expert coverage outside working hours is best effort
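Since the BDII is essentially an LDAP server publishing Glue-schema information, its health can be probed with any LDAP client. A minimal Python sketch using the ldap3 library (the hostname is an example of a top-level BDII endpoint; port 2170 and base “o=grid” are the standard BDII convention):

    from ldap3 import Server, Connection

    # Query a top-level BDII for published Grid services (Glue schema).
    # The hostname is an example; any top-level BDII works the same way.
    server = Server("lcg-bdii.cern.ch", port=2170)
    conn = Connection(server, auto_bind=True)  # anonymous bind
    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueService)",
        attributes=["GlueServiceType", "GlueServiceEndpoint"],
    )
    for entry in conn.entries[:10]:
        print(entry.GlueServiceType, entry.GlueServiceEndpoint)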
WLCG (II)

• WMS + LB (E. Roche)
  – No real admin guide, but plenty of docs exist
  – Doubts about the VO load estimates; HW adequate for them
  – Problem determination procedures being documented with experience
  – No backup procedure; all services stateful
  – Monitoring only partial
  – Expert coverage outside working hours is best effort
• MyProxy (M. Litmaath)
  – Docs on the home page and in the gLite User Guide
  – HW defined at CERN
  – Partial documentation of procedures
  – A testing procedure exists
• VOMS (S. Traylen)
  – No database reconnects (expected in the next version)
  – Problem determination does not cover everything
  – No out-of-hours coverage
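For context on what a VOMS outage blocks: users obtain VOMS-extended proxies with the standard voms-proxy-init client before any Grid operation. A minimal Python sketch wrapping that client (the VO name and validity are example values):

    import subprocess

    def make_voms_proxy(vo="cms", hours=12):
        """Create a VOMS-extended Grid proxy; fails if VOMS is unreachable."""
        subprocess.run(
            ["voms-proxy-init", "--voms", vo, "--valid", f"{hours}:00"],
            check=True,  # raise CalledProcessError on failure
        )

    make_voms_proxy("cms")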
WLCG (III)

• Dashboard (J. Andreeva)
  – No MW dependencies
  – No certification process
  – HW requirements not fully defined
  – Problem determination procedures not fully defined
ALICE (I)

• AliEn (P. Saiz)
  – All monitoring via MonALISA
  – Support goes via a mailing list
  – Suitable HW used at most sites
• VOBOX (P. Méndez)
  – No user or admin guide
  – Had some problems with the UI on 64-bit nodes
  – No admin guide for 64-bit installations
  – Problem determination procedure written by ALICE
  – Automatic configuration via YAIM
ALICE (II)

• xrootd (L. Betev)
  – Documentation on the web and in Savannah
  – MW dependencies: dCache (emulation), DPM and CASTOR2 (plug-in), xrootd (native)
  – xrootd releases: RPM, tarball, in-place compilation; other interfaces packaged with the corresponding storage solution code
  – Certification depends on the storage solution
  – Admin guides provided by ALICE as how-to’s
  – Monitoring specific to the storage solution; via MonALISA for SE functionality and traffic
  – Problem determination procedures exist for all implementations
  – Support via the site support structure
• PROOF (L. Betev, M. Meoni)
  – SW description with dependencies and admin guide at:
    • http://root.cern.ch/twiki/bin/view/ROOT/PROOF
    • http://aliceinfo/Offline/Activities/Analysis/CAF/index.html
  – Code managed via SVN and distributed as RPMs
  – A developer partition exists for certification
  – HW provided by IT: 15 nodes, 120 cores, 45 TB of disk, no DB, 1 Gb/s links
  – Monitoring via MonALISA and Lemon
  – FAQ at http://aliceinfo/Offline/Activities/Analysis/CAF/index.html
  – No person on call; support via mailing list
  – Critical configuration files kept on AFS
  – No automatic configuration, no backup
ATLAS

• DDM central catalogues (B. Koblitz)
  – Backups via Oracle
  – Problem solving by experts, problem finding by shifters
• DDM site services (S. Campana)
  – Certification is, in effect, done in a production environment
    • Devel, integration (new!), preprod and prod instances
    • Preprod is a portion of the prod instance
  – Hardware requirements fully satisfied for central activities, but not necessarily for analysis (difficult to quantify at the moment)
CMS (I)

• DM/WM tools (S. Metson)
  – PhEDEx, CRAB, ProdAgent not heavily tied to any MW version
  – All software is packaged using RPM
  – All tools are properly tested, but:
    • Tier-0 tools are constantly evolving (and have no admin guide)
    • DBS certification needs improving
  – Core monitoring and problem determination procedures are in place, but could improve
  – Everything that needs backups is on Oracle
CMS (II)

• VOBOX (P. Bittencourt)
  – Full documentation on Twiki (including an admin guide); dependencies always defined; using CVS and RPM
  – Certification testbed plus preproduction testbed for all WebTools services
    • Services tested on VMs running on testbed nodes
  – Extending Lemon usage for alarms
  – Some services have operators for maintenance and support, with testing procedures
  – Requirements specified for architecture (64-bit), disk space, memory
  – No automatic configuration
  – Support goes via Patricia and – if needed – to …
LHCb

• DIRAC3 central services (M. Seco, R. Santinelli)
  – Only access procedures documented, no admin guide
  – Problem determination procedure not fully documented
  – Backup procedures defined in the SLA
  – A DIRAC3 test environment exists
• DIRAC3 site services (R. Nandakumar, R. Santinelli)
  – No admin guide
  – No test environment
• Bookkeeping (Z. Mathe, R. Santinelli)
  – Support chain defined: 1) user support, 2) Grid expert on duty or DIRAC developers, 3) BK expert
  – Service runs on a mid-range machine and has an integration instance for testing
  – Back-end on the Oracle LHCb RAC
  – Is monitored