Critical services in the LHC Computing

Andrea Sciabà
CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it

CHEP 2009, Prague, Czech Republic
21-27 March 2009
Content

• Introduction
• Computing services
• Service readiness metrics
  – Software, service, site readiness
• Critical services by experiment
• Readiness status evaluation
  – Criteria defined by parameters
• Questions addressed to the service managers
• Conclusions
  – Are we ready for data taking?
Introduction

• All LHC experiments work on the WLCG infrastructure
  – Resources distributed over ~140 computing centres
  – Variety of middleware solutions from different Grid projects and providers
• WLCG
  – Coordinates the interaction among experiments, resource providers and middleware developers
  – A certain level of service availability and reliability is agreed in a Memorandum of Understanding (MoU)
Services

• Computing systems are built on a large number of services
  – Developed by the experiments
  – Developed (or supported) by WLCG
  – Common infrastructure site services
• WLCG needs to
  – Periodically evaluate the readiness of services
  – Have monitoring data to assess service availability and reliability (see the sketch below)
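To make the two metrics concrete: a common convention in WLCG reporting computes availability over all time and reliability over the time not covered by scheduled downtime. A minimal Python sketch under that assumption (the function names and example numbers are illustrative, not taken from the talk):

    # Availability vs. reliability from monitoring data.
    # Assumed convention: availability counts all time; reliability
    # excludes scheduled downtime from the denominator.

    def availability(up_hours, total_hours):
        return up_hours / total_hours

    def reliability(up_hours, total_hours, scheduled_down_hours):
        return up_hours / (total_hours - scheduled_down_hours)

    # Example: a 720-hour month with 700 h up and 12 h of scheduled maintenance.
    print(f"availability: {availability(700, 720):.1%}")     # 97.2%
    print(f"reliability:  {reliability(700, 720, 12):.1%}")  # 98.9%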
WLCG services overview: Tier-0

[Layered diagram of the Tier-0 service stack: experiment layer (VO data management, VO workload management, VO central catalogue); middleware layer (FTS, LFC, WMS, SRM, xrootd, VOMS, CE, MyProxy, Dashboard); infrastructure layer (AFS, web servers, e-mail, Savannah, Twiki, CVS); fabric layer (Oracle, CASTOR, batch). The picture does not represent an actual experiment!]
WLCG services overview: Tier-1

[Layered diagram of the Tier-1 service stack: experiment layer (site services, VO local catalogue, VOBOX); middleware layer (FTS, LFC, SRM, xrootd, CE); fabric layer (database, CASTOR/dCache, batch). The picture does not represent an actual experiment!]
Definition of “readiness”

Software readiness
• High-level description of the service available?
• Middleware dependencies and versions defined?
• Code released and packaged correctly?
• Certification process exists?
• Admin guides available?

Service readiness
• Disk, CPU, database, network requirements defined?
• Monitoring criteria described?
• Problem determination procedure documented?
• Support chain defined (2nd/3rd level)?
• Backup/restore procedure defined?

Site readiness
• Suitable hardware used?
• Monitoring implemented?
• Test environment exists?
• Problem determination procedure implemented?
• Automatic configuration implemented?
• Backup procedures implemented and tested?

• Service readiness has been defined as a set of criteria covering three aspects: software, service and site
• Service reliability is not included; it is rather measured a posteriori
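The readiness percentages quoted in the following slides can be read as the fraction of such criteria that a service meets. A minimal Python sketch of that reading (the checklist mirrors the slide, but the scoring itself is an illustrative assumption, not the official WLCG procedure):

    # Readiness as the fraction of checklist criteria satisfied.
    SOFTWARE_CRITERIA = [
        "high-level description available",
        "middleware dependencies and versions defined",
        "code released and packaged correctly",
        "certification process exists",
        "admin guides available",
    ]

    def readiness(answers):
        """Fraction of criteria met; 1.0 corresponds to '100% ready'."""
        return sum(answers.values()) / len(answers)

    # Example: a service meeting 4 of the 5 software criteria is 80% ready.
    answers = dict.fromkeys(SOFTWARE_CRITERIA, True)
    answers["certification process exists"] = False
    print(f"software readiness: {readiness(answers):.0%}")  # 80%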
Critical services

• Each experiment has provided a list of “critical” services
  – Rated from 1 to 10
• A survey on the critical services has been conducted to rate their readiness
  – To see where more effort is required
ALICE critical services

Critical service                           Rank  Comment
AliEn                                      10    ALICE computing framework
Site VO boxes                              10    Site becomes unusable if down
CASTOR and xrootd at Tier-0                10    Stops 1st pass reco (24-hour buffer)
Mass storage at Tier-1                     5     Downtime does not prevent data access
File Transfer Service at Tier-0            7     Stops 2nd pass reco
gLite workload management                  5     Redundant
PROOF at Tier-0 (CERN Analysis Facility)   5     User analysis stops

Rank 10: critical, max downtime 2 hours
Rank 7: serious disruption, max downtime 5 hours
Rank 5: reduced efficiency, max downtime 12 hours
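The rank-to-downtime mapping amounts to a simple escalation rule. A minimal Python sketch (only the three rank/downtime pairs come from the slide; the function itself is an illustrative assumption):

    from datetime import timedelta

    # Maximum tolerable downtime per ALICE criticality rank (from the slide).
    MAX_DOWNTIME = {
        10: timedelta(hours=2),   # critical
        7:  timedelta(hours=5),   # serious disruption
        5:  timedelta(hours=12),  # reduced efficiency
    }

    def exceeds_tolerance(rank, downtime):
        """True if an outage lasts longer than the tolerance for this rank."""
        return downtime > MAX_DOWNTIME[rank]

    # Example: a 3-hour outage of CASTOR/xrootd at Tier-0 (rank 10).
    print(exceeds_tolerance(10, timedelta(hours=3)))  # True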
ATLAS criticality

• 10 (very high): an interruption of these services affects online data-taking operations or stops any offline operations
• 7 (high): an interruption of these services seriously perturbs offline computing operations
• 4 (moderate): an interruption of these services perturbs software development and part of computing operations
ATLAS critical services

Services at Tier-0
• Very high: Oracle (online), DDM central catalogues, LFC
• High: Cavern→T0 transfers, online-offline DB connectivity, CASTOR internal data movement, T0 processing farm, Oracle (offline), FTS, VOMS, Dashboard, Panda/Bamboo, DDM site services
• Moderate: 3D streaming, WMS, SRM/SE, CAF, CVS, Subversion, AFS, build system

Services at Tier-1
• High: LFC, FTS, Oracle
• Moderate: 3D streaming, SRM/SE, CE

Services at Tier-2
• Moderate: SRM/SE, CE

Services elsewhere
• High: AMI database
CMS criticality

• Rank 10: 24x7 on call
• Ranks 8-9: expert call-out
• CMS gives a special meaning to all rank values
CMS critical services

Rank  Services
10    Oracle, CERN SRM, CASTOR, DBS, batch, Kerberos, Cavern→T0 transfer and processing, web “back-ends”
9     CERN FTS, PhEDEx, FroNTier launchpad, AFS, CAF
8     gLite WMS, VOMS, MyProxy, BDII, WAN, ProdManager
7     APT servers, build machines, Tag collector, testbed machines, CMS web server, Twiki
6     SAM, Dashboard, PhEDEx monitoring, Lemon
5     WebTools, e-mail, HyperNews, Savannah, CVS server
4     Linux repository, phone conferencing, valgrind machines
3     Benchmarking machines, Indico
LHCb critical services

Rank  Services
10    CERN VO boxes (DIRAC3 central services), Tier-0 LFC, VOMS
7     Tier-0 SE, Tier-1 VO boxes, SE access from WN, FTS, WN misconfiguration, CE, Conditions DB, LHCb bookkeeping service, Oracle streaming, SAM
5     RB/WMS
3     Tier-1 LFC, Dashboard
CERN WLCG services readiness

Data services
• CASTOR, SRM, FTS, LFC: 100%
• Oracle: no piquet (on-call) service

Computing services
• CE, batch services: no expert piquet service
• gLite WMS + LB: insufficient monitoring and problem detection, no expert piquet service, no backup, doubts on needed resources

Other Grid services
• MyProxy: procedures not fully documented, no expert piquet service
• VOMS: support chain not fully defined (no expert piquet service), problem determination procedure does not cover everything
• BDII: some documentation slightly outdated, no expert piquet service
• VOBOX: no expert piquet service, only sysadmin piquet, but OK for VO boxes
• Dashboard: no certification process, no automatic configuration
• SAM: 100%

Other non-Grid services
• Lemon: ?
• AFS, Kerberos: no certification process at CERN, problem determination procedure not documented, no test environment, no automatic configuration
• Twiki: relies heavily on AFS (service, backend data, backups)
ALICE services readiness

• AliEn: 100%
• VO box for ALICE: 100%
• xrootd: 100%
• PROOF: support via mailing list only; no automatic configuration; no backup (but configuration files are on AFS)
ATLAS services readiness

• DDM central catalogues: problem determination by shifters, solving by experts; configuration must be done by an expert
• Panda/Bamboo: problem determination mostly by experts; no automatic configuration
• DDM site services: certification process in place, but the “preproduction” instance is a subset of the production system; hardware requirements OK for central activities, unknown for analysis; backup via Oracle; monitoring via SLS
• ATLAS Metadata Interface (AMI): lacks a rigid certification process, but has a test infrastructure; no admin guide; backups via Oracle Streams; no special procedure for problem determination; support chain and monitoring being improved
CMS services readiness

• DBS: 100%
• Tier-0: no admin guide, no certification process, constantly evolving
• PhEDEx: problem determination procedure not complete, but improving
• FroNTier: 100%
• Production tools: problem determination procedure not complete, but improving
• WebTools: monitoring should be improved (via Lemon, SLS)
• VOBox: 100%

• All DM/WM services have been in production for more than one year (some for more than four)
• Most services have documented procedures for installation, configuration, start-up and testing; all have contacts
• Backups: managed by IT where needed; everything else is transitory
LHCb services readiness

• DIRAC3 central services: no high-level description; no documented procedures for problem solving
• Tier-1 VOBOX services: no test environment; no documented procedures for problem solving
• Bookkeeping: no high-level description, no middleware dependencies defined; no monitoring, no automatic configuration

Note that DIRAC3 is a relatively new system!
Support at Tier-0/1

• 24x7 support in place everywhere
• VO box support defined by Service Level Agreements in the majority of cases

[Figure: SLA status per site, with entries marked “Final” or “Almost final”]
Observations

• Almost all CERN central services are fully ready
  – A few concerns about WMS monitoring
  – Several services covered only on a “best effort” basis outside working hours (no expert piquet service, but sysadmin piquet)
  – Documentation of some Grid services not fully satisfactory
• ALICE is OK!
• ATLAS is OK, but relies heavily on experts for problem solving and configuration; the impact on analysis is largely unknown
• CMS services basically ready
  – Tier-0 services (data processing) still too “fluid”, being debugged in production during cosmic runs
• LHCb has shortcomings in procedures and documentation
  – But note that DIRAC3 was put in production very recently
Conclusions

• Service reliability is not taken into consideration here
  – Services that are “fully ready” might actually be rather fragile; it depends on the deployment strategies
• Overall, all critical services are in good shape
  – No showstoppers of any kind identified
  – The goal is clearly to fix the few remaining issues by the restart of data taking
• The computing systems of the LHC experiments rely on services that are mature from the point of view of documentation, deployment and support
Backup slides
Detailed answers
WLCG (I)

• BDII (R. Dominques da Silva; see the query sketch after this list)
  – Software stable, some install guides outdated
  – HW requirements understood, monitoring criteria defined, ops/sysadmin procedures in place, support chain well defined
  – Fully quattorized installation, HW adequate
  – Expert coverage outside working hours is best effort
• CE (U. Schwickerath)
  – Reinstallation procedure defined, log files backed up
  – Reliable nodes, test environment in PPS, automatic configuration via CDB, remote syslog and TSM
  – Expert coverage outside working hours is best effort
• LXBATCH (U. Schwickerath)
  – Service description pages need updating, preproduction system for Linux updates
  – Batch system status held in the LSF master; scalability issues to be addressed when >50k jobs
  – Separate LSF test instance
  – Expert coverage outside working hours is best effort
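Since the BDII is essentially an LDAP server publishing Glue-schema information, its health can be probed with any LDAP client. A minimal Python sketch using the ldap3 library (the hostname is an example of a top-level BDII endpoint; port 2170 and base “o=grid” are the standard BDII convention):

    from ldap3 import Server, Connection

    # Query a top-level BDII for published Grid services (Glue schema).
    # The hostname is an example; any top-level BDII works the same way.
    server = Server("lcg-bdii.cern.ch", port=2170)
    conn = Connection(server, auto_bind=True)  # anonymous bind
    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueService)",
        attributes=["GlueServiceType", "GlueServiceEndpoint"],
    )
    for entry in conn.entries[:10]:
        print(entry.GlueServiceType, entry.GlueServiceEndpoint)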
WLCG (II)

• WMS + LB (E. Roche)
  – No real admin guide, but plenty of docs exist
  – Doubts about the VO load estimates; HW adequate for them
  – Problem determination procedures being documented with experience
  – No backup procedure; all services stateful
  – Monitoring only partial
  – Expert coverage outside working hours is best effort
• MyProxy (M. Litmaath)
  – Docs on the home page and in the gLite User Guide
  – HW defined at CERN
  – Partial documentation of procedures
  – A testing procedure exists
• VOMS (S. Traylen)
  – No database reconnects (expected in the next version)
  – Problem determination does not cover everything
  – No out-of-hours coverage
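For context on what a VOMS outage blocks: users obtain VOMS-extended proxies with the standard voms-proxy-init client before any Grid operation. A minimal Python sketch wrapping that client (the VO name and validity are example values):

    import subprocess

    def make_voms_proxy(vo="cms", hours=12):
        """Create a VOMS-extended Grid proxy; fails if VOMS is unreachable."""
        subprocess.run(
            ["voms-proxy-init", "--voms", vo, "--valid", f"{hours}:00"],
            check=True,  # raise CalledProcessError on failure
        )

    make_voms_proxy("cms")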
WLCG (III)

• Dashboard (J. Andreeva)
  – No MW dependencies
  – No certification process
  – HW requirements not fully defined
  – Problem determination procedures not fully defined
ALICE (I)

• AliEn (P. Saiz)
  – All monitoring via MonALISA
  – Support goes via a mailing list
  – Suitable HW used at most sites
• VOBOX (P. Méndez)
  – No user or admin guide
  – Had some problems with the UI on 64-bit nodes
  – No admin guide for 64-bit installations
  – Problem determination procedure written by ALICE
  – Automatic configuration via YAIM
ALICE (II)

• xrootd (L. Betev)
  – Documentation on the web and in Savannah
  – MW dependencies: dCache (emulation), DPM and CASTOR2 (plug-in), xrootd (native)
  – xrootd releases: RPM, tarball, in-place compilation; other interfaces packaged with the corresponding storage solution code
  – Certification depends on the storage solution
  – Admin guides provided by ALICE as how-to’s
  – Monitoring specific to the storage solution; via MonALISA for SE functionality and traffic
  – Problem determination procedures exist for all implementations
  – Support via the site support structure
• PROOF (L. Betev, M. Meoni)
  – SW description with dependencies and admin guide at:
    • http://root.cern.ch/twiki/bin/view/ROOT/PROOF
    • http://aliceinfo/Offline/Activities/Analysis/CAF/index.html
  – Code managed via SVN and distributed as RPMs
  – A developer partition exists for certification
  – HW provided by IT: 15 nodes, 120 cores, 45 TB of disk, no DB, 1 Gb/s links
  – Monitoring via MonALISA and Lemon
  – FAQ at http://aliceinfo/Offline/Activities/Analysis/CAF/index.html
  – No person on call; support via mailing list
  – Critical configuration files kept on AFS
  – No automatic configuration, no backup
ATLAS

• DDM central catalogues (B. Koblitz)
  – Backups via Oracle
  – Problem solving by experts, problem finding by shifters
• DDM site services (S. Campana)
  – Certification is, in effect, done in a production environment
    • Devel, integration (new!), preprod and prod instances
    • Preprod is a portion of the prod instance
  – Hardware requirements fully satisfied for central activities, but not necessarily for analysis (difficult to quantify at the moment)
CMS (I)

• DM/WM tools (S. Metson)
  – PhEDEx, CRAB, ProdAgent not heavily tied to any MW version
  – All software is packaged using RPM
  – All tools are properly tested, but:
    • Tier-0 tools are constantly evolving (and have no admin guide)
    • DBS certification needs improving
  – Core monitoring and problem determination procedures are in place, but could improve
  – Everything that needs backups is on Oracle
CMS (II)

• VOBOX (P. Bittencourt)
  – Full documentation on Twiki (including an admin guide); dependencies always defined; using CVS and RPM
  – Certification testbed plus preproduction testbed for all WebTools services
    • Services tested on VMs running on testbed nodes
  – Extending Lemon usage for alarms
  – Some services have operators for maintenance and support, with testing procedures
  – Requirements specified for architecture (64-bit), disk space, memory
  – No automatic configuration
  – Support goes via Patricia and – if needed – to …
LHCb

• DIRAC3 central services (M. Seco, R. Santinelli)
  – Only access procedures documented, no admin guide
  – Problem determination procedure not fully documented
  – Backup procedures defined in the SLA
  – A DIRAC3 test environment exists
• DIRAC3 site services (R. Nandakumar, R. Santinelli)
  – No admin guide
  – No test environment
• Bookkeeping (Z. Mathe, R. Santinelli)
  – Support chain defined: 1) user support, 2) Grid expert on duty or DIRAC developers, 3) BK expert
  – Service runs on a mid-range machine and has an integration instance for testing
  – Back-end on the Oracle LHCb RAC
  – Is monitored