ALICE T1/T2 workshop 4-6 June 2013 CCIN2P3 Lyon Famous last words.


ALICE T1/T2 workshop
4-6 June 2013
CCIN2P3 Lyon

Famous last words

2

ALICE T1/T2 workshops

• Yearly event
  • 2011 – CERN
  • 2012 – KIT (Germany)
  • 2013 – CCIN2P3 (Lyon)
• Aims at gathering middleware and storage software developers, grid operation and network experts, and site administrators involved in ALICE computing activities

3

Some stats of the Lyon workshop

• 46 registered participants, 45 attended
• Good attendance; clearly these venues are still popular and needed
• 24 presentations over 5 sessions
  • 9 general, on operations, software and procedures
  • 15 site-specific
• Appropriate number of coffee and lunch breaks, social events
• Ample time for questions (numerous) and discussion (lively), true workshop style

4

Themes

• Operations summary
• WLCG middleware/services
• Monitoring
• Networking: LHCONE and IPv6
• Storage: xrootd v4.0 and EOS
• CVMFS and AliRoot
• Site operations, upgrades and (new) projects, gripes (actually none…)

5

Messages digest from the presentations

• The original slides are available at the workshop indico page
• Operations
  • Successful year for ALICE and Grid operations – smooth and generally problem free; incident handling is mature and fast
  • No changes foreseen to the operations principles and communication channels
  • 2013/2014 (LHC LS1) will be years of data reprocessing and infrastructure upgrade
  • The focus is on analysis – how to make it more efficient

6

Messages (2)

• WLCG middleware
  • CVMFS installed on many sites; leverage ALICE deployment and tuning through the existing TF
  • WLCG VO-box is there and everyone should update
  • All EMI-3 products can be used
  • SHA-2 is on the horizon, services must be made compatible
  • glExec – hey, it is still alive!
• Agile Infrastructure – IaaS, SaaS (for now)
  • OpenStack (Cinder, Keystone, Nova, Horizon, Glance)
  • Management through Puppet (Foreman, MPM, PuppetDB, Hiera, git) … and Facter
  • Storage with Ceph
  • All of the above – prototyping and tests, ramping up

7

Messages (3)

• Site dashboard
  • http://alimonitor.cern.ch/siteinfo/issues.jsp
  • Get on the above link and start fixing if you are on the list
• LHCONE
  • The figure speaks for itself
  • All T2s should get involved
  • Instructions and expert lists are in the presentation

8

Messages (4)

• IPv6 and ALICE
  • IPv4 address space almost depleted, IPv6 is being deployed (CERN, 3 ALICE sites already)
  • Not all services are IPv6-ready – testing and adjustment are needed (see the sketch below)
  • Cool history of the network bandwidth evolution
• Xrootd 4.0.0
  • Complete client rewrite, new caching, non-blocking requests (client call-back), new user classes for metadata and data operations, IPv6 ready
  • Impressive speedup for large operations
  • API redesigned, no backward compatibility, some CLI commands change names
  • ROOT plugin ready and being tested
  • Mid-July release target
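The slides contain no code; as a rough illustration of what "IPv6-ready" means in practice, the sketch below checks whether a service host publishes an IPv6 address and accepts a TCP connection over it. The host name is a hypothetical placeholder (the port is the xrootd default), and a real validation would of course also exercise the service protocol itself.

```python
#!/usr/bin/env python3
"""Minimal IPv6-readiness probe for a service endpoint (illustrative sketch)."""
import socket

HOST = "voalice.example.org"   # hypothetical service host, not from the slides
PORT = 1094                    # example port (xrootd default)

def ipv6_reachable(host, port, timeout=5.0):
    """Return True if the host resolves to an IPv6 address we can connect to."""
    try:
        # Ask only for IPv6 (AAAA) socket addresses.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no IPv6 address published
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False

if __name__ == "__main__":
    print(f"{HOST}:{PORT} IPv6 reachable: {ipv6_reachable(HOST, PORT)}")
```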

9

Messages (5)

• EOS
  • Main disk storage manager at CERN: 45 PB deployed, 32 PB used (9.9/8.3 for ALICE)
  • Designed to work with cheap storage servers, uses software RAID (RAIN), ppm probability of file loss (see the toy calculation below)
  • Impressive array of control and service tools (operations in mind)
  • Even more impressive benchmarks…
  • Site installation – read carefully the pros/cons to decide if it is good for you
  • Support – best effort, xrootd type
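The "ppm probability of file loss" figure is quoted from the presentation; the toy calculation below only illustrates the kind of estimate behind such a number for an erasure-coded (RAIN-like) layout. The stripe geometry, disk failure rate and repair window are assumed values for illustration, not EOS parameters.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope file-loss probability for an erasure-coded layout."""
from math import comb

def loss_probability(n_stripes, n_parity, p_disk):
    """Probability that more than `n_parity` of the `n_stripes` disks holding a
    file fail within one repair window (independent failures assumed)."""
    return sum(
        comb(n_stripes, k) * p_disk**k * (1 - p_disk)**(n_stripes - k)
        for k in range(n_parity + 1, n_stripes + 1)
    )

if __name__ == "__main__":
    # Assumed numbers: 4 data + 2 parity stripes, 3% annual disk failure rate,
    # one day to re-create a lost stripe.
    annual_failure_rate = 0.03
    repair_window_days = 1.0
    p = annual_failure_rate * repair_window_days / 365.0
    prob = loss_probability(n_stripes=6, n_parity=2, p_disk=p)
    print(f"per-window loss probability: {prob:.3e}")
```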

10

Messages (6)

• ALICE production and analysis software
  • AliRoot is “one software to rule them all” in ALICE offline
  • >150 developers; analysis: ~1M SLOC; reconstruction, simulation, calibration, alignment, visualization: ~1.4M SLOC; supported on many platforms and flavors
  • In development since 1(8)998
  • Sophisticated MC framework with embedded physics generators, using G3 and G4
  • Incorporates the full calibration code, which is also run on-line and in HLT (code share)
  • Fully encapsulates the analysis; a lot of work on improving it, more quality and control checks needed
  • Efforts to reduce memory consumption in reco
  • G4 and Fluka in MC

11

Messages (7)

• CVMFS – timeline and procedures
  • Mature, scalable and supported product
  • Used by all other LHC experiments (and beyond)
  • Based on the proven CernVM family
  • Enabling technology for clouds, CernVM as a user interface, Virtual Analysis Facilities, opportunistic resources, volunteer computing, part of Long Term Data Preservation
  • April 2014 – CVMFS on all sites, the only method of SW distribution for ALICE (a minimal client-side check is sketched below)
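As a rough illustration of the kind of client-side check a site could run once CVMFS is the only software distribution method, here is a minimal sketch that verifies the alice.cern.ch repository is mounted and that the stock `cvmfs_config probe` succeeds. Whether and how a site automates such checks is not specified in the slides.

```python
#!/usr/bin/env python3
"""Quick sanity check that the ALICE CVMFS repository is available on a node."""
import os
import subprocess
import sys

REPO = "alice.cern.ch"                      # the ALICE CVMFS repository
MOUNTPOINT = os.path.join("/cvmfs", REPO)

def repo_mounted():
    """True if the mountpoint lists at least one entry (listing triggers the
    autofs mount on a standard client setup)."""
    try:
        return len(os.listdir(MOUNTPOINT)) > 0
    except OSError:
        return False

def probe_repositories():
    """Run the stock client probe; True if all configured repositories reply."""
    try:
        return subprocess.call(["cvmfs_config", "probe"]) == 0
    except FileNotFoundError:
        print("cvmfs_config not found: CVMFS client not installed?")
        return False

if __name__ == "__main__":
    ok = repo_mounted() and probe_repositories()
    print(f"{REPO} available: {ok}")
    sys.exit(0 if ok else 1)
```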

12

Sites Messages (1)

• UK
  • GridPP T1+19; RAL, Oxford and Birmingham for ALICE
  • Smooth operation; ALICE can (and does) run beyond its pledge, occasional problems with job memory
  • Test of cloud on a small scale
• RMKI_KFKI
  • Shared CMS/ALICE (170 cores, 72 TB disk)
  • Good resources delivery
  • Fast turnaround of experts; good documentation on operations is a must (done)

13

Sites Messages (2)

• KISTI
  • Extended support team of 8 people
  • Tape system tested with RAW data from CERN
  • Network still to be debugged, but not a showstopper
  • CPU to be ramped up x2 in 2013
  • Well on its way to becoming the first new T1 since the big T1 bang
• NDGF
  • Lose some (PDC), get some more cores (CSC)
  • Smooth going; dCache will stay and will get location information to improve efficiency
  • The 0.0009 (reported, not real) efficiency at DCSC/KU is still a mystery; it hurts NDGF as a whole and must be fixed

14

Sites Messages (3)

• Italy
  • New head honcho – Domenico Elia (grazie Massimo!)
  • Funding is tough; National Research Projects help a lot with manpower, PON helps with hardware in the south
  • 6 T2s and a T1 – smooth delivery and generally no issues
  • Torino is a hotbed of new technology – clouds (OpenNebula, GlusterFS, OpenWRT)
  • TAF is open for business, completely virtual (surprise!)
• Prague
  • The city is (partially) under water
  • Currently 3.7k cores and 2 PB disk, shared LHC/D0; contributes ~1.5% of the Grid resources of ALICE+ATLAS
  • Stable operation, distributed storage
  • Funding situation is degrading

15

Sites Messages (4)

• US
  • LLNL+LBL resources purchasing is complementary and fits well to cover changing requirements
  • CPU pledges fulfilled; SE a bit underused, on the rise
  • Infestation of the ‘zombie grass’ jobs – this is California, something of this sort was to be expected…
  • Possibility for tape storage at LBL (potential T1)
• France
  • 8 T2s, 1 T1, providing 10% of WLCG power, steady operation
  • Emphasis on common solutions for services and support
  • All centres are in LHCONE (7 PB in + 7 PB out have already passed through it)
  • Flat resources provisioning for the next 4 years

16

Sites Messages (5)

• India (Kolkata)
  • Provides about 1.2% of ALICE resources
  • Innovative cooling solution, all issues of the past solved, stable operation
  • Plans for steady resources expansion
• Germany
  • 2 T2s, 1 T1 – the largest T1 in WLCG, provides ~50% of ALICE T1 resources
  • Good centre names: Hessisches Hochleistungsrechenzentrum Goethe Universität (requires 180 IQ to say it)
  • The T2s have heterogeneous installations (both batch and storage), support many non-LHC groups, are well integrated in the ALICE Grid, smooth delivery

17

Sites Messages (6)

• Slovakia
  • In ALICE since 2006
  • Serves ALICE/ATLAS/HONE
  • Upgrades planned for air-conditioning and power, later CPU and disk; expert support is a concern
  • Reliable and steady resources provision
• RDIG
  • RRC-KI (toward T1): hardware (CPU/storage) rollout, service installation and validation, personnel is in place, pilot testing with ATLAS payloads
  • 8 T2s + JRAF + PoD@SPbSU deliver ~5% of the ALICE Grid resources, historically support all LHC VOs
  • Plans for steady growth and site consolidation
  • As all others, reliable and smooth operation

18

Social events

19

Victory!

How are you so cool under pressure?

I work at a T1!

20

The group