Enabling Grids for E-sciencE
Status of the LCG project
Julia Andreeva, on behalf of the LCG project
CERN
Geneva, Switzerland
NEC 2005
16.09.2005
Contents
• LCG project short overview
• LHC computing model and requirements on the LCG project (as estimated in the LCG TDR)
• Middleware evolution, new generation – gLite
• ARDA prototypes
• Summary
LCG project
• LCG project approved by CERN Council in September 2001
• LHC Experiments
  – Grid projects: Europe, US
  – Regional & national centres
• Goal
  – Prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
• Phase 1 – 2002-05
  – Development of a common software prototype
  – Operation of a pilot computing service
• Phase 2 – 2006-08
  – Acquire, build and operate the LHC computing service
LCG activities
• Applications Area – common projects, libraries and tools, data management
• Middleware Area – provision of grid middleware: acquisition, development, integration, testing, support
• CERN Fabric Area – cluster management, data handling, cluster technology, networking (WAN + local), computing service at CERN
• Grid Deployment Area – establishing and managing the Grid Service: middleware certification, security, operations, Service Challenges
• Distributed Analysis – joint project on distributed analysis with the LHC experiments
Cooperation with other projects
• Network Services
  – LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT
• Grid Software
  – Globus, Condor and VDT have provided key components of the middleware used; key members participate in OSG and EGEE
  – Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity
• Grid Operations
  – The majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors). EGEE also supports Core Infrastructure Centres and Regional Operations Centres.
  – The US LHC programmes contribute to and depend on the Open Science Grid (OSG). The formal relationship with LCG goes through the US-ATLAS and US-CMS computing projects.
  – The Nordic Data Grid Facility (NDGF) will begin operation in 2006. Prototype work is based on the NorduGrid middleware, ARC.
Operations: Computing Resources
• In EGEE-0 (LCG-2): 150 sites, ~14,000 CPUs, ~100 PB storage
• This greatly exceeds the project expectations for the number of sites
• New middleware – number of sites – heterogeneity – complexity
(Map: countries providing resources and countries anticipating joining EGEE/LCG)
Grid Operations
(Diagram: the Operations Management Centre (OMC) at the top, with Core Infrastructure Centres (CIC) and Regional Operations Centres (ROC) coordinating the Resource Centres (RC).)
• The grid is flat, but there is a hierarchy of responsibility
  – Essential to scale the operation
• Operations Management Centre (OMC)
  – At CERN – coordination etc.
• Core Infrastructure Centres (CIC)
  – Act as a single operations centre (one centre on shift)
  – Daily grid operations – oversight, troubleshooting
  – Run essential infrastructure services
  – Provide 2nd-level support to the ROCs
  – UK/I, Fr, It, CERN, + Russia + Taipei
• Regional Operations Centres (ROC)
  – Front-line support for users and operations
  – Provide local knowledge and adaptations
  – One in each region – many distributed
• User Support Centre (GGUS)
  – At FZK (Karlsruhe) (service desk)
Operations focus
• Main focus of activities now:
  – Improving operational reliability and application efficiency: automating monitoring alarms, ensuring a 24x7 service, removing sites that fail functional tests, operations interoperability with OSG and others
  – Improving user support: demonstrate to users a reliable and trusted support infrastructure
  – Deployment of gLite components: testing, certification, pre-production service; migration planning and deployment while maintaining/growing interoperability
Further developments now have to be driven by experience in real use
(Timeline 2004-2005: LCG-2 (=EGEE-0) runs as the production service while prototyping feeds into LCG-3 (=EGEE-x?), the next product.)
The LHC Computing Hierarchical Model
• Tier-0 at CERN
  – Record RAW data (1.25 GB/s for ALICE)
  – Distribute a second copy to the Tier-1s
  – Calibrate and do first-pass reconstruction
• Tier-1 centres (11 defined)
  – Manage permanent storage – RAW, simulated, processed
  – Capacity for reprocessing and bulk analysis
• Tier-2 centres (>~100 identified)
  – Monte Carlo event simulation
  – End-user analysis
• Tier-3
  – Facilities at universities and laboratories
  – Access to data and processing at Tier-2s and Tier-1s
  – Outside the scope of the project
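To make the division of labour concrete, here is a minimal toy sketch of the Tier-0/Tier-1 data flow (assuming nothing beyond the tier roles listed above; the site names are simply the Tier-1s from the next slide, and the round-robin placement is an illustrative simplification, not the real LCG placement policy):

```
# Toy sketch of the tiered data flow described above (not LCG software).
# Tier-0 keeps one copy of each RAW file at CERN and sends a second copy to a
# Tier-1, chosen round-robin here purely for illustration.
from itertools import cycle

TIER1S = ["TRIUMF", "GridKA", "CC-IN2P3", "CNAF", "SARA/NIKHEF",
          "NDGF", "ASCC", "RAL", "BNL", "FNAL", "PIC"]

def distribute_raw(files, tier1s=TIER1S):
    """Return {file: [locations]} with the CERN copy plus one Tier-1 copy."""
    placement = {}
    next_t1 = cycle(tier1s)
    for f in files:
        placement[f] = ["CERN-Tier0", next(next_t1)]
    return placement

if __name__ == "__main__":
    raw_files = [f"run0001_file{i:03d}.raw" for i in range(5)]
    for f, sites in distribute_raw(raw_files).items():
        print(f, "->", sites)
```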
Tier-1s
Experiments served with priority:

Tier-1 Centre                      ALICE  ATLAS  CMS  LHCb
TRIUMF, Canada                             X
GridKA, Germany                      X     X     X     X
CC-IN2P3, France                     X     X     X     X
CNAF, Italy                          X     X     X     X
SARA/NIKHEF, NL                      X     X           X
Nordic Data Grid Facility (NDGF)     X     X     X
ASCC, Taipei                               X     X
RAL, UK                              X     X     X     X
BNL, US                                    X
FNAL, US                                         X
PIC, Spain                                 X     X     X
Tier-2s
~100 identified – number still growing
Experiments’ Requirements
• Single Virtual Organization (VO) across the Grid
• Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs)
• Need for a reliable Workload Management System (WMS) to efficiently exploit distributed resources
• Non-event data such as calibration and alignment data, but also detector construction descriptions, will be held in databases
  – Read/write access to central (Oracle) databases at Tier-0 and read access at Tier-1s, with a local database cache at Tier-2s (see the sketch after this list)
• Analysis scenarios and specific requirements are still evolving
  – Prototype work is in progress (ARDA)
• Online requirements are outside the scope of LCG, but there are connections:
  – Raw data transfer and buffering
  – Database management and data export
  – Some potential use of Event Filter Farms for offline processing
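The tiered database access pattern in the fourth bullet can be pictured with a small sketch (illustration only; the ConditionsDB class and its methods are invented for this example, and real deployments used Oracle replication and caching services rather than anything like this):

```
# Toy model of the non-event-data access policy described above: writes only
# at Tier-0, read-only replica access at Tier-1, cached reads at Tier-2.
class ConditionsDB:
    def __init__(self, tier, upstream=None):
        self.tier = tier            # "Tier-0", "Tier-1" or "Tier-2"
        self.upstream = upstream    # replica source for Tier-1 / Tier-2
        self.store = {}             # authoritative data, only filled at Tier-0
        self.cache = {}             # local cache, only used at Tier-2

    def write(self, key, value):
        if self.tier != "Tier-0":
            raise PermissionError("calibration data is written only at Tier-0")
        self.store[key] = value

    def read(self, key):
        if self.tier == "Tier-0":
            return self.store[key]
        if self.tier == "Tier-1":
            return self.upstream.read(key)       # read-only replica access
        if key not in self.cache:                # Tier-2: read through a cache
            self.cache[key] = self.upstream.read(key)
        return self.cache[key]

tier0 = ConditionsDB("Tier-0")
tier0.write("ecal_calib_v1", [1.01, 0.98, 1.00])
tier1 = ConditionsDB("Tier-1", upstream=tier0)
tier2 = ConditionsDB("Tier-2", upstream=tier1)
print(tier2.read("ecal_calib_v1"))
```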
Architecture – Grid services
• Storage Element
  – Mass Storage System (MSS): CASTOR, Enstore, HPSS, dCache, etc.
  – Storage Resource Manager (SRM) provides a common way to access the MSS, independent of the implementation
  – File Transfer Services (FTS), provided e.g. by GridFTP or srmCopy
• Computing Element
  – Interface to the local batch system, e.g. the Globus gatekeeper
  – Accounting, status query, job monitoring
• Virtual Organization Management
  – Virtual Organization Management Services (VOMS)
  – Authentication and authorization based on the VOMS model
• Grid Catalogue Services
  – Mapping of Globally Unique Identifiers (GUIDs) to local file names (see the sketch after this list)
  – Hierarchical namespace, access control
• Interoperability
  – EGEE and OSG both use the Virtual Data Toolkit (VDT)
  – Different implementations are hidden behind common interfaces
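A toy in-memory catalogue makes the GUID mapping concrete (a sketch only; the FileCatalogue class and its methods are invented for illustration and do not correspond to the actual LCG catalogue APIs):

```
import uuid

# Toy grid file catalogue: a GUID maps to one logical file name (LFN) in a
# hierarchical namespace and to any number of physical replicas (SURLs).
class FileCatalogue:
    def __init__(self):
        self.by_guid = {}   # guid -> {"lfn": ..., "replicas": set()}
        self.by_lfn = {}    # lfn  -> guid

    def register(self, lfn, surl):
        """Register a new file under an LFN with its first replica."""
        guid = str(uuid.uuid4())
        self.by_guid[guid] = {"lfn": lfn, "replicas": {surl}}
        self.by_lfn[lfn] = guid
        return guid

    def add_replica(self, guid, surl):
        self.by_guid[guid]["replicas"].add(surl)

    def replicas(self, lfn):
        return sorted(self.by_guid[self.by_lfn[lfn]]["replicas"])

cat = FileCatalogue()
g = cat.register("/grid/cms/2005/reco/file001.root",
                 "srm://castor.cern.ch/castor/cern.ch/cms/file001.root")
cat.add_replica(g, "srm://gridka.example.de/pnfs/cms/file001.root")
print(cat.replicas("/grid/cms/2005/reco/file001.root"))
```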
Technology - Middleware
• Currently, the LCG-2 middleware is deployed at more than 100 sites
• It originated from Condor, EDG, Globus, VDT and other projects
• It will now evolve to include functionality of the gLite middleware provided by the EGEE project, which has just been made available
• Site services include security, the Computing Element (CE), the Storage Element (SE), and Monitoring and Accounting Services – currently available both from LCG-2 and gLite
• VO services such as the Workload Management System (WMS), File Catalogues, Information Services and File Transfer Services exist in both flavours (LCG-2 and gLite), maintaining close relations with VDT, Condor and Globus
gLite middleware
– The 1st release of gLite (v1.0) was made at the end of March 2005
  http://glite.web.cern.ch/glite/packages/R1.0/R20050331
  http://glite.web.cern.ch/glite/documentation
– Lightweight services
– Interoperability & co-existence with the deployed infrastructure
– Performance & fault tolerance
– Portable
– Service-oriented approach
– Site autonomy
– Open source license
Main Differences to LCG-2
• Workload Management System works in both push and pull mode (illustrated in the sketch after this list)
• Computing Element moving towards a VO based scheduler guarding the jobs of the VO (reduces load on GRAM)
• Re-factored file & replica catalogs
• Secure catalogs (based on user DN; VOMS certificates being integrated)
• Scheduled data transfers
• SRM-based storage
• Information Services: R-GMA with improved API, Service Discovery and registry replication
• Move towards Web Services
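The push/pull distinction can be pictured with a toy scheduler (a minimal sketch, not gLite code; ComputingElement, push_schedule and pull_schedule are invented names for illustration):

```
import queue

# Push mode: the broker picks a Computing Element and sends the job to it.
# Pull mode: Computing Elements ask a shared task queue for work when they
# have free slots.
class ComputingElement:
    def __init__(self, name, free_slots):
        self.name, self.free_slots, self.jobs = name, free_slots, []

    def accept(self, job):
        self.jobs.append(job)
        self.free_slots -= 1

def push_schedule(jobs, ces):
    """Broker-driven: each job is pushed to the CE with the most free slots."""
    for job in jobs:
        best = max(ces, key=lambda ce: ce.free_slots)
        best.accept(job)

def pull_schedule(jobs, ces):
    """CE-driven: CEs pull jobs from a shared queue while they have capacity."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    while not q.empty() and any(ce.free_slots > 0 for ce in ces):
        for ce in ces:
            if ce.free_slots > 0 and not q.empty():
                ce.accept(q.get())

ces = [ComputingElement("CERN", 3), ComputingElement("RAL", 2)]
push_schedule([f"job{i}" for i in range(4)], ces)
print({ce.name: ce.jobs for ce in ces})
```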
Prototypes
• It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges
• Data Challenges have been recommended by the ‘Hoffmann Review’ of 2001. They have now been done by all experiments. Though the main goal was to validate the distributed computing model and to gradually build the computing systems, the results have been used for physics performance studies and for detector, trigger, and DAQ design. Limitations of the Grids have been identified and are being addressed.
• Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use-cases over an extended period, leading to stable production services.
• The project ‘A Realisation of Distributed Analysis for LHC’ (ARDA) is developing end-to-end prototypes of distributed analysis systems using the EGEE middleware gLite for each of the LHC experiments.
ARDA – A Realisation of Distributed Analysis for LHC
• Distributed analysis on the Grid is the most difficult and least defined topic
• ARDA sets out to develop end-to-end analysis prototypes using the LCG-supported middleware.
• ALICE uses the AliROOT framework based on PROOF.
• ATLAS has used DIAL services with the gLite prototype as backend.
• CMS has prototyped the 'ARDA Support for CMS Analysis Processing' (ASAP) that is used by CMS physicists for daily analysis work.
• LHCb has based its prototype on GANGA, a common project between ATLAS and LHCb.
Running parallel instances of ATHENA on gLite (ATLAS/ARDA and Taipei ASCC)
CMS: ASAP prototype
(Workflow diagram, main elements:)
• ASAP UI – task description: application, application version, executable, Orca data cards, data sample, working directory, Castor directory to save output, number of events to be processed, number of events per job
• Delegates user credentials using MyProxy; job submission to gLite via JDL
• ASAP job monitoring service – publishing job status on the web (MonALISA)
• Checking job status, resubmission in case of failure, fetching results, storing results to Castor, output files location
• RefDB / PubDB; job running on the Worker Node
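The submit/monitor/resubmit cycle from the diagram can be sketched as a short loop (illustration only; submit, status and fetch_output are stand-ins invented for this example, not the ASAP or gLite APIs):

```
import random

# Toy version of the submit / monitor / resubmit / fetch loop sketched above.
def submit(job):
    return {"id": f"https://lb.example.org/{job}", "job": job}

def status(handle):
    # pretend roughly one attempt in five fails
    return random.choice(["Done"] * 4 + ["Aborted"])

def fetch_output(handle, castor_dir):
    print(f"copying output of {handle['job']} to {castor_dir}")

def run_task(jobs, castor_dir, max_retries=3):
    for job in jobs:
        for attempt in range(max_retries):
            handle = submit(job)
            if status(handle) == "Done":
                fetch_output(handle, castor_dir)
                break
            print(f"{job} failed (attempt {attempt + 1}), resubmitting")
        else:
            print(f"{job} gave up after {max_retries} attempts")

run_task(["job_000", "job_001"], "/castor/cern.ch/user/x/analysis")
```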
Job Monitoring
• ASAP Monitor
CMS – using MonALISA for user job monitoring
• A single job is submitted to gLite
• The JDL contains job-splitting instructions
• The master job is split by gLite into sub-jobs
• Dynamic monitoring of the total number of events processed by all sub-jobs belonging to the same master job
• Demo at Supercomputing 04
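The master-job bookkeeping behind this monitoring can be sketched as follows (a toy example; split_master and total_processed are invented helpers, not MonALISA or gLite code):

```
# Toy sketch of master-job / sub-job bookkeeping for the monitoring above.
def split_master(total_events, events_per_job):
    """Split a master job into sub-jobs of at most events_per_job events."""
    subjobs, first = [], 0
    while first < total_events:
        last = min(first + events_per_job, total_events)
        subjobs.append({"first_event": first,
                        "n_events": last - first,
                        "processed": 0})
        first = last
    return subjobs

def total_processed(subjobs):
    """What the monitor plots: events processed so far, summed over sub-jobs."""
    return sum(sj["processed"] for sj in subjobs)

subjobs = split_master(total_events=10_000, events_per_job=3_000)
subjobs[0]["processed"] = 3_000     # first sub-job finished
subjobs[1]["processed"] = 1_200     # second still running
print(len(subjobs), "sub-jobs;", total_processed(subjobs), "events processed")
```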
Merging the results
Summary
• The LCG infrastructure is proving to be an essential tool for the experiments
• Development and deployment of the gLite middleware aim to provide additional functionality and improved performance, and to satisfy the challenging requirements of the LHC experiments
Backup slide What is EGEE?
EGEE is the largest Grid infrastructure project in Europe:
• 70 leading institutions in 27 countries, federated in regional Grids
• Leveraging national and regional grid activities
• Started April 2004 (ends March 2006)
• EU review in February 2005 was successful
• Preparing the 2nd phase of the project
  – Proposal to the EU Grid call of September 2005
  – 2 years starting April 2006
• Promoting scientific partnership outside the EU

Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day

LCG and EGEE are different projects, but collaboration is ensured (sharing instead of duplication)
Backup slide Tier-0 -1 -2 Connectivity
• Tier-2s and Tier-1s are inter-connected by the general-purpose research networks
• Any Tier-2 may access data at any Tier-1
(Diagram: Tier-2s connected through the research networks to the Tier-1s – TRIUMF, ASCC, FNAL, BNL, Nordic, IN2P3, CNAF, SARA, PIC, RAL, GridKa)
• National Research Networks (NRENs) at Tier-1s: ASnet, LHCnet/ESnet, GARR, LHCnet/ESnet, RENATER, DFN, SURFnet6, NORDUnet, RedIRIS, UKERNA, CANARIE
Backup slides The Eventflow
Experiment   Rate [Hz]   RAW [MB]   ESD/rDST/RECO [MB]   AOD [kB]   Monte Carlo [MB/evt]   Monte Carlo [% of real]
ALICE HI        100         12.5          2.5               250            300                    100
ALICE pp        100          1            0.04                4              0.4                  100
ATLAS           200          1.6          0.5               100              2                     20
CMS             150          1.5          0.25               50              2                    100
LHCb           2000          0.025        0.025               0.5            –                     20

50 days running in 2007
10^7 seconds/year pp from 2008 on
~10^9 events/experiment
10^6 seconds/year heavy ion
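A quick back-of-the-envelope check, using only the rates and RAW event sizes above and the 10^7 s/year of pp running (the script is just this arithmetic; ALICE heavy-ion running, 100 Hz at 12.5 MB over 10^6 s, would add a further ~1.25 PB):

```
# Annual RAW data volume implied by the table above for pp running.
SECONDS_PP = 1e7                 # pp seconds per year from 2008 on

raw_rate_mb = {                  # trigger rate [Hz] * RAW event size [MB]
    "ALICE pp": 100 * 1.0,
    "ATLAS":    200 * 1.6,
    "CMS":      150 * 1.5,
    "LHCb":    2000 * 0.025,
}

for expt, mb_per_s in raw_rate_mb.items():
    petabytes = mb_per_s * SECONDS_PP / 1e9   # MB -> PB
    print(f"{expt:9s} ~{petabytes:.1f} PB of RAW per year")
```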
Backup slides CPU Requirements
(Bar chart: projected CPU requirements in MSI2000 for 2007-2010, broken down by experiment – ALICE, ATLAS, CMS, LHCb – and by CERN / Tier-1 / Tier-2; 58% of the requirement is pledged.)
Backup slide Disk Requirements
(Bar chart: projected disk requirements in PB for 2007-2010, broken down by experiment – ALICE, ATLAS, CMS, LHCb – and by CERN / Tier-1 / Tier-2; 54% of the requirement is pledged.)
Backup slide Tape Requirements
(Bar chart: projected tape requirements in PB for 2007-2010, broken down by experiment – ALICE, ATLAS, CMS, LHCb – at CERN and the Tier-1s; 75% of the requirement is pledged.)
Backup slide Tier-0 components
• Batch system (LSF) to manage CPU resources
• Shared file system (AFS)
• Disk pool and mass storage (MSS) manager (CASTOR)
• Extremely Large Fabric management system (ELFms)
  – Quattor – system administration: installation and configuration
  – LHC Era Monitoring (LEMON) system, server/client based
  – LHC-Era Automated Fabric (LEAF) – high-level commands to sets of nodes
• CPU servers – 'white boxes', Intel processors, (Scientific) Linux
• Disk storage – Network Attached Storage (NAS), mostly mirrored
• Tape storage – currently STK robots; future system under evaluation
• Network – fast Gigabit Ethernet switches connected to multi-gigabit backbone routers
Data Challenges
• ALICE
  – PDC04 used AliEn services, native or interfaced to the LCG Grid. 400,000 jobs were run, producing 40 TB of data for the Physics Performance Report.
  – PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn.
• ATLAS
  – Used tools and resources from LCG, NorduGrid and Grid3 at 133 sites in 30 countries, with over 10,000 processors; 235,000 jobs produced more than 30 TB of data using an automatic production system.
• CMS
  – 100 TB of simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there.
• LHCb
  – LCG provided more than 50% of the capacity for the first data challenge, 2004-2005. The production used the DIRAC system.
Service Challenges
• A series of Service Challenges (SC) set out to successively approach the production needs of LHC
• While SC1 did not meet its goal of transferring continuously for 2 weeks at a rate of 500 MB/s, SC2 exceeded that goal by sustaining a throughput of 600 MB/s to 7 sites.
• SC3 starts soon, using gLite middleware components, with disk-to-disk throughput tests and 10 Gb networking of Tier-1s to CERN, providing an SRM (1.1) interface to managed storage at the Tier-1s. The goal is to achieve 150 MB/s disk-to-disk and 60 MB/s to managed tape. There will also be Tier-1 to Tier-2 transfer tests (the rates are converted to daily volumes in the sketch below).
• SC4 aims to demonstrate that all requirements from raw data taking to analysis can be met at least 6 months prior to data taking. The aggregate rate out of CERN is required to be 1.6 GB/s to tape at Tier-1s.
• The Service Challenges will turn into production services for the experiments.
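To put these targets in perspective, here is a simple conversion of the quoted rates into daily data volumes (plain unit arithmetic on the numbers above; nothing else is assumed):

```
# Rough conversion of the Service Challenge throughput targets into TB/day.
SECONDS_PER_DAY = 86_400

targets_mb_per_s = {
    "SC2 achieved (aggregate)":        600,
    "SC3 disk-to-disk per Tier-1":     150,
    "SC3 to managed tape per Tier-1":   60,
    "SC4 aggregate out of CERN":      1600,   # 1.6 GB/s
}

for name, rate in targets_mb_per_s.items():
    tb_per_day = rate * SECONDS_PER_DAY / 1e6   # MB -> TB
    print(f"{name:32s} {rate:5d} MB/s ~= {tb_per_day:6.1f} TB/day")
```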
Backup slide Key dates for Service Preparation
(Timeline 2005-2008: Sep 05 – SC3 service phase; May 06 – SC4 service phase; Sep 06 – initial LHC service in stable operation; Apr 07 – LHC service commissioned; followed by cosmics, first beams, first physics and the full physics run.)

• SC3 – reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 1 GB/s, including mass storage 500 MB/s (150 MB/s & 60 MB/s at Tier-1s)
• SC4 – all Tier-1s, major Tier-2s – capable of supporting the full experiment software chain, including analysis – sustain nominal final grid data throughput (~1.5 GB/s mass storage throughput)
• LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput