Enabling Grids for E-sciencE
Status of the LCG project
Julia Andreeva, on behalf of the LCG project
CERN
Geneva, Switzerland
NEC 2005
16.09.2005
Contents
• LCG project short overview
• LHC computing model and requirements on the LCG project (as estimated in the LCG TDR)
• Middleware evolution, new generation – gLite
• ARDA prototypes
• Summary
LCG project
• LCG project approved by CERN Council in September 2001
• LHC Experiments
  – Grid projects: Europe, US
  – Regional & national centres
• Goal
  – Prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
• Phase 1 – 2002-05
  – Development of a common software prototype
  – Operation of a pilot computing service
• Phase 2 – 2006-08
  – Acquire, build and operate the LHC computing service
LCG activities
• Applications Area – common projects, libraries and tools, data management
• Middleware Area – provision of grid middleware: acquisition, development, integration, testing, support
• CERN Fabric Area – cluster management, data handling, cluster technology, networking (WAN + local), computing service at CERN
• Grid Deployment Area – establishing and managing the Grid Service: middleware certification, security, operations, Service Challenges
• Distributed Analysis – joint project on distributed analysis with the LHC experiments
Cooperation with other projects
• Network Services
  – LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT
• Grid Software
  – Globus, Condor and VDT have provided key components of the middleware used; key members participate in OSG and EGEE
  – Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity
• Grid Operations
  – The majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors). EGEE also supports Core Infrastructure Centres and Regional Operations Centres.
  – The US LHC programmes contribute to and depend on the Open Science Grid (OSG). The formal relationship with LCG goes through the US-ATLAS and US-CMS computing projects.
  – The Nordic Data Grid Facility (NDGF) will begin operation in 2006. Prototype work is based on the NorduGrid middleware, ARC.
Operations: Computing Resources
• In EGEE-0 (LCG-2): 150 sites, ~14,000 CPUs, ~100 PB storage
• This greatly exceeds the project expectations for the number of sites
• New middleware – number of sites – heterogeneity – complexity
(Map: countries providing resources and countries anticipating joining EGEE/LCG)
Grid Operations
(Diagram: the Operations Management Centre (OMC) at the top, with Core Infrastructure Centres (CIC) and Regional Operations Centres (ROC) coordinating the Resource Centres (RC).)
• The grid is flat, but there is a hierarchy of responsibility
  – Essential to scale the operation
• Operations Management Centre (OMC)
  – At CERN – coordination etc.
• Core Infrastructure Centres (CIC)
  – Act as a single operations centre (one centre on shift)
  – Daily grid operations – oversight, troubleshooting
  – Run essential infrastructure services
  – Provide 2nd-level support to the ROCs
  – UK/I, Fr, It, CERN, + Russia + Taipei
• Regional Operations Centres (ROC)
  – Front-line support for users and operations
  – Provide local knowledge and adaptations
  – One in each region – many distributed
• User Support Centre (GGUS)
  – At FZK (Karlsruhe) (service desk)
Operations focus
• Main focus of activities now:
  – Improving operational reliability and application efficiency: automating monitoring alarms, ensuring a 24x7 service, removing sites that fail functional tests, operations interoperability with OSG and others
  – Improving user support: demonstrate to users a reliable and trusted support infrastructure
  – Deployment of gLite components: testing, certification, pre-production service; migration planning and deployment while maintaining/growing interoperability
Further developments now have to be driven by experience in real use
(Timeline 2004-2005: LCG-2 (=EGEE-0) runs as the production service while prototyping feeds into LCG-3 (=EGEE-x?), the next product.)
The LHC Computing Hierarchical Model
• Tier-0 at CERN
  – Record RAW data (1.25 GB/s for ALICE)
  – Distribute a second copy to the Tier-1s
  – Calibrate and do first-pass reconstruction
• Tier-1 centres (11 defined)
  – Manage permanent storage – RAW, simulated, processed
  – Capacity for reprocessing and bulk analysis
• Tier-2 centres (>~100 identified)
  – Monte Carlo event simulation
  – End-user analysis
• Tier-3
  – Facilities at universities and laboratories
  – Access to data and processing at Tier-2s and Tier-1s
  – Outside the scope of the project
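To make the division of labour concrete, here is a minimal toy sketch of the Tier-0/Tier-1 data flow (assuming nothing beyond the tier roles listed above; the site names are simply the Tier-1s from the next slide, and the round-robin placement is an illustrative simplification, not the real LCG placement policy):

```
# Toy sketch of the tiered data flow described above (not LCG software).
# Tier-0 keeps one copy of each RAW file at CERN and sends a second copy to a
# Tier-1, chosen round-robin here purely for illustration.
from itertools import cycle

TIER1S = ["TRIUMF", "GridKA", "CC-IN2P3", "CNAF", "SARA/NIKHEF",
          "NDGF", "ASCC", "RAL", "BNL", "FNAL", "PIC"]

def distribute_raw(files, tier1s=TIER1S):
    """Return {file: [locations]} with the CERN copy plus one Tier-1 copy."""
    placement = {}
    next_t1 = cycle(tier1s)
    for f in files:
        placement[f] = ["CERN-Tier0", next(next_t1)]
    return placement

if __name__ == "__main__":
    raw_files = [f"run0001_file{i:03d}.raw" for i in range(5)]
    for f, sites in distribute_raw(raw_files).items():
        print(f, "->", sites)
```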
Tier-1s
Experiments served with priority:

Tier-1 Centre                      ALICE  ATLAS  CMS  LHCb
TRIUMF, Canada                             X
GridKA, Germany                      X     X     X     X
CC-IN2P3, France                     X     X     X     X
CNAF, Italy                          X     X     X     X
SARA/NIKHEF, NL                      X     X           X
Nordic Data Grid Facility (NDGF)     X     X     X
ASCC, Taipei                               X     X
RAL, UK                              X     X     X     X
BNL, US                                    X
FNAL, US                                         X
PIC, Spain                                 X     X     X
Tier-2s
~100 identified – number still growing
Experiments’ Requirements
• Single Virtual Organization (VO) across the Grid
• Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs)
• Need for a reliable Workload Management System (WMS) to efficiently exploit distributed resources
• Non-event data such as calibration and alignment data, but also detector construction descriptions, will be held in databases
  – Read/write access to central (Oracle) databases at Tier-0 and read access at Tier-1s, with a local database cache at Tier-2s (see the sketch after this list)
• Analysis scenarios and specific requirements are still evolving
  – Prototype work is in progress (ARDA)
• Online requirements are outside the scope of LCG, but there are connections:
  – Raw data transfer and buffering
  – Database management and data export
  – Some potential use of Event Filter Farms for offline processing
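The tiered database access pattern in the fourth bullet can be pictured with a small sketch (illustration only; the ConditionsDB class and its methods are invented for this example, and real deployments used Oracle replication and caching services rather than anything like this):

```
# Toy model of the non-event-data access policy described above: writes only
# at Tier-0, read-only replica access at Tier-1, cached reads at Tier-2.
class ConditionsDB:
    def __init__(self, tier, upstream=None):
        self.tier = tier            # "Tier-0", "Tier-1" or "Tier-2"
        self.upstream = upstream    # replica source for Tier-1 / Tier-2
        self.store = {}             # authoritative data, only filled at Tier-0
        self.cache = {}             # local cache, only used at Tier-2

    def write(self, key, value):
        if self.tier != "Tier-0":
            raise PermissionError("calibration data is written only at Tier-0")
        self.store[key] = value

    def read(self, key):
        if self.tier == "Tier-0":
            return self.store[key]
        if self.tier == "Tier-1":
            return self.upstream.read(key)       # read-only replica access
        if key not in self.cache:                # Tier-2: read through a cache
            self.cache[key] = self.upstream.read(key)
        return self.cache[key]

tier0 = ConditionsDB("Tier-0")
tier0.write("ecal_calib_v1", [1.01, 0.98, 1.00])
tier1 = ConditionsDB("Tier-1", upstream=tier0)
tier2 = ConditionsDB("Tier-2", upstream=tier1)
print(tier2.read("ecal_calib_v1"))
```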
Architecture – Grid services
• Storage Element
  – Mass Storage System (MSS): CASTOR, Enstore, HPSS, dCache, etc.
  – Storage Resource Manager (SRM) provides a common way to access the MSS, independent of the implementation
  – File Transfer Services (FTS), provided e.g. by GridFTP or srmCopy
• Computing Element
  – Interface to the local batch system, e.g. the Globus gatekeeper
  – Accounting, status query, job monitoring
• Virtual Organization Management
  – Virtual Organization Management Services (VOMS)
  – Authentication and authorization based on the VOMS model
• Grid Catalogue Services
  – Mapping of Globally Unique Identifiers (GUIDs) to local file names (see the sketch after this list)
  – Hierarchical namespace, access control
• Interoperability
  – EGEE and OSG both use the Virtual Data Toolkit (VDT)
  – Different implementations are hidden behind common interfaces
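A toy in-memory catalogue makes the GUID mapping concrete (a sketch only; the FileCatalogue class and its methods are invented for illustration and do not correspond to the actual LCG catalogue APIs):

```
import uuid

# Toy grid file catalogue: a GUID maps to one logical file name (LFN) in a
# hierarchical namespace and to any number of physical replicas (SURLs).
class FileCatalogue:
    def __init__(self):
        self.by_guid = {}   # guid -> {"lfn": ..., "replicas": set()}
        self.by_lfn = {}    # lfn  -> guid

    def register(self, lfn, surl):
        """Register a new file under an LFN with its first replica."""
        guid = str(uuid.uuid4())
        self.by_guid[guid] = {"lfn": lfn, "replicas": {surl}}
        self.by_lfn[lfn] = guid
        return guid

    def add_replica(self, guid, surl):
        self.by_guid[guid]["replicas"].add(surl)

    def replicas(self, lfn):
        return sorted(self.by_guid[self.by_lfn[lfn]]["replicas"])

cat = FileCatalogue()
g = cat.register("/grid/cms/2005/reco/file001.root",
                 "srm://castor.cern.ch/castor/cern.ch/cms/file001.root")
cat.add_replica(g, "srm://gridka.example.de/pnfs/cms/file001.root")
print(cat.replicas("/grid/cms/2005/reco/file001.root"))
```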
Technology - Middleware
• Currently, the LCG-2 middleware is deployed at more than 100 sites
• It originated from Condor, EDG, Globus, VDT and other projects
• It will now evolve to include functionality of the gLite middleware provided by the EGEE project, which has just been made available
• Site services include security, the Computing Element (CE), the Storage Element (SE), and Monitoring and Accounting Services – currently available both from LCG-2 and gLite
• VO services such as the Workload Management System (WMS), File Catalogues, Information Services and File Transfer Services exist in both flavours (LCG-2 and gLite), maintaining close relations with VDT, Condor and Globus
gLite middleware
– The 1st release of gLite (v1.0) was made at the end of March 2005
  http://glite.web.cern.ch/glite/packages/R1.0/R20050331
  http://glite.web.cern.ch/glite/documentation
– Lightweight services
– Interoperability & co-existence with the deployed infrastructure
– Performance & fault tolerance
– Portable
– Service-oriented approach
– Site autonomy
– Open source license
Main Differences to LCG-2
• Workload Management System works in both push and pull mode (illustrated in the sketch after this list)
• Computing Element moving towards a VO based scheduler guarding the jobs of the VO (reduces load on GRAM)
• Re-factored file & replica catalogs
• Secure catalogs (based on user DN; VOMS certificates being integrated)
• Scheduled data transfers
• SRM-based storage
• Information Services: R-GMA with improved API, Service Discovery and registry replication
• Move towards Web Services
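The push/pull distinction can be pictured with a toy scheduler (a minimal sketch, not gLite code; ComputingElement, push_schedule and pull_schedule are invented names for illustration):

```
import queue

# Push mode: the broker picks a Computing Element and sends the job to it.
# Pull mode: Computing Elements ask a shared task queue for work when they
# have free slots.
class ComputingElement:
    def __init__(self, name, free_slots):
        self.name, self.free_slots, self.jobs = name, free_slots, []

    def accept(self, job):
        self.jobs.append(job)
        self.free_slots -= 1

def push_schedule(jobs, ces):
    """Broker-driven: each job is pushed to the CE with the most free slots."""
    for job in jobs:
        best = max(ces, key=lambda ce: ce.free_slots)
        best.accept(job)

def pull_schedule(jobs, ces):
    """CE-driven: CEs pull jobs from a shared queue while they have capacity."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    while not q.empty() and any(ce.free_slots > 0 for ce in ces):
        for ce in ces:
            if ce.free_slots > 0 and not q.empty():
                ce.accept(q.get())

ces = [ComputingElement("CERN", 3), ComputingElement("RAL", 2)]
push_schedule([f"job{i}" for i in range(4)], ces)
print({ce.name: ce.jobs for ce in ces})
```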
Prototypes
• It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges
• Data Challenges have been recommended by the ‘Hoffmann Review’ of 2001. They have now been done by all experiments. Though the main goal was to validate the distributed computing model and to gradually build the computing systems, the results have been used for physics performance studies and for detector, trigger, and DAQ design. Limitations of the Grids have been identified and are being addressed.
• Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use-cases over an extended period, leading to stable production services.
• The project ‘A Realisation of Distributed Analysis for LHC’ (ARDA) is developing end-to-end prototypes of distributed analysis systems using the EGEE middleware gLite for each of the LHC experiments.
ARDA – A Realisation of Distributed Analysis for LHC
• Distributed analysis on the Grid is the most difficult and least defined topic
• ARDA sets out to develop end-to-end analysis prototypes using the LCG-supported middleware.
• ALICE uses the AliROOT framework based on PROOF.
• ATLAS has used DIAL services with the gLite prototype as backend.
• CMS has prototyped the 'ARDA Support for CMS Analysis Processing' (ASAP) that is used by CMS physicists for daily analysis work.
• LHCb has based its prototype on GANGA, a common project between ATLAS and LHCb.
Running parallel instances of ATHENA on gLite (ATLAS/ARDA and Taipei ASCC)
CMS: ASAP prototype
(Workflow diagram, main elements:)
• ASAP UI – task description: application, application version, executable, Orca data cards, data sample, working directory, Castor directory to save output, number of events to be processed, number of events per job
• Delegates user credentials using MyProxy; job submission to gLite via JDL
• ASAP job monitoring service – publishing job status on the web (MonALISA)
• Checking job status, resubmission in case of failure, fetching results, storing results to Castor, output files location
• RefDB / PubDB; job running on the Worker Node
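The submit/monitor/resubmit cycle from the diagram can be sketched as a short loop (illustration only; submit, status and fetch_output are stand-ins invented for this example, not the ASAP or gLite APIs):

```
import random

# Toy version of the submit / monitor / resubmit / fetch loop sketched above.
def submit(job):
    return {"id": f"https://lb.example.org/{job}", "job": job}

def status(handle):
    # pretend roughly one attempt in five fails
    return random.choice(["Done"] * 4 + ["Aborted"])

def fetch_output(handle, castor_dir):
    print(f"copying output of {handle['job']} to {castor_dir}")

def run_task(jobs, castor_dir, max_retries=3):
    for job in jobs:
        for attempt in range(max_retries):
            handle = submit(job)
            if status(handle) == "Done":
                fetch_output(handle, castor_dir)
                break
            print(f"{job} failed (attempt {attempt + 1}), resubmitting")
        else:
            print(f"{job} gave up after {max_retries} attempts")

run_task(["job_000", "job_001"], "/castor/cern.ch/user/x/analysis")
```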
Job Monitoring
• ASAP Monitor
CMS – using MonALISA for user job monitoring
• A single job is submitted to gLite
• The JDL contains job-splitting instructions
• The master job is split by gLite into sub-jobs
• Dynamic monitoring of the total number of events processed by all sub-jobs belonging to the same master job
• Demo at Supercomputing 04
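The master-job bookkeeping behind this monitoring can be sketched as follows (a toy example; split_master and total_processed are invented helpers, not MonALISA or gLite code):

```
# Toy sketch of master-job / sub-job bookkeeping for the monitoring above.
def split_master(total_events, events_per_job):
    """Split a master job into sub-jobs of at most events_per_job events."""
    subjobs, first = [], 0
    while first < total_events:
        last = min(first + events_per_job, total_events)
        subjobs.append({"first_event": first,
                        "n_events": last - first,
                        "processed": 0})
        first = last
    return subjobs

def total_processed(subjobs):
    """What the monitor plots: events processed so far, summed over sub-jobs."""
    return sum(sj["processed"] for sj in subjobs)

subjobs = split_master(total_events=10_000, events_per_job=3_000)
subjobs[0]["processed"] = 3_000     # first sub-job finished
subjobs[1]["processed"] = 1_200     # second still running
print(len(subjobs), "sub-jobs;", total_processed(subjobs), "events processed")
```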
Merging the results
Summary
• The LCG infrastructure is proving to be an essential tool for the experiments
• Development and deployment of the gLite middleware aim to provide additional functionality and improved performance, and to satisfy the challenging requirements of the LHC experiments
Backup slide What is EGEE?
EGEE is the largest Grid infrastructure project in Europe:
• 70 leading institutions in 27 countries, federated in regional Grids
• Leveraging national and regional grid activities
• Started April 2004 (ends March 2006)
• EU review in February 2005 was successful
• Preparing the 2nd phase of the project
  – Proposal to the EU Grid call of September 2005
  – 2 years starting April 2006
• Promoting scientific partnership outside the EU

Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day

LCG and EGEE are different projects, but collaboration is ensured (sharing instead of duplication)
Backup slide Tier-0 -1 -2 Connectivity
• Tier-2s and Tier-1s are inter-connected by the general-purpose research networks
• Any Tier-2 may access data at any Tier-1
(Diagram: Tier-2s connected through the research networks to the Tier-1s – TRIUMF, ASCC, FNAL, BNL, Nordic, IN2P3, CNAF, SARA, PIC, RAL, GridKa)
• National Research Networks (NRENs) at Tier-1s: ASnet, LHCnet/ESnet, GARR, LHCnet/ESnet, RENATER, DFN, SURFnet6, NORDUnet, RedIRIS, UKERNA, CANARIE
Backup slides The Eventflow
Experiment   Rate [Hz]   RAW [MB]   ESD/rDST/RECO [MB]   AOD [kB]   Monte Carlo [MB/evt]   Monte Carlo [% of real]
ALICE HI        100         12.5          2.5               250            300                    100
ALICE pp        100          1            0.04                4              0.4                  100
ATLAS           200          1.6          0.5               100              2                     20
CMS             150          1.5          0.25               50              2                    100
LHCb           2000          0.025        0.025               0.5            –                     20

50 days running in 2007
10^7 seconds/year pp from 2008 on
~10^9 events/experiment
10^6 seconds/year heavy ion
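A quick back-of-the-envelope check, using only the rates and RAW event sizes above and the 10^7 s/year of pp running (the script is just this arithmetic; ALICE heavy-ion running, 100 Hz at 12.5 MB over 10^6 s, would add a further ~1.25 PB):

```
# Annual RAW data volume implied by the table above for pp running.
SECONDS_PP = 1e7                 # pp seconds per year from 2008 on

raw_rate_mb = {                  # trigger rate [Hz] * RAW event size [MB]
    "ALICE pp": 100 * 1.0,
    "ATLAS":    200 * 1.6,
    "CMS":      150 * 1.5,
    "LHCb":    2000 * 0.025,
}

for expt, mb_per_s in raw_rate_mb.items():
    petabytes = mb_per_s * SECONDS_PP / 1e9   # MB -> PB
    print(f"{expt:9s} ~{petabytes:.1f} PB of RAW per year")
```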
Backup slides CPU Requirements
(Bar chart: projected CPU requirements in MSI2000 for 2007-2010, broken down by experiment – ALICE, ATLAS, CMS, LHCb – and by CERN / Tier-1 / Tier-2; 58% of the requirement is pledged.)
Backup slide Disk Requirements
(Bar chart: projected disk requirements in PB for 2007-2010, broken down by experiment – ALICE, ATLAS, CMS, LHCb – and by CERN / Tier-1 / Tier-2; 54% of the requirement is pledged.)
Backup slide Tape Requirements
(Bar chart: projected tape requirements in PB for 2007-2010, broken down by experiment – ALICE, ATLAS, CMS, LHCb – at CERN and the Tier-1s; 75% of the requirement is pledged.)
Backup slide Tier-0 components
• Batch system (LSF) to manage CPU resources
• Shared file system (AFS)
• Disk pool and mass storage (MSS) manager (CASTOR)
• Extremely Large Fabric management system (ELFms)
  – Quattor – system administration: installation and configuration
  – LHC Era Monitoring (LEMON) system, server/client based
  – LHC-Era Automated Fabric (LEAF) – high-level commands to sets of nodes
• CPU servers – 'white boxes', Intel processors, (Scientific) Linux
• Disk storage – Network Attached Storage (NAS), mostly mirrored
• Tape storage – currently STK robots; future system under evaluation
• Network – fast Gigabit Ethernet switches connected to multi-gigabit backbone routers
Data Challenges
• ALICE
  – PDC04 used AliEn services, native or interfaced to the LCG Grid. 400,000 jobs were run, producing 40 TB of data for the Physics Performance Report.
  – PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn.
• ATLAS
  – Used tools and resources from LCG, NorduGrid and Grid3 at 133 sites in 30 countries, with over 10,000 processors; 235,000 jobs produced more than 30 TB of data using an automatic production system.
• CMS
  – 100 TB of simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there.
• LHCb
  – LCG provided more than 50% of the capacity for the first data challenge, 2004-2005. The production used the DIRAC system.
Service Challenges
• A series of Service Challenges (SC) set out to successively approach the production needs of LHC
• While SC1 did not meet its goal of transferring continuously for 2 weeks at a rate of 500 MB/s, SC2 exceeded that goal by sustaining a throughput of 600 MB/s to 7 sites.
• SC3 starts soon, using gLite middleware components, with disk-to-disk throughput tests and 10 Gb networking of Tier-1s to CERN, providing an SRM (1.1) interface to managed storage at the Tier-1s. The goal is to achieve 150 MB/s disk-to-disk and 60 MB/s to managed tape. There will also be Tier-1 to Tier-2 transfer tests (the rates are converted to daily volumes in the sketch below).
• SC4 aims to demonstrate that all requirements from raw data taking to analysis can be met at least 6 months prior to data taking. The aggregate rate out of CERN is required to be 1.6 GB/s to tape at Tier-1s.
• The Service Challenges will turn into production services for the experiments.
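To put these targets in perspective, here is a simple conversion of the quoted rates into daily data volumes (plain unit arithmetic on the numbers above; nothing else is assumed):

```
# Rough conversion of the Service Challenge throughput targets into TB/day.
SECONDS_PER_DAY = 86_400

targets_mb_per_s = {
    "SC2 achieved (aggregate)":        600,
    "SC3 disk-to-disk per Tier-1":     150,
    "SC3 to managed tape per Tier-1":   60,
    "SC4 aggregate out of CERN":      1600,   # 1.6 GB/s
}

for name, rate in targets_mb_per_s.items():
    tb_per_day = rate * SECONDS_PER_DAY / 1e6   # MB -> TB
    print(f"{name:32s} {rate:5d} MB/s ~= {tb_per_day:6.1f} TB/day")
```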
Backup slide Key dates for Service Preparation
(Timeline 2005-2008: Sep 05 – SC3 service phase; May 06 – SC4 service phase; Sep 06 – initial LHC service in stable operation; Apr 07 – LHC service commissioned; followed by cosmics, first beams, first physics and the full physics run.)

• SC3 – reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 1 GB/s, including mass storage 500 MB/s (150 MB/s & 60 MB/s at Tier-1s)
• SC4 – all Tier-1s, major Tier-2s – capable of supporting the full experiment software chain, including analysis – sustain nominal final grid data throughput (~1.5 GB/s mass storage throughput)
• LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput