
Status of PDC’07 and user analysis issues (from admin point of view)

Page 1: Status of PDC’07 and user analysis issues (from admin point of view)

Status of PDC’07 and user analysis issues (from admin point of view)

L. Betev
August 28, 2007

Page 2: Status of PDC’07 and user analysis issues (from admin point of view)

GSI Darmstadt

The ALICE Grid

- Powered by AliEn
- Interfaces to gLite, ARC and (future) OSG WMS
- As of today: 65 entry points (62 sites), 4 continents
  - Africa (1), Asia (4), Europe (53), North America (4)
  - 21 countries, 1 consortium (NDGF)
  - 6 Tier-1 (MSS capacity) sites, 58 Tier-2
- All together: ~5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape
  - Contribution range: from 4 to 1200 CPUs
  - PIII, PIV, Itanium, Xeon, AMD
- All Linux: Mandriva, Suse to Ubuntu, mostly SL3/4, no Gentoo
  - Plus all possible kernel + gcc combinations

Page 3: Status of PDC’07 and user analysis issues (from admin point of view)

The ALICE Grid (2)

62 active sites

Page 4: Status of PDC’07 and user analysis issues (from admin point of view)

Operation

ALICE offline is:
- Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
- Organising (guided by the requirements of the PWGs) and running the production
- Updating and operating the AliEn site services (together with the regional experts)
- Supporting user analysis

Sites are:
- Hosting the VO-boxes (interface to site services)
- Operating the local services (gLite and site fabric)
- Providing CPU and storage

This model:
- Has been in operation with minor modifications for several years and is working quite well for production
- Requires minor modification to support a large user community, mostly in the area of user support

Page 5: Status of PDC’07 and user analysis issues (from admin point of view)

History of PDCs

- Exercise of the ALICE production model
  - Data production / storage / replication
  - Validation of AliRoot
  - Validation of Grid software and operation
  - User analysis (not yet an integral part of the PDC)
- Since April 2006 the PDC has been running continuously

Page 6: Status of PDC’07 and user analysis issues (from admin point of view)

PDC job history

Average of 1500 CPUs running continuously since April 2006

Page 7: Status of PDC’07 and user analysis issues (from admin point of view)

PDC job history - zoom on last 2 months

2900 jobs on average, saturating all available resources

Page 8: Status of PDC’07 and user analysis issues (from admin point of view)

Site performance

Typical operation:
- Up to 10% of the sites are not in production at any given moment
- Half of these are undergoing scheduled upgrades
- The other half have Grid or local service failures
- T1s are in general more stable than T2s
- Some T2s are much better than any of the T1s

Achieving better stability of the services at the computing centres is a top priority of all parties involved

The central services availability is better than 95%
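
As a rough illustration of the figures above, the following sketch applies the "up to 10%" bound to the 62 sites quoted earlier in the talk. The function name and the even split into two halves are a direct reading of the bullets; nothing here comes from actual monitoring data.

```python
def sites_out_of_production(total_sites, max_fraction_out=0.10):
    """Upper bound on sites out of production at a given moment,
    split evenly between scheduled upgrades and service failures."""
    out = total_sites * max_fraction_out
    return {"out": out, "upgrades": out / 2, "failures": out / 2}


# 62 sites is the number quoted on the "ALICE Grid" slide.
estimate = sites_out_of_production(62)
print(estimate)
```

With 62 sites this gives an upper bound of about six sites out of production, roughly three for scheduled upgrades and three for Grid or local service failures.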

Page 9: Status of PDC’07 and user analysis issues (from admin point of view)

Production status

Total 85,837,100 events as of 26/08/2007 24:00 hours

Page 10: Status of PDC’07 and user analysis issues (from admin point of view)

Sites contributions

Standard distribution: 50/50 T1/T2 contribution

Page 11: Status of PDC’07 and user analysis issues (from admin point of view)

Relative contribution - Germany

Standard distribution: 50/50 T1/T2 contribution

15% of total

Page 12: Status of PDC’07 and user analysis issues (from admin point of view)

Efficiencies/debugging

- Workload management for production
  - Under control and near production quality
  - We keep saying that, but this time we really mean it
  - Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
- Support and debugging
  - The overall situation is much less fragile now
  - Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  - gLite services at the sites are (mostly) well understood and supported
- User support is still very much in need of improvement
  - The issues with user analysis are often unique and sometimes lead to development of new functionality
  - But at least the response time (if not the solution) is quick

Page 13: Status of PDC’07 and user analysis issues (from admin point of view)

General

- The Grid is getting better
  - Running conditions are improving
  - The Grid middleware in general and AliEn in particular are quite stable, after long and hard work by the developers
  - Even user analysis, much derided in the past few months, is finally not a painful exercise
- The operation is more streamlined now
  - Better understanding of running conditions and problems by the experts
- We continue with the usual PDC’07 programme
  - Simulation/reconstruction of MC events
  - Validation of new middleware components
  - User analysis
  - And in addition the Full Dress Rehearsal (FDR)

Page 14: Status of PDC’07 and user analysis issues (from admin point of view)

User analysis issues - short list

Major issues - February/June 2007
- Jobs do not start, are lost, or their output is missing
- Input data collections are difficult to handle and impossible to process at once
- Priorities are not set: a single user can ‘grab’ all resources
- Unclear definition of storage elements (Disk/MSS)

Page 15: Status of PDC’07 and user analysis issues (from admin point of view)

User analysis issues - short list (2)

What has been done
- Failover CE for user queue (Grid partition ‘Analysis’)
  - Since 20 June - 100% availability
- Pre-staging of data (available on spinning media) and creation of XML collections centrally
  - The availability of the pre-staged files is checked periodically
- More robust central services (see previous slides)
- Use of a dedicated SE for user files - this will be transparently increased to multiple SEs with quotas
- Priority mechanism (not the final version) put in place
  - We haven’t had reports of unfair use
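
The periodic availability check on pre-staged files could look roughly like the sketch below. It is only an illustration: the slides do not describe the collection schema or the check itself, so the XML layout, the file paths, and the `is_available` probe are all hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified collection layout and file names; the real
# AliEn XML collection schema is not described in the slides.
SAMPLE_COLLECTION = """
<collection name="example">
  <file lfn="/alice/sim/run1/AliESDs.root"/>
  <file lfn="/alice/sim/run2/AliESDs.root"/>
  <file lfn="/alice/sim/run3/AliESDs.root"/>
</collection>
"""


def list_lfns(collection_xml):
    """Return the logical file names listed in a collection."""
    root = ET.fromstring(collection_xml.strip())
    return [entry.get("lfn") for entry in root.iter("file")]


def check_collection(collection_xml, is_available):
    """Split the collection into available and missing files.

    `is_available` is a caller-supplied probe (an assumption here);
    in practice it would query the storage element.
    """
    available, missing = [], []
    for lfn in list_lfns(collection_xml):
        (available if is_available(lfn) else missing).append(lfn)
    return available, missing


# Stand-in probe: pretend only run1 and run3 are on spinning media.
ONLINE = {"/alice/sim/run1/AliESDs.root", "/alice/sim/run3/AliESDs.root"}
available, missing = check_collection(SAMPLE_COLLECTION, lambda lfn: lfn in ONLINE)
print("available:", available)
print("missing:", missing)
```

Run from a scheduler at a fixed interval, a check of this kind would flag files that have dropped off disk, so they can be re-staged before users hit them.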

Page 16: Status of PDC’07 and user analysis issues (from admin point of view)

Job completion chart

Standard distribution: 50/50 T1/T2 contribution

User jobs

Page 17: Status of PDC’07 and user analysis issues (from admin point of view)

User analysis issues - current

- Storage availability and consistency
  - Still very few working SEs - common storage solutions are not yet ‘production’ quality
  - The effort is now concentrated on CASTOR2 with xrootd
  - Sites (GSI, for example) are installing large xrootd pools - these are tested and working
  - With more SEs holding replicas of the data, the Grid will naturally become more stable
- Availability of specific data sets
  - Dependent on the storage capacity in operation
  - Currently TPC RAW data is being replicated to GSI
  - With CASTOR2+xrootd working, the number of events on spinning media will increase 20x

Page 18: Status of PDC’07 and user analysis issues (from admin point of view)

User analysis issues - current (2)

User applications
- Compatibility of user installation of ROOT, gcc version, OS - a locally compiled application will not necessarily run on the Grid
- All sites are installed with ‘lowest common denominator’ middleware and packages - currently SLC3, gcc v.3.2, while most users have gcc v.3.4
- There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
- Meanwhile, the experts are looking into repackaging the Grid apps (most notably gshell)
- Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid
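
A user could guard against the compiler mismatch described above with a small pre-submission check along these lines. This is a minimal sketch under stated assumptions: the helper names are invented for illustration, the patch-level version strings are examples only, and treating "same major.minor" as "same compiler" is a simplification rather than a rule from the slides.

```python
import re


def compiler_series(version_string):
    """Extract (major, minor) from a version string such as '3.4.6'."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"no version number found in {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def same_compiler(root_gcc, app_gcc):
    """True if ROOT and the user application were built with the same
    gcc series (assumption: matching major.minor is sufficient)."""
    return compiler_series(root_gcc) == compiler_series(app_gcc)


# gcc 3.2 (Grid sites) versus gcc 3.4 (most users), as on the slide;
# the patch levels are illustrative.
print(same_compiler("3.2.3", "3.4.6"))            # False
print(same_compiler("gcc (GCC) 3.4.6", "3.4.4"))  # True
```

The version strings would typically come from the compiler itself (for example, the output of `gcc -dumpversion`) on the machine where ROOT and the application are built.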

