

Computing and LHCb

Raja Nandakumar

The LHCb experiment

The universe is made of matter, and it is still not clear why: Andrei Sakharov's theory points to CP violation

Study CP violation: indirect evidence of new physics

There are many other questions (of course)

The LHCb experiment has been built in the hope of answering some of these questions

The LHCb detector

Photos: February 2002 (cavern ready for detector installation) and August 2008

How the data looks

The detector records more than 1 million channels of data every bunch crossing, with 25 ns between bunch crossings

The trigger reduces this to about 2000 events/sec, i.e. ~7 million events/hour, with a raw event size of ~25 kB

That is about 4.3 TB/day assuming continuous operation (in practice there are breaks for fills, etc.); not as much as ATLAS / CMS, but still … (a quick check of these numbers is sketched below)

These events need to be farmed out of CERN, reconstructed and stripped at the Tier-1s, then replicated to all LHCb Tier-1 sites

Finally available for user analysis
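A quick back-of-the-envelope check, in plain Python and using only the rates quoted above, reproduces the ~4.3 TB/day figure:

    # Back-of-the-envelope check of the data volume quoted above.
    trigger_rate_hz = 2000          # events per second after the trigger
    raw_event_size_bytes = 25e3     # ~25 kB per raw event

    events_per_hour = trigger_rate_hz * 3600
    bytes_per_day = trigger_rate_hz * raw_event_size_bytes * 86400

    print(f"{events_per_hour / 1e6:.1f} million events/hour")   # ~7.2 million
    print(f"{bytes_per_day / 1e12:.1f} TB/day")                  # ~4.3 TB/day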

The LHCb computing model

A sketch of the data flow between CERN (the T0) and the Tier-1 / Tier-2 sites:

Production (T2/T1/T0): simulation + digitization → .digi files

Reconstruction (T1/T0): .digi → .rdst files

Stripping (T1/T0): .rdst (together with the .digi) → .dst files

The .dst files are replicated between the T0 and the T1 sites using FTS

User analysis (T1/T0) runs on the .dst files
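Purely as an illustration (this is not LHCb or DIRAC code), the chain above can be written down as a small table of stages; the stage names, tiers and file extensions mirror the diagram, everything else is made up for the example:

    # Illustrative sketch of the processing chain from the computing-model diagram.
    from collections import namedtuple

    Stage = namedtuple("Stage", "name tiers inputs output")

    LHCB_CHAIN = [
        Stage("Production (simulation + digitization)", ("T2", "T1", "T0"), (), ".digi"),
        Stage("Reconstruction", ("T1", "T0"), (".digi",), ".rdst"),
        Stage("Stripping", ("T1", "T0"), (".rdst", ".digi"), ".dst"),
        Stage("User analysis", ("T1", "T0"), (".dst",), None),
    ]

    for s in LHCB_CHAIN:
        ins = " + ".join(s.inputs) or "generator input"     # first stage has no file input
        out = s.output or "user output"                      # last stage produces user-level output
        print(f"{s.name:40s} [{'/'.join(s.tiers)}] : {ins} -> {out}")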

LHCb job submission

Computing is distributed all over the world: particle physics is collaborative across institutes in many nations, and both CPU and storage are available at the various sites

Welcome to the world of grid computing: take advantage of distributed resources, with fault-tolerant job execution

The framework is also set up for other disciplines: medicine, chemistry, space science, …

LHCb interface: DIRAC

What the user sees …

Submit the job to the "grid" via Ganga (ATLAS/LHCb); sometimes this needs a lot of persuasion

Usually the job comes back successful

On occasion problems are seen, frequently wrong parameters, code, …

Correct and resubmit
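For a flavour of what this looks like, here is a minimal sketch of a Ganga session submitting a trivial job through the DIRAC backend. The Job, Executable and Dirac names follow the standard Ganga plugins, but the exact options depend on the Ganga/DIRAC versions in use, so treat it as illustrative rather than a recipe:

    # Inside the interactive `ganga` prompt (Job, Executable, Dirac are
    # provided by Ganga's plugins; shown here as an illustrative sketch).
    j = Job(name="hello-grid")
    j.application = Executable(exe="/bin/echo", args=["Hello from the grid"])
    j.backend = Dirac()      # send the job through DIRAC rather than a local batch queue
    j.submit()

    # Later: check the status and, if something went wrong, fix and resubmit.
    print(j.status)          # 'submitted' -> 'running' -> 'completed' (or 'failed')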

What the user does not see …

Requirements on DIRAC: fault tolerance

Retries, duplication, failover; caching, watchdogs, logs

Guard against all the possible grid problems: network failures and timeouts, drive failures, overloaded machines and services, systems being hacked, bugs in the code, thread safety, fire and cooling problems, …

If it cannot go wrong, it still will (a retry/failover sketch follows below)
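This is not DIRAC code, just a generic sketch of the retry-with-failover idea listed above; the upload_to helper and the storage-element names are hypothetical:

    import time

    # Hypothetical helper, purely for illustration; pretend the service is down.
    def upload_to(storage_element: str, path: str) -> None:
        raise IOError(f"{storage_element} unavailable")

    def upload_with_failover(path, storage_elements, retries_per_se=3, backoff_s=5):
        """Try each storage element in turn, retrying each a few times,
        so that a single failing service does not lose the output."""
        for se in storage_elements:
            for attempt in range(1, retries_per_se + 1):
                try:
                    upload_to(se, path)
                    return se                   # success: record where the file went
                except IOError as err:
                    print(f"attempt {attempt} at {se} failed: {err}")
                    time.sleep(backoff_s)       # simple backoff before retrying
        raise RuntimeError(f"all storage elements failed for {path}")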

Submitting jobs on the grid

Two ways of submitting jobs

Push jobs out to a site's batch system: the grid is then simply a set of batch systems, and the job waits at the site until it runs

We (LHCb) lose control of jobs once they leave us, and many things can change between submission and running: load on the site, temporary site downtime, a change in job priority within the experiment

We only see the batch systems / queues, not the status of the grid in real time; in previous experience this was a cause of the low success rate

Pull jobs into the site: pilot jobs

Pilot jobs

"Wrapper" jobs submitted to a site

If the site is available and free and there are waiting jobs, the pilot reports information about the node as it is right now

The pilot looks at the local environment and requests a job from DIRAC; jobs may have resource requirements too

DIRAC returns the highest-priority waiting job that matches the available resources; job prioritisation is internal to DIRAC, which has the latest information on experiment priorities

The pilot exits after a short delay if no matching job is found

This gives a fine-grained (worker-node level) view of the grid, a very high job success rate, and very simple requirements for sites; the approach was pioneered by LHCb (a simplified match loop is sketched below)
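A highly simplified sketch of that pull model; request_matching_job, run and the worker_info fields are hypothetical stand-ins, and the real DIRAC matcher and job agent are far more elaborate:

    import time
    import platform

    # Hypothetical stand-in for the central matcher, for illustration only.
    def request_matching_job(worker_info):
        """Ask the task queue for the highest-priority waiting job whose
        requirements fit this worker node; return None if nothing matches."""
        return None

    def run(job):
        print(f"running job {job}")

    def pilot_main(max_idle_cycles=3, idle_sleep_s=60):
        # 1. Inspect the local environment on the worker node.
        worker_info = {
            "os": platform.platform(),
            "cpu_count": 4,        # in reality probed from the node
            "local_disk_gb": 20,   # ditto
        }
        # 2. Repeatedly ask for work; exit after a short idle period so a
        #    problematic node cannot hold on to pilots forever.
        for _ in range(max_idle_cycles):
            job = request_matching_job(worker_info)
            if job is None:
                time.sleep(idle_sleep_s)   # short delay, then try again / give up
                continue
            run(job)
        # Pilot exits; the site batch slot is freed.

    if __name__ == "__main__":
        pilot_main(idle_sleep_s=1)   # short sleep for demonstration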

DIRAC does all of the above, but refinements are still needed (as always)

Job prioritisation is still static; dynamic job prioritisation is on the way

The basic logs are all in place, but not everything is easy to view for a user / shifter; this is being improved

More improvements in resilience are upcoming

DIRAC portal: http://lhcbweb.pic.es

All the information LHCb users need: locating data, job monitoring, …

Restricted information for outsiders (grid privacy issues)

Ganga + DIRAC is the only official LHCb grid interface; it will support any reasonable use case

Successes …

A single machine serves as the DIRAC server, with no particular load issues seen

Analysis is also going on: comparison of different Monte Carlo samples

The occasional problem

Black hole worker nodes: a bad environment that cannot match any job becomes a sink for our pilot jobs (and once for production jobs too, during the migration from SL3 to SL4); mitigated by introducing a short sleep before the pilot exits

"DoS attack" on the CERN servers: software was downloaded from CERN whenever it was not available locally; now users do not install the software themselves

We do not understand …

Very, very preliminary; still working on understanding this

Plot: CPU time, scaled to the median for the CPU class, for the "same" class of CPUs at different sites
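The scaling in that plot is a per-class normalisation; a minimal sketch with made-up numbers (the slide quotes none) might look like this:

    from statistics import median
    from collections import defaultdict

    # Made-up (cpu_class, site, cpu_time_s) records, purely for illustration.
    records = [
        ("xeon-5160", "siteA", 1200.0), ("xeon-5160", "siteA", 1250.0),
        ("xeon-5160", "siteB", 1600.0), ("xeon-5160", "siteB", 1580.0),
    ]

    # Median CPU time per CPU class ...
    by_class = defaultdict(list)
    for cpu_class, _site, t in records:
        by_class[cpu_class].append(t)
    medians = {c: median(ts) for c, ts in by_class.items()}

    # ... then each job's CPU time divided by its class median, so that
    # "identical" CPUs at different sites can be compared directly.
    for cpu_class, site, t in records:
        print(cpu_class, site, round(t / medians[cpu_class], 2))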

Now over to ATLAS …