Henry Nebrensky – Data Flow Workshop – 30 June 2009

MICE Data Flow Workshop

Henry Nebrensky

Brunel University


MICE and Grid Data Storage

The Grid provides MICE not only with computing (number-crunching) power, but also with a secure global framework allowing users access to data

Good news: storing development data on the Grid keeps it available to the collaboration – not stuck on an old PC in the corner of the lab

Bad news: loss of ownership – who picks up the data curation responsibilities?

Data can be downloaded from the Grid to a user’s “own” PC – it doesn’t need to be analysed remotely


MICE Data Flow

The basic data flow in MICE is thus something like:

The raw data files from the experiment are sent to tape using Grid protocols, including registering the files in the LFC.

The offline reconstruction can then use Grid/LFC to pull down the raw data, and upload reconstructed (“RECO” or DST) files.

Users can use Grid/LFC to access RECO files they want to play with.

Combining the above description with the Grid and work being done by current users gives:


MICE Data Flow

Diagram


Short-dashed lines indicate entities that still need confirmation

Question marks indicate even higher levels of uncertainty

More details in MICE Note 252

The diagram would look pretty much the same if non-Grid tools were used

[Figure 1: Data flow from the MICE experiment. Short-dashed entities require confirmation. Long-dashed lines represent borders between subnets.]

Diagram labels:

Networks: MiceNet DAQ network; Controls & Monitoring network; ISIS or PPD network; Tier 1 network

Online: MICE DAQ; Online Buffer; Online Farm; Online Reconstruction; Transfer Box (MICEACQ05); Create file metadata?

Data products: RAW; RECO (ROOT trees); Ephemeral ROOT histograms; Analysis results; results archive?

Links: 1 Gbps to Tier 1 (Optical Fibre?); 2 TB/day = 200 Mbps; Backup link to Tier 1? Failover link to Tier 2?

Storage: CASTOR tape; CASTOR disk; Tier 2 SE disk; User local disk

Processing: Offline Reconstruction (semi-automated process); MC simulation; “Chaotic” (on-demand) analysis; UK GridPP Tier 2 Farms; Any Tier 2 Farm

Credentials: “Robot”? certificate, VOMS “archiver”? role, MICE_RAW_TAPE? token; “Anointed user” certificate, VOMS “production” role, MICE_RECO_DISK? token; Generic user certificate

Services: LFC; AMGA; VOMS, DNS, NTP; MyProxy; BDII

Databases: Conditions Database (EPICS); Configuration Database?; Abstraction Layer


MICE Data and the Grid


Storage, archiving and dissemination of experimental data:

Not been a high priority so far

Overall strategy not documented anywhere obvious

Individual work on parts of this – but do the pieces fit together?

Grid:

Certain Grid services are separately funded to provide a production service to MICE

Provides a ready-made set of building blocks – but “we” have to put them together

MICE need to know what they want to do, to make sure that the finished edifice meets all their needs (and that the Grid includes all the necessary bricks)


Decision Time

We need to start putting the pieces together NOW, including requesting sufficient resources from outside bodies.

=> need an agreed plan in the VERY near future

There are a number of unresolved issues – see Note 252 and the data flow diagram.

Data volumes, lifetime and access control mostly unclear

(LFC) File naming scheme – see MICE Note 247

File metadata requirements – raised at CM23 and CM24

Management and administration

Hence this workshop.


MICE Data Unknowns

MICE Note 252 identifies four main flavours of data: RAW, RECO, analysis results, and Monte Carlo simulation.

For all four, we need to understand the:

volume (the total amount of data, the rate at which it will be produced, and the size of the individual files in which it will be stored)

lifetime (ephemeral or longer lasting? will it need archiving to tape? replication?)

access control (who will create the data? who is allowed to see it? can it be modified or deleted, and if so who has those privileges?)

“service level” (desired availability? allowable downtime?)

Also need to identify use cases I’ve missed, especially ones that will need more VOMS roles or CASTOR space tokens.


What will users want to do?

Another way of answering the same question comes from looking at what users will want to do: which sorts of data will they want access to, how much of it and how often:

RAW or just RECO?

Selected runs or a whole step at a time?

Daily? Monthly?


MICE Data - RAW

For RAW data:

volume (the total amount of data: 27 TB; the rate at which it will be produced: 30 MB/s; the size of the individual files in which it will be stored: 1-2 GB)

lifetime (ephemeral or longer lasting: permanent. Will it need archiving to tape: yes. Replication?)

access control (who will create the data: archiver. Who is allowed to see it: all. Can it be modified or deleted, and if so who has those privileges: no)

“service level” (desired availability: write 24/7 if ISIS up. Allowable outage: 48 hrs)

(Tape Storage – see http://www.gridpp.rl.ac.uk/blog/2009/06/10/step09-tape-drive-performance/ and http://www.gridpp.rl.ac.uk/blog/2009/06/12/step09-tape-migration-stream-policies/ )

(Based on the CM24 500 million events figure. That implies that all MICE steps add up to less than a fortnight’s data taking – is that right?)
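The “less than a fortnight” remark can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), continuous writing at the quoted rate, and that the 27 TB total corresponds to the 500 million events; the function name and dictionary keys are purely illustrative.

```python
# Back-of-envelope check of the RAW figures quoted on the slide.
# Assumptions (not from the slides): decimal units and continuous data taking.
TB = 10**12
MB = 10**6
SECONDS_PER_DAY = 86_400


def raw_estimates(total_tb=27, rate_mb_s=30, n_events=500_000_000):
    """Derive data-taking time, daily volume, link rate and event size."""
    total_bytes = total_tb * TB
    rate_bytes_s = rate_mb_s * MB
    seconds = total_bytes / rate_bytes_s
    return {
        "days_of_data_taking": seconds / SECONDS_PER_DAY,
        "tb_per_day": rate_bytes_s * SECONDS_PER_DAY / TB,
        "link_mbps": rate_bytes_s * 8 / 10**6,
        "kb_per_event": total_bytes / n_events / 1000,
    }


if __name__ == "__main__":
    for key, value in raw_estimates().items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions 27 TB at 30 MB/s is about 10.4 days of continuous writing, which agrees with the “less than a fortnight” reading, and the implied average is 54 kB per event.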


MICE Data - RECO

For RECO data:

volume (the total amount of data: ???; the rate at which it will be produced: ??? MB/s; the size of the individual files in which it will be stored: ??? GB)

lifetime (ephemeral or longer lasting: ???. Will it need archiving to tape: ???. Replication: ???)

access control (who will create the data: ???. Who is allowed to see it: all. Can it be modified or deleted, and if so who has those privileges: ???)

“service level” (desired availability: write ??? if ISIS up. Allowable outage: ??? hrs)

(I’ve seen a claim of 6 TB for RECO data somewhere)


MICE Data - Analysis

For analysis output:

volume (the total amount of data: ???; the rate at which it will be produced: ??? MB/s; the size of the individual files in which it will be stored: ??? GB)

lifetime (ephemeral or longer lasting: ???. Will it need archiving to tape: ???. Replication: ???)

access control (who will create the data: ???. Who is allowed to see it: all. Can it be modified or deleted, and if so who has those privileges: ???)

“service level” (desired availability: write ??? if ISIS up. Allowable outage: ??? hrs)


MICE Data - Simulation

For simulations:

volume (the total amount of data: ???; the rate at which it will be produced: ??? MB/s; the size of the individual files in which it will be stored: ??? GB)

lifetime (ephemeral or longer lasting: ???. Will it need archiving to tape: ???. Replication: ???)

access control (who will create the data: ???. Who is allowed to see it: all. Can it be modified or deleted, and if so who has those privileges: ???)

“service level” (desired availability: write ??? if ISIS up. Allowable outage: ??? hrs)


MICE Data - Other

What other data is to be kept?

volume (the total amount of data: ???; the rate at which it will be produced: ??? MB/s; the size of the individual files in which it will be stored: ??? GB)

lifetime (ephemeral or longer lasting: ???. Will it need archiving to tape: ???. Replication: ???)

access control (who will create the data: ???. Who is allowed to see it: all. Can it be modified or deleted, and if so who has those privileges: ???)

“service level” (desired availability: write ??? if ISIS up. Allowable outage: ??? hrs)

I know about the Tracker QA data.
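The same four questions are asked of every data flavour, so the answers collected so far fit naturally into one record per flavour. The sketch below is only an illustration of that bookkeeping: the class name, field names and the `unknowns` helper are hypothetical, and the only values filled in are the ones stated on the preceding slides.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class DataFlavour:
    """One record per data flavour; None marks a question still open."""
    name: str
    total_volume_tb: Optional[float] = None
    rate_mb_per_s: Optional[float] = None
    file_size_gb: Optional[Tuple[float, float]] = None
    permanent: Optional[bool] = None
    archive_to_tape: Optional[bool] = None
    replication: Optional[str] = None
    creator: Optional[str] = None
    readers: Optional[str] = None
    modifiable: Optional[bool] = None
    availability: Optional[str] = None
    allowable_outage_hrs: Optional[float] = None

    def unknowns(self) -> List[str]:
        """Names of the fields that have not been decided yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Values for RAW are the ones quoted on the RAW slide.
RAW = DataFlavour(
    "RAW", total_volume_tb=27, rate_mb_per_s=30, file_size_gb=(1, 2),
    permanent=True, archive_to_tape=True, creator="archiver", readers="all",
    modifiable=False, availability="write 24/7 if ISIS up",
    allowable_outage_hrs=48,
)
# For the other flavours only "who is allowed to see it: all" is stated.
FLAVOURS = [RAW] + [
    DataFlavour(name, readers="all")
    for name in ("RECO", "analysis", "simulation", "other")
]

if __name__ == "__main__":
    for flavour in FLAVOURS:
        print(flavour.name, "still unknown:", ", ".join(flavour.unknowns()))
```

Listing the open fields per flavour gives a checklist of what still has to be settled.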


Data Integrity

(For recent SE releases) a checksum is calculated automatically when a file is uploaded.

This can be checked when the file is transferred between SEs, or the value retrieved to check local copies.

Should we also do it ourselves before uploading the file in the first place, or should we use “compression” (can check integrity with gunzip -t …)?

(Default algorithm is Adler32 – lightweight + effective)
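Both options mentioned above can be tried locally with the Python standard library. This is a minimal sketch, not the SE implementation: the function names are illustrative, and rendering the checksum as 8 lowercase hex digits is an assumption about how one would compare it with the value held by the storage element.

```python
import gzip
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler32 checksum of a file, reading it in chunks."""
    value = 1  # Adler32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file and discard the output, like `gunzip -t`."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header or CRC mismatch (OSError), truncation (EOFError),
        # or a corrupt deflate stream (zlib.error).
        return False
```

Computing the checksum before upload would allow the local value to be compared with the one the SE calculates on arrival. The gzip route only detects corruption of the compressed file itself.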


Metadata Catalogue

For many applications – such as analysis – you will want to identify the list of files containing the data that matches some parameters

This is done by a “metadata catalogue”. For MICE this doesn't yet exist

A metadata catalogue can in principle return either the GUID or an LFN – it shouldn’t matter which as long as it’s properly integrated with the other Grid services.


MICE Metadata Catalogue

We need to select a technology to use for this:

use the configuration database? (no)

gLite AMGA (who else uses it – will it remain supported?)

?

Need to implement – i.e. register metadata to files

What metadata will be needed for analysis?

Should the catalogue include the file format and compression scheme (gzip ≠ PKzip)?


MICE Metadata Catalogue for Humans

or, in non-Gridspeak:

we have several databases (configuration DB, EPICS, e-Logbook) where we should be able to find all sorts of information about a run/timestamp.

but how do we know which runs to be interested in, for our analysis?

we need an “index” to the MICE data, and for this we need to define the set of “index terms” that will be used to search for relevant datasets.


If I wanted to analyse some data,

…I might search for all events with a particular:

Run, date/time

Step

Beam – e-, π, p, μ (back or forward)

Nominal 4-d / transverse normalised emittance

Diffuser setting

Nominal momentum

Configuration: magnet currents (nominal), physical geometry

Absorber material

Some RF parameter?

MC Truth?

Anything else?
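To make the idea of “index terms” concrete, here is a toy in-memory catalogue that maps a few of the terms listed above to LFNs. Everything in it is hypothetical: the field names, the example entries, and the LFN paths are invented for illustration, and a real catalogue (e.g. AMGA) would have its own schema and query interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunEntry:
    """Index terms for one file; the choice of fields is illustrative only."""
    lfn: str
    run: int
    step: str
    beam: str
    nominal_momentum_mev: int
    absorber: str


# Hypothetical entries; neither the values nor the paths are real MICE data.
CATALOGUE = [
    RunEntry("/grid/mice/MICE/example/run0001.dat", 1, "I", "mu", 200, "none"),
    RunEntry("/grid/mice/MICE/example/run0002.dat", 2, "I", "pi", 200, "none"),
    RunEntry("/grid/mice/MICE/example/run0003.dat", 3, "II", "mu", 140, "LH2"),
]


def find_lfns(catalogue: List[RunEntry], **criteria) -> List[str]:
    """Return the LFNs of all entries whose index terms equal the criteria."""
    return [
        entry.lfn
        for entry in catalogue
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    print(find_lfns(CATALOGUE, beam="mu"))
    print(find_lfns(CATALOGUE, step="I", nominal_momentum_mev=200))
```

The set of fields on `RunEntry` is exactly the thing that needs agreeing: a term that is not indexed cannot be searched on later.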


People and Roles

A “role” is a combination of duties and privileges, with a specific aim. These are distinct from those of the person fulfilling that role.

The Operations Manager (“MOM”) is an example of a continuous role, enacted by different people over time.

Some roles may be so specialised that only a particular person can do them; others can have many people in them at the same time.

=> Don’t equate roles with FTEs!

For the data flow, the privileges associated with roles are enforced by VOMS. They may also require space tokens to be set up (on a site-by-site basis).


Management

Roles identified so far:

Online reconstruction manager

Archiver (storage of RAW data to tape)

Archivist (storage of miscellaneous data to tape)

Offline reconstruction manager

Data Manager (moving data around Tier2s, LFC consistency)

Simulation Production Manager

Analysis manager?

VO manager


File Catalogue Namespace (1)

Also, we need to agree on a consistent namespace for the file catalogue

Proposal (MICE Note 247, Grid talk at CM23):

We get given /grid/mice/ by the server

Five upper-level directories:

Construction/ – historical data from detector development and QA

Calibration/ – needed during analysis (large datasets, c.f. DB)

TestBeam/ – test beam data

MICE/ – DAQ output and corresponding MC simulation


File Catalogue Namespace (2)

/grid/mice/users/name

For people to use as scratch space for their own purposes, e.g. analysis

Encourage people to do this through LFC – helps avoid “dark data”

LFC allows Unix-style access permissions

Again, the LFC namespace is something that needs to be finalised before production data can start to be registered.
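A small helper can keep registered names inside the agreed namespace. The sketch below assumes the five upper-level directories are Construction, Calibration, TestBeam, MICE and users (the slides name the first four and show users/ separately); the function name and the validation rules are illustrative, not part of any LFC tool.

```python
import posixpath

LFC_ROOT = "/grid/mice"
# Assumption: "users" is the fifth upper-level directory.
TOP_LEVEL = ("Construction", "Calibration", "TestBeam", "MICE", "users")


def make_lfn(top_level, *parts):
    """Build an LFN under /grid/mice/, rejecting names outside the scheme."""
    if top_level not in TOP_LEVEL:
        raise ValueError(f"unknown upper-level directory: {top_level!r}")
    for part in parts:
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(LFC_ROOT, top_level, *parts)


if __name__ == "__main__":
    # Scratch space for a user called "name", as on the slide.
    print(make_lfn("users", "name", "analysis.root"))
```

Rejecting unknown top-level directories at the point where names are built is one way to stop ad hoc paths creeping into the catalogue before the namespace is finalised.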


The VOMS server (1)

File permissions will be needed, e.g. to ensure that users can’t accidentally delete RAW data. These rules will need to last for at least the life of the experiment.

VOMS is a Grid service that allows us to define specific roles (e.g. DAQ data archiver) which will then be allowed certain privileges (such as writing to tape at RAL Tier 1).

The VOMS service then maps humans to those roles, via their Grid certificates.

Thus the VOMS service provides us with a single portal where we can add/remove/reassign Mice, without needing to negotiate with the operators of every Grid resource worldwide – we actually keep control “in-house.”
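The indirection described above (people map to roles, roles map to privileges) can be modelled in a few lines. This is a toy model only and does not use any VOMS API: the class, the privilege strings and the certificate DNs are hypothetical, while the role and space-token names follow the unconfirmed labels in the data flow diagram.

```python
# Toy model of "person -> role -> privilege"; not a VOMS client.
ROLE_PRIVILEGES = {
    "archiver": {"write:MICE_RAW_TAPE"},
    "production": {"write:MICE_RECO_DISK"},
}
MEMBER_PRIVILEGES = {"read"}  # assumed: every VO member may read


class VoSketch:
    def __init__(self):
        self._roles = {}  # certificate DN -> set of role names

    def add_member(self, dn):
        self._roles.setdefault(dn, set())

    def assign(self, dn, role):
        if role not in ROLE_PRIVILEGES:
            raise ValueError(f"unknown role: {role!r}")
        self._roles.setdefault(dn, set()).add(role)

    def revoke(self, dn, role):
        self._roles.get(dn, set()).discard(role)

    def allowed(self, dn, action):
        """True if the DN is a member and one of its roles grants the action."""
        if dn not in self._roles:
            return False
        granted = set(MEMBER_PRIVILEGES)
        for role in self._roles[dn]:
            granted |= ROLE_PRIVILEGES[role]
        return action in granted


if __name__ == "__main__":
    vo = VoSketch()
    alice = "/C=XX/O=Example/CN=Alice Example"  # hypothetical DN
    bob = "/C=XX/O=Example/CN=Bob Example"      # hypothetical DN
    vo.add_member(alice)
    vo.add_member(bob)
    vo.assign(alice, "archiver")
    print(vo.allowed(alice, "write:MICE_RAW_TAPE"), vo.allowed(bob, "write:MICE_RAW_TAPE"))
    # Handing the role over touches only the membership table.
    vo.revoke(alice, "archiver")
    vo.assign(bob, "archiver")
    print(vo.allowed(alice, "write:MICE_RAW_TAPE"), vo.allowed(bob, "write:MICE_RAW_TAPE"))
```

Reassigning the archiver role changes only the membership table and leaves the privilege definitions alone, which is why sites do not need to be contacted when people change.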


The VOMS server (2)

The MICE VOMS server is provided via GridPP at Manchester, UK.

New Mice are added or assigned to roles by the VO Manager (and Mouse) Paul Hodgson.

The RAL Castor must query the VOMS server to authorise every transaction – should we move it somewhere local?


Other Questions

How does one run become the next – what triggers it, who confirms, how is it propagated?

How does “data” come out of the DAQ and get turned into files?

How do we know a run is complete => that a file is closed?

Does the GB file size for CASTOR match the online reco sample rate? => Could the data mover trigger the online reco?

That ol’ online buffer round-robin thang

Replication of data to other Tier1s

Should EPICS monitor the data mover?


Actions arising

Work out desired CASTOR resources, interface (SRM?) and QoS. Meet with Tier1 and iterate.

Draw up list of VOMS roles and get them created.

Draw up list of space tokens by Tier and role.

Create LFC namespace, set permissions, upload existing data

Get archiver robot certificate

Identify needed Tier2 resources
