Information I intend to transfer
Why are Grids interesting? Grids are solutions, and solutions should solve a problem, so I will spend some time describing the problem and showing how Grids are relevant to it.
What are Computational Grids?
How are Grids being used, in HEP as well as in other fields?
What are some examples of current Grid “research topics”?
Our Problem
Place event info on 3D map
Trace trajectories through hits
Assign type to each track
Find particles you want
Needle in a haystack!
This is a “relatively easy” case
[Diagram: Data Handling and Computation for Physics Analysis – data flows from the detector through the event filter (selection & reconstruction) to raw data; reconstruction produces event summary data and processed data; event reprocessing and event simulation feed back into the chain; analysis objects (extracted by physics topic) serve batch and interactive physics analysis.]
Computational Aspects
To reconstruct and analyze 1 event takes about 90 seconds
Maybe only a few out of a million are interesting. But we have to check them all!
Analysis program needs lots of calibration, determined by inspecting results of a first pass.
Each event will be analyzed several times!
[Diagram: one of the four LHC detectors feeding the online system – a multi-level trigger that filters out background and reduces the data volume before data recording & offline analysis.]
40 MHz (40 TB/sec) → level 1 (special hardware) → 75 kHz (75 GB/sec) → level 2 (embedded processors) → 5 kHz (5 GB/sec) → level 3 (PCs) → 100 Hz (100 MB/sec) → data recording & offline analysis
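A quick back-of-the-envelope check of the trigger chain's rejection power; a minimal sketch in Python, using the rates from the diagram (the ~1 MB/event size is inferred from 100 Hz ≈ 100 MB/sec):

```python
# Rough check of the LHC multi-level trigger reduction, using the
# rates from the diagram above. Dict order matters (Python 3.7+).
rates_hz = {
    "into level 1": 40e6,
    "after level 1": 75e3,
    "after level 2": 5e3,
    "after level 3": 100.0,
}

stages = list(rates_hz.items())
for (stage_in, r_in), (stage_out, r_out) in zip(stages, stages[1:]):
    print(f"{stage_out}: {r_in / r_out:,.0f}x reduction")

# Overall: 40 MHz in, 100 Hz recorded -> a 400,000x rejection factor.
print(f"overall: {stages[0][1] / stages[-1][1]:,.0f}x")
```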
Computational Implications (2)
90 seconds per event to reconstruct and analyze
100 incoming events per second
To keep up, need either:
a computer that is nine thousand times faster, or
nine thousand computers working together
Moore’s Law: wait 20 years and computers will be 9000 times faster (we need them in 2007!)
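The slide's arithmetic checks out; a minimal sketch (90 s/event and 100 events/s are from the slides; the 18-month doubling period is the usual Moore's-law folklore, my assumption):

```python
import math

seconds_per_event = 90    # reconstruction + analysis time (from slide)
events_per_second = 100   # rate out of the trigger (from slide)

cpus_needed = seconds_per_event * events_per_second
print(f"CPUs needed to keep up: {cpus_needed}")   # 9000

# How long until one CPU is 9000x faster, assuming a
# performance doubling every 18 months?
doublings = math.log2(cpus_needed)
print(f"wait: {doublings * 1.5:.0f} years")       # ~20 years
```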
More Computational Issues
Four LHC experiments – roughly 36k CPUs needed
BUT: accelerator not always “on” – need fewer
BUT: multiple passes per event – need more!
BUT: haven’t accounted for Monte Carlo production – more!!
AND: haven’t addressed the needs of “physics users” at all!
Large Dynamic Range Computing
Clear that we have enough work to keep somewhere between 50k and 100k CPUs busy
Most of the activities are “bursty”, so doing them with dedicated facilities is inefficient
Need some meta-facility that can serve all the activities:
Reconstruction & reprocessing
Analysis
Monte Carlo
All four LHC experiments
All the LHC users
LHC User Distribution
Putting all computers in one spot leads to traffic jams
Which spot is willing to pay for & maintain 100k CPUs?
Classic Motivation for Grids
Large Scales: 50k CPUs, petabytes of data (if we’re only talking ten machines, who cares?)
Large Dynamic Range: bursty usage patterns (why buy 25k CPUs if 60% of the time you only need 900? see the utilization sketch after this list)
Multiple user groups on a single system: can’t “hard-wire” the system for your purposes
Wide-area access requirements: users are not in the same lab, or even on the same continent
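A minimal utilization estimate for the bursty-usage point; the 25k/900/60% figures are from the slide, and the assumption that the farm is fully busy the other 40% of the time is mine (it makes the estimate generous):

```python
# Time-averaged utilization of a dedicated 25k-CPU farm under the
# bursty pattern quoted above. Assumes (generously) that during the
# remaining 40% of the time all 25k CPUs are busy.
cpus = 25_000
avg_load = 0.6 * 900 + 0.4 * cpus            # average CPUs in use
print(f"average utilization: {avg_load / cpus:.0%}")   # ~42%
```

Even under that generous assumption, more than half the dedicated hardware sits idle on average.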
Solution using Grids
Large Scales: assemble 50k+ CPUs and petabytes of mass storage; they don’t need to be in the same place!
Large Dynamic Range: when you need less than you have, others use your excess capacity; when you need more, you use theirs
Multiple user groups on a single system: “generic” grid software services (think web server here)
Wide-area access requirements: Public Key Infrastructure for authentication & authorization
Public Key Infrastructure – Single Login
Based on asymmetric cryptography: public/private key pairs
Need network of trusted Certificate Authorities (e.g. Verisign)
CAs issue certificates to users, computers, or services running on computers (typically valid for two years)
Certificates carry:
an “identity”, e.g. /O=dutchgrid/O=users/O=kvi/CN=Nasser Kalantar
a public key
the identity of the issuing CA
a digital signature (a transformation of the certificate text using the CA’s private key; sketched in code below)
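A minimal sketch of what a CA does, using Python's `cryptography` package (not period-accurate grid middleware): generate key pairs, bind the user's identity and public key into a certificate signed with the CA's private key, and verify that signature with the CA's public key. The distinguished name and two-year validity follow the slide; everything else is illustrative.

```python
from datetime import datetime, timedelta, timezone

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.x509.oid import NameOID

# Asymmetric cryptography: each party holds a public/private key pair.
ca_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
user_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# The "identity" from the slide, as an X.509 distinguished name.
subject = x509.Name([
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, "dutchgrid"),
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, "users"),
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, "kvi"),
    x509.NameAttribute(NameOID.COMMON_NAME, "Nasser Kalantar"),
])
issuer = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Demo CA")])

# The CA issues the certificate: identity + public key + issuing CA,
# signed with the CA's private key, valid for two years.
now = datetime.now(timezone.utc)
cert = (
    x509.CertificateBuilder()
    .subject_name(subject)
    .issuer_name(issuer)
    .public_key(user_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + timedelta(days=2 * 365))
    .sign(ca_key, hashes.SHA256())
)

# Anyone holding the CA's public key can check the signature
# (raises InvalidSignature if the certificate was tampered with).
ca_key.public_key().verify(
    cert.signature,
    cert.tbs_certificate_bytes,
    padding.PKCS1v15(),
    cert.signature_hash_algorithm,
)
print("certificate verified:", cert.subject.rfc4514_string())
```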
[Diagram: a central Information System surrounded by clusters (C); each node is a Grid software service (like an http server).]
The Information System is the Central Nervous System of the Grid: the info system defines the grid.
Data Grid
[Diagram: the Information System (I.S.) now registers data servers (DS) alongside the clusters (C), plus a Data Management System (D.M.S.) that keeps track of the data; again, each node is a Grid software service.]
Computing Task Submission
[Diagram: the user sends proxy + command (+ data) to the Workload Management System (W.M.S.). The W.M.S. passes the job’s coarse requirements to the I.S., receives a list of candidate clusters, then gets fresh, detailed info on them before dispatching. A toy matchmaking sketch follows below.]
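A minimal sketch of that matchmaking step; the cluster records and requirement fields are invented for illustration, not the actual EDG information schema:

```python
# Toy matchmaking: filter clusters on coarse requirements from an
# information-system snapshot, then rank candidates on fresh detail.
clusters = [
    {"name": "nikhef", "cpus_free": 120, "os": "linux", "queue_wait_s": 30},
    {"name": "cern",   "cpus_free": 0,   "os": "linux", "queue_wait_s": 900},
    {"name": "ral",    "cpus_free": 40,  "os": "linux", "queue_wait_s": 120},
]

job = {"os": "linux", "min_cpus": 10}   # coarse requirements

# Step 1: coarse match against the info system -> candidate clusters.
candidates = [c for c in clusters
              if c["os"] == job["os"] and c["cpus_free"] >= job["min_cpus"]]

# Step 2: rank candidates using fresh, detailed info (here: queue wait).
best = min(candidates, key=lambda c: c["queue_wait_s"])
print("dispatch to:", best["name"])     # -> nikhef
```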
Computing Task Execution
[Diagram: the W.M.S. dispatches proxy + command (+ data) to the chosen cluster and reports to a logger. The job asks “Where is my data?”, finds the D.M.S. via the I.S., and the D.M.S. returns a list of best locations among the data servers (DS).]
Computing Task Execution (continued)
[Diagram: to store its results, the job asks “How to contact the O.D.S.?” and “Where do I put the data?”, ships proxy + data to a data server, registers the output with the D.M.S., and reports “Done” to the logger.]
Computing Task Submission with data
[Diagram: as before, the user sends proxy + command, now with a data file spec, to the W.M.S.; besides passing coarse requirements to the I.S. for candidate clusters and fresh, detailed info, the W.M.S. asks the D.M.S. to find copies of the data. A toy replica-lookup sketch follows below.]
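When the job carries a data file spec, replica locations can be folded into the ranking; a minimal sketch with an invented replica catalogue (real grids used a Replica Location Service for this mapping):

```python
# Toy replica catalogue: logical file name -> sites holding a copy.
replicas = {"lfn:run1234.raw": {"nikhef", "cern"}}

candidates = ["nikhef", "ral", "cern"]   # from matchmaking, as above
job_data = "lfn:run1234.raw"             # the job's data file spec

# Prefer clusters where the input data is already local, so the job
# reads from local disk instead of pulling it over the WAN.
with_data = [c for c in candidates if c in replicas.get(job_data, set())]
print("best locations:", with_data or candidates)   # -> ['nikhef', 'cern']
```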
Computing Task Execution with data
[Diagram: the D.M.S. returns a list of best locations, and the job lands on a cluster where “Where is my data?” resolves to local access to the data servers.]
Applications
D0 Run II Data Reprocessing
Biomedical Image Comparison
Ozone Profile Computation, Storage, & Access
Grid Research Topics
“Push” vs. “pull” model of operation (you just saw “push”)
We know how to do fine-grained priority & access control in “push”
We think “pull” is more robust operationally
Software distribution: HEP has BIG programs with many versions … “pre” vs. “zero” install
Data distribution: similar issues (software is just executable data!)
A definite bottleneck in D0 reprocessing (1 Gbit/100/2/2 = modem!)
Speedup limited to 92% of theoretical; about 1 hr of CPU lost per job (rough numbers below)
Optimization in general is hard, and is it worth it?
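A rough reading of those D0 numbers; the per-job CPU time is my assumption, chosen so that losing ~1 hour per job reproduces the quoted 92% efficiency:

```python
# If each job loses ~1 hour of CPU to the data-distribution
# bottleneck, a ~11.5 h job gives 92% of theoretical speedup.
# (The 11.5 h job length is an assumption that reproduces the slide.)
lost_hours = 1.0
cpu_hours = 11.5
efficiency = cpu_hours / (cpu_hours + lost_hours)
print(f"efficiency: {efficiency:.0%}")   # ~92%
```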
Didn’t talk about
Grid projects:
European Data Grid – ends this month, 10 M€ / 3y
VL-E – ongoing, 20 M€ / 5y
EGEE – starts in April, 30 M€ / 2y
Gridifying HEP software & computing activities
Transition from supercomputing to Grid computing (funding)
Integration with other technologies: submit, monitor, and control computing work from your PDA or mobile!
Conclusions
Grids are working now for workload management in HEP
Big developments in next two years with large-scale deployments of LCG & EGEE projects
Seems Grids are catching on (IBM / Globus alliance)
Lots of interesting stuff to try out!!!
What’s There Now?
Job Submission
Marriage of Globus, Condor-G, and the EDG Workload Manager; latest version reliable at the 99% level
Information System
New system (R-GMA): very good information model, implementation still evolving
Old system (MDS): poor information model, poor architecture, but hacks allow 99% “uptime”
Data Management
Replica Location Service: convenient and powerful system for locating and manipulating distributed data; mostly still user-driven (no heuristics)
Data Storage
“Bare gridFTP server”: reliable but mostly suited to disk-only mass store
SRM: no mature implementations yet