Information I intend to transfer
Why are Grids interesting? Grids are solutions, and solutions should solve a problem, so I will spend some time describing the problem and showing how Grids are relevant to it.
What are Computational Grids?
How are Grids being used, in HEP as well as in other fields?
What are some examples of current Grid “research topics”?
Our Problem
Place event info on 3D map
Trace trajectories through hits
Assign type to each track
Find particles you want
Needle in a haystack!
This is a “relatively easy” case
[Diagram: Data Handling and Computation for Physics Analysis – data flows from the detector through the event filter (selection & reconstruction) to raw data; reconstruction produces event summary data and processed data; event reprocessing and event simulation feed back into the chain; analysis objects (extracted by physics topic) serve batch and interactive physics analysis.]
Computational Aspects
To reconstruct and analyze 1 event takes about 90 seconds
Maybe only a few out of a million are interesting. But we have to check them all!
Analysis program needs lots of calibration, determined by inspecting results of a first pass.
Each event will be analyzed several times!
[Diagram: one of the four LHC detectors feeding the online system – a multi-level trigger that filters out background and reduces the data volume before data recording & offline analysis.]
40 MHz (40 TB/sec) → level 1 (special hardware) → 75 kHz (75 GB/sec) → level 2 (embedded processors) → 5 kHz (5 GB/sec) → level 3 (PCs) → 100 Hz (100 MB/sec) → data recording & offline analysis
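A quick back-of-the-envelope check of the trigger chain's rejection power; a minimal sketch in Python, using the rates from the diagram (the ~1 MB/event size is inferred from 100 Hz ≈ 100 MB/sec):

```python
# Rough check of the LHC multi-level trigger reduction, using the
# rates from the diagram above. Dict order matters (Python 3.7+).
rates_hz = {
    "into level 1": 40e6,
    "after level 1": 75e3,
    "after level 2": 5e3,
    "after level 3": 100.0,
}

stages = list(rates_hz.items())
for (stage_in, r_in), (stage_out, r_out) in zip(stages, stages[1:]):
    print(f"{stage_out}: {r_in / r_out:,.0f}x reduction")

# Overall: 40 MHz in, 100 Hz recorded -> a 400,000x rejection factor.
print(f"overall: {stages[0][1] / stages[-1][1]:,.0f}x")
```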
Computational Implications (2)
90 seconds per event to reconstruct and analyze
100 incoming events per second
To keep up, need either:
a computer that is nine thousand times faster, or
nine thousand computers working together
Moore’s Law: wait 20 years and computers will be 9000 times faster (we need them in 2007!)
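The slide's arithmetic checks out; a minimal sketch (90 s/event and 100 events/s are from the slides; the 18-month doubling period is the usual Moore's-law folklore, my assumption):

```python
import math

seconds_per_event = 90    # reconstruction + analysis time (from slide)
events_per_second = 100   # rate out of the trigger (from slide)

cpus_needed = seconds_per_event * events_per_second
print(f"CPUs needed to keep up: {cpus_needed}")   # 9000

# How long until one CPU is 9000x faster, assuming a
# performance doubling every 18 months?
doublings = math.log2(cpus_needed)
print(f"wait: {doublings * 1.5:.0f} years")       # ~20 years
```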
More Computational Issues
Four LHC experiments – roughly 36k CPUs needed
BUT: accelerator not always “on” – need fewer
BUT: multiple passes per event – need more!
BUT: haven’t accounted for Monte Carlo production – more!!
AND: haven’t addressed the needs of “physics users” at all!
Large Dynamic Range Computing
Clear that we have enough work to keep somewhere between 50k and 100k CPUs busy
Most of the activities are “bursty”, so doing them with dedicated facilities is inefficient
Need some meta-facility that can serve all the activities:
Reconstruction & reprocessing
Analysis
Monte Carlo
All four LHC experiments
All the LHC users
LHC User Distribution
Putting all computers in one spot leads to traffic jams
Which spot is willing to pay for & maintain 100k CPUs?
Classic Motivation for Grids
Large Scales: 50k CPUs, petabytes of data (if we’re only talking ten machines, who cares?)
Large Dynamic Range: bursty usage patterns (why buy 25k CPUs if 60% of the time you only need 900? see the utilization sketch after this list)
Multiple user groups on a single system: can’t “hard-wire” the system for your purposes
Wide-area access requirements: users are not in the same lab, or even on the same continent
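A minimal utilization estimate for the bursty-usage point; the 25k/900/60% figures are from the slide, and the assumption that the farm is fully busy the other 40% of the time is mine (it makes the estimate generous):

```python
# Time-averaged utilization of a dedicated 25k-CPU farm under the
# bursty pattern quoted above. Assumes (generously) that during the
# remaining 40% of the time all 25k CPUs are busy.
cpus = 25_000
avg_load = 0.6 * 900 + 0.4 * cpus            # average CPUs in use
print(f"average utilization: {avg_load / cpus:.0%}")   # ~42%
```

Even under that generous assumption, more than half the dedicated hardware sits idle on average.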
Solution using Grids
Large Scales: assemble 50k+ CPUs and petabytes of mass storage; they don’t need to be in the same place!
Large Dynamic Range: when you need less than you have, others use your excess capacity; when you need more, you use theirs
Multiple user groups on a single system: “generic” grid software services (think web server here)
Wide-area access requirements: Public Key Infrastructure for authentication & authorization
Public Key Infrastructure – Single Login
Based on asymmetric cryptography: public/private key pairs
Need network of trusted Certificate Authorities (e.g. Verisign)
CAs issue certificates to users, computers, or services running on computers (typically valid for two years)
Certificates carry:
an “identity”, e.g. /O=dutchgrid/O=users/O=kvi/CN=Nasser Kalantar
a public key
the identity of the issuing CA
a digital signature (a transformation of the certificate text using the CA’s private key; sketched in code below)
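A minimal sketch of what a CA does, using Python's `cryptography` package (not period-accurate grid middleware): generate key pairs, bind the user's identity and public key into a certificate signed with the CA's private key, and verify that signature with the CA's public key. The distinguished name and two-year validity follow the slide; everything else is illustrative.

```python
from datetime import datetime, timedelta, timezone

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.x509.oid import NameOID

# Asymmetric cryptography: each party holds a public/private key pair.
ca_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
user_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# The "identity" from the slide, as an X.509 distinguished name.
subject = x509.Name([
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, "dutchgrid"),
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, "users"),
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, "kvi"),
    x509.NameAttribute(NameOID.COMMON_NAME, "Nasser Kalantar"),
])
issuer = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Demo CA")])

# The CA issues the certificate: identity + public key + issuing CA,
# signed with the CA's private key, valid for two years.
now = datetime.now(timezone.utc)
cert = (
    x509.CertificateBuilder()
    .subject_name(subject)
    .issuer_name(issuer)
    .public_key(user_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + timedelta(days=2 * 365))
    .sign(ca_key, hashes.SHA256())
)

# Anyone holding the CA's public key can check the signature
# (raises InvalidSignature if the certificate was tampered with).
ca_key.public_key().verify(
    cert.signature,
    cert.tbs_certificate_bytes,
    padding.PKCS1v15(),
    cert.signature_hash_algorithm,
)
print("certificate verified:", cert.subject.rfc4514_string())
```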
[Diagram: a central Information System surrounded by clusters (C); each node is a Grid software service (like an http server).]
The Information System is the Central Nervous System of the Grid: the info system defines the grid.
Data Grid
[Diagram: the Information System (I.S.) now registers data servers (DS) alongside the clusters (C), plus a Data Management System (D.M.S.) that keeps track of the data; again, each node is a Grid software service.]
Computing Task Submission
[Diagram: the user sends proxy + command (+ data) to the Workload Management System (W.M.S.). The W.M.S. passes the job’s coarse requirements to the I.S., receives a list of candidate clusters, then gets fresh, detailed info on them before dispatching. A toy matchmaking sketch follows below.]
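A minimal sketch of that matchmaking step; the cluster records and requirement fields are invented for illustration, not the actual EDG information schema:

```python
# Toy matchmaking: filter clusters on coarse requirements from an
# information-system snapshot, then rank candidates on fresh detail.
clusters = [
    {"name": "nikhef", "cpus_free": 120, "os": "linux", "queue_wait_s": 30},
    {"name": "cern",   "cpus_free": 0,   "os": "linux", "queue_wait_s": 900},
    {"name": "ral",    "cpus_free": 40,  "os": "linux", "queue_wait_s": 120},
]

job = {"os": "linux", "min_cpus": 10}   # coarse requirements

# Step 1: coarse match against the info system -> candidate clusters.
candidates = [c for c in clusters
              if c["os"] == job["os"] and c["cpus_free"] >= job["min_cpus"]]

# Step 2: rank candidates using fresh, detailed info (here: queue wait).
best = min(candidates, key=lambda c: c["queue_wait_s"])
print("dispatch to:", best["name"])     # -> nikhef
```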
Computing Task Execution
[Diagram: the W.M.S. dispatches proxy + command (+ data) to the chosen cluster and reports to a logger. The job asks “Where is my data?”, finds the D.M.S. via the I.S., and the D.M.S. returns a list of best locations among the data servers (DS).]
Computing Task Execution (continued)
[Diagram: to store its results, the job asks “How to contact the O.D.S.?” and “Where do I put the data?”, ships proxy + data to a data server, registers the output with the D.M.S., and reports “Done” to the logger.]
Computing Task Submission with data
[Diagram: as before, the user sends proxy + command, now with a data file spec, to the W.M.S.; besides passing coarse requirements to the I.S. for candidate clusters and fresh, detailed info, the W.M.S. asks the D.M.S. to find copies of the data. A toy replica-lookup sketch follows below.]
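When the job carries a data file spec, replica locations can be folded into the ranking; a minimal sketch with an invented replica catalogue (real grids used a Replica Location Service for this mapping):

```python
# Toy replica catalogue: logical file name -> sites holding a copy.
replicas = {"lfn:run1234.raw": {"nikhef", "cern"}}

candidates = ["nikhef", "ral", "cern"]   # from matchmaking, as above
job_data = "lfn:run1234.raw"             # the job's data file spec

# Prefer clusters where the input data is already local, so the job
# reads from local disk instead of pulling it over the WAN.
with_data = [c for c in candidates if c in replicas.get(job_data, set())]
print("best locations:", with_data or candidates)   # -> ['nikhef', 'cern']
```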
Computing Task Execution with data
[Diagram: the D.M.S. returns a list of best locations, and the job lands on a cluster where “Where is my data?” resolves to local access to the data servers.]
Applications
D0 Run II Data Reprocessing
Biomedical Image Comparison
Ozone Profile Computation, Storage, & Access
Grid Research Topics
“Push” vs. “pull” model of operation (you just saw “push”)
We know how to do fine-grained priority & access control in “push”
We think “pull” is more robust operationally
Software distribution: HEP has BIG programs with many versions … “pre” vs. “zero” install
Data distribution: similar issues (software is just executable data!)
A definite bottleneck in D0 reprocessing (1 Gbit/100/2/2 = modem!)
Speedup limited to 92% of theoretical; about 1 hr of CPU lost per job (rough numbers below)
Optimization in general is hard, and is it worth it?
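A rough reading of those D0 numbers; the per-job CPU time is my assumption, chosen so that losing ~1 hour per job reproduces the quoted 92% efficiency:

```python
# If each job loses ~1 hour of CPU to the data-distribution
# bottleneck, a ~11.5 h job gives 92% of theoretical speedup.
# (The 11.5 h job length is an assumption that reproduces the slide.)
lost_hours = 1.0
cpu_hours = 11.5
efficiency = cpu_hours / (cpu_hours + lost_hours)
print(f"efficiency: {efficiency:.0%}")   # ~92%
```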
Didn’t talk about
Grid projects:
European Data Grid – ends this month, 10 M€ / 3y
VL-E – ongoing, 20 M€ / 5y
EGEE – starts in April, 30 M€ / 2y
Gridifying HEP software & computing activities
Transition from supercomputing to Grid computing (funding)
Integration with other technologies: submit, monitor, and control computing work from your PDA or mobile!
Conclusions
Grids are working now for workload management in HEP
Big developments in next two years with large-scale deployments of LCG & EGEE projects
Seems Grids are catching on (IBM / Globus alliance)
Lots of interesting stuff to try out!!!
What’s There Now?
Job Submission
Marriage of Globus, Condor-G, and the EDG Workload Manager; latest version reliable at the 99% level
Information System
New system (R-GMA): very good information model, implementation still evolving
Old system (MDS): poor information model, poor architecture, but hacks allow 99% “uptime”
Data Management
Replica Location Service: convenient and powerful system for locating and manipulating distributed data; mostly still user-driven (no heuristics)
Data Storage
“Bare gridFTP server”: reliable but mostly suited to disk-only mass store
SRM: no mature implementations yet