The LHC Computing Grid – February 2008
CERN’s use of gLite
Dr Markus Schulz
LCG Deployment Leader
24 April 2008
4th Black Forest Grid Workshop
Outline
• CERN
• LHC – the computing challenge
  – Data rate, computing, community
• Grid projects @ CERN – WLCG, EGEE
• gLite middleware
  – Code base
  – Software life cycle
• Outlook and summary
The LHC Computing Challenge
• Signal/Noise: 10^-9
• Data volume: high rate × large number of channels × 4 experiments
  → 15 PetaBytes of new data each year
• Compute power: event complexity × number of events × thousands of users
  → 200k of today's fastest CPUs
• Worldwide analysis & funding: computing funded locally in major regions & countries
  → efficient analysis everywhere → GRID technology
Flow to the CERN Computer Center
(Diagram: data flows from the experiments into the CERN computer centre over multiple dedicated 10 Gbit links.)
Flow out of the center
LHC Computing Grid project (LCG)
• Dedicated 10 Gbit links between the T0 and each T1 centre

LCG Service Hierarchy
Tier-0: the accelerator centre
• Data acquisition & initial processing
• Long-term data curation
• Distribution of data to the Tier-1 centres

Tier-1 centres:
Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)

Tier-1: "online" to the data acquisition process, high availability
• Managed mass storage – grid-enabled data service
• Data-heavy analysis
• National and regional support

Tier-2: ~200 centres in ~35 countries
• Simulation
• End-user analysis – batch and interactive
LHC DATA ANALYSIS
HEP code key characteristics:
• modest memory requirements (~2 GB/job)
• performs well on PCs
• independent events → trivial parallelism
• large data collections (TB → PB), shared by very large user collaborations

For all four experiments:
• ~15 PetaBytes per year
• ~200k processor cores
• > 6,000 scientists & engineers
LHC Computing Multi-science
• 1999 – MONARC project: first LHC computing architecture – hierarchical distributed model
• 2000 – growing interest in grid technology; the HEP community was the main driver in launching the DataGrid project
• 2001-2004 – EU DataGrid project: middleware & testbed for an operational grid
• 2002-2005 – LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments
• 2004-2006 – EU EGEE project phase 1: starts from the LCG grid; shared production infrastructure; expanding to other communities and sciences
• 2006-2008 – EU EGEE project phase 2: expanding to other communities and sciences; scale and stability; interoperations/interoperability
• 2008-2010 – EU EGEE project phase 3: more communities; efficient operations; less central coordination
WLCG Collaboration
• The Collaboration
  – 4 LHC experiments
  – ~250 computing centres
  – 12 large centres (Tier-0, Tier-1)
  – 38 federations of smaller "Tier-2" centres
  – Growing to ~40 countries
  – Grids: EGEE, OSG, NorduGrid
• Technical Design Reports
  – WLCG, 4 experiments: June 2005
• Memorandum of Understanding
  – Agreed in October 2005
• Resources
  – 5-year forward look
• Relies on EGEE and OSG – and other regional efforts such as NDGF
The EGEE project
• EGEE
  – Started in April 2004, now in second phase with 91 partners in 32 countries
  – 3rd phase (2008-2010) starts next month
• Objectives
  – Large-scale, production-quality grid infrastructure for e-Science
  – Attracting new resources and users from industry as well as science
  – Maintain and further improve the "gLite" Grid middleware
Registered Collaborating Projects
• Applications – improved services for academia, industry and the public
• Support Actions – key complementary functions
• Infrastructures – geographical or thematic coverage
25 projects had registered as of September 2007 (see web page).
Collaborating infrastructures
Virtual Organizations
(Histograms: distribution of the Virtual Organizations by number of members and by number of supporting sites, together with the CPUs and storage available to them.)
Total users: 5034; affected people: 10,200; median members per VO: 18
Total VOs: 204; registered VOs: 116; median sites per VO: 3
Archeology, Astronomy, Astrophysics, Civil Protection, Computational Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, …
> 250 sites, 48 countries, > 50,000 CPUs, > 20 PetaBytes, > 10,000 users, > 200 VOs, > 150,000 jobs/day
Sustainability
• Need to prepare for a permanent Grid infrastructure in Europe and the world
• Ensure a high quality of service for all user communities
• Independent of short project funding cycles
• Infrastructure managed in collaboration with National Grid Initiatives (NGIs)
• European Grid Initiative (EGI)
• Future of projects like OSG, NorduGrid, …?
For more information:
Thank you for your kind attention!
www.cern.ch/lcg www.eu-egee.org
www.eu-egi.org/
www.gridcafe.org
www.opensciencegrid.org
Summary of Computing Resource Requirements – all experiments, 2008 (from LCG TDR, June 2005)

                       CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)      25            56            61     142
Disk (PetaBytes)          7            31            19      57
Tape (PetaBytes)         18            35             –      53
(Chart: planned LHC CPU capacity in MSI2K, 2001-2010, split between CERN, Tier-1 and Tier-2.)
LHC Computing Requirements
Grid Activity
• Ramp-up needed over the next 8 months: factor 5
• 170k jobs/day; 16,000 kSpecInt years
• WLCG ran ~44 M jobs in 2007 – the workload has continued to increase, now at ~165k jobs/day
October 2007 – CPU Usage
• > 85% of CPU usage is external to CERN
(Pie chart of CPU usage by CERN and the Tier-2s; * NDGF usage is for September 2007.)
Tier-2 Sites – October 2007
• 30 sites deliver 75% of the CPU
• 30 sites deliver 1%
2007 – CERN Tier-1 Data Distribution
Average data rate per day by experiment (Mbytes/sec)
1.5 Gbyte/s
Site Reliability
Tier-2 sites: 83 Tier-2 sites being monitored

Targets – CERN + Tier-1s:
               Before July   July 07   Dec 07   Avg. last 3 months
Each site              88%       91%      93%                  89%
8 best sites           88%       93%      95%                  93%
Grid Computing at CERN
• Core grid infrastructure services (~300 nodes)
  – CA, VOMS servers, monitoring hosts, information system, testbeds
• Grid catalogues
  – Using ORACLE clusters as backend DB; 20 instances
• Workload management nodes
  – 16 RBs, 15 WMS (different flavours, not all fully loaded)
  – 22 CEs (for headroom)
• Worker nodes (limited by power: < 2.5 MW)
  – LSF-managed cluster
  – 16,000 cores, currently adding 12,000 cores (2 GB/core)
  – We use node disks only as scratch space and for the OS installation
• Extensive use of fabric management
  – Quattor for installation and configuration, Lemon + LEAF for fabric monitoring
Grid Computing at CERN
• Storage (CASTOR-2)
  – Disk caches: 5 PB (20k disks); by mid-2008 an additional 12k disks (16 PB)
  – Linux boxes with RAID disks
  – Tape storage: 18 PB (~30k cartridges), 700 GB/cartridge
  – We have to add 10 PB this year (the robots can be extended)
  – Why tapes?
    • still 3 times lower system cost
    • long-term stability is well understood
    • the gap is closing
• Networking
  – T0 → T1: dedicated 10 Gbit links
  – CIXP Internet exchange point for links to T2s
  – Internal: 10 Gbit infrastructure
www.glite.org
gLite Middleware Distribution
• Combines components from different providers
  – Condor and Globus (via VDT)
  – LCG
  – EDG/EGEE
  – Others
• After prototyping phases in 2004 and 2005, convergence with the LCG-2 distribution was reached in May 2006
  – gLite 3.0
  – gLite 3.1 (2007)
• Focus on providing a deployable middleware distribution for the EGEE production service
(Timeline: the LCG-2 and gLite prototyping lines during 2004-2005 converge into the gLite 3.0 product in 2006.)
gLite Services
gLite offers a range of services
Middleware structure
• Applications have access both to Higher-Level Grid Services and to Foundation Grid Middleware
• Higher-Level Grid Services are meant to help users build their computing infrastructure, but should not be mandatory
• Foundation Grid Middleware will be deployed on the EGEE infrastructure
  – Must be complete and robust
  – Should allow interoperation with other major grid infrastructures
  – Should not assume the use of Higher-Level Grid Services

Foundation Grid Middleware: security model and infrastructure; Computing (CE) and Storage Elements (SE); accounting; information and monitoring
Higher-Level Grid Services: workload management; replica management; visualization; workflow; grid economies; …
Applications
Overview paper http://doc.cern.ch//archive/electronic/egee/tr/egee-tr-2006-001.pdf
Workload Management (compact)
(Diagram of the submission chain and its scale: desktops; a few (~50) central nodes; 1-20 per site; 1-24,000 per site.)
Software Process
• Introduced a new software lifecycle process
  – Based on the gLite process and LCG-2 experience
  – Components are updated independently
  – Updates are delivered on a weekly basis to the PPS
    • Each week either a gLite 3.1 or a gLite 3.0 update (if needed)
    • Moved after 2 weeks to production
  – Acceptance criteria for new components
  – Clear link between component versions, patches and bugs
    • Semi-automatic release notes
  – Clear prioritization by stakeholders
    • TCG for medium-term (3-6 months) and EMT for short-term goals
  – Clear definition of roles and responsibilities
  – Documented in MSA3.2
  – In use since July 2006
Component based process
(Illustration of a component-based release process: components A, B and C are built, integrated and certified independently and shipped in regular updates (Update 1-4) at a fixed release interval.)
gLite code base
gLite code details
gLite code details
(Chart: code size per component; scale markers at 1K, 2K, 5K and 10K.)
gLite code details
(Chart continued.) The list is not complete: some components are provided as binaries and are only packaged by the ETICS system.
Complex Dependencies
Data Management
Dependency Challenge
We spent significant resources on streamlining dependencies; the goal is improved portability of the code.
Code Stability
• Since mid-2006 we have processed more than 500 patches (changes) to the gLite middleware
• Covering two release branches
(Chart: number of patches per month from 06/2006 to 01/2008, for the gLite 3.0 and gLite 3.1 branches.)
The process is monitored
• To spot problems and manage resources
Code Stability: Bugs
• Bugs are reported at an almost constant rate
Code Stability
• We are improving
Experience Gained
• Middleware is almost irrelevant
  – As long as some minimal functionality is provided
    • Authentication, authorization, accounting, robustness, scalability
    • Leave complex services to the user communities
• Distributed operations is the challenge
  – Policies, security, monitoring, user support, …
  – Upgrades take months
(Chart: number of tickets processed by GGUS per month, Apr-06 to Mar-07, by category – operations (CoD), network (ENOC), others, all – about 1300 tickets/month.)
Experience Gained
• We are victims of our own success
  – Moved prototypes into production very early
    • Complex software stack
  – Didn't understand the right split (lightweight WNs)
    • Hard to port to other platforms
  – Since we have users, we can only slowly migrate to new approaches (standards)
• Interoperability/interoperation is absolutely a key area
• All middleware that is sufficiently fit will survive
  – OSG 100%
  – Prototypes: ARC, Unicore, NAREGI
• Standards have to follow practice
  – OGF only recently agreed to this concept
Summary
• The core of a grid production infrastructure for LHC computing is available
• In the next two years the focus will be on:
  – Scaling (jobs and data management)
    • Handle chaotic use cases (analysis)
    • Data management for analysis
  – Monitoring to improve:
    • Availability and reliability
    • More automation (reduce operations cost)
  – Interoperation
• Lessons learned:
  – Operations is the hardest part of grid computing
  – A strict, monitored software lifecycle is essential
  – Keep the software as simple as possible
  – You can't monitor too much
www.glite.org Middleware components
Standards
• EGEE needs to interoperate with other infrastructures:
  – To provide users with the ability to access resources available on collaborating infrastructures
• The best solution is to have common interfaces, through the development and adoption of standards
• The gLite reference forum for standardization activities is the Open Grid Forum
  – Many contributions (e.g. OGSA-AUTH, BES, JSDL, new GLUE-WG, UR, RUS, SAGA, INFOD, NM, …)
• Problems:
  – Infrastructures are already in production
  – Standards are still in evolution and often underspecified
• OGF-GIN follows a pragmatic approach
  – A balance between application needs and technology push
Authentication
• gLite authentication is based on X.509 PKI
  – Certificate Authorities (CAs) issue (long-lived) certificates identifying individuals (much like a passport)
    • Commonly used in web browsers to authenticate to sites
  – Trust between CAs and sites is established (offline)
  – To reduce vulnerability, user identification on the Grid is done using (short-lived) proxies of their certificates
• Support for Short-Lived Credential Services (SLCS)
  – Issue short-lived certificates or proxies to their local users, e.g. from Kerberos or from Shibboleth credentials (new in EGEE-II)
• Proxies can
  – Be delegated to a service so that it can act on the user's behalf
  – Be stored in an external proxy store (MyProxy)
  – Be renewed (in case they are about to expire)
  – Include additional attributes
Authorization
• VOMS is now a de-facto standard
  – Attribute certificates provide users with additional capabilities defined by the VO
  – Basis for the authorization process
• Authorization: currently via mapping to a local user on the resource
  – glexec changes the local identity (based on suexec from Apache)
• An authorization service with a common interface, agreed with multiple partners, is being designed
  – Uniform implementation of authorization in gLite services
  – Easier interoperability with other infrastructures
  – Prototype being prepared now
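The mapping step itself is simple in principle. A minimal sketch of FQAN-based mapping to local (pool) accounts, with entirely hypothetical VO names, groups and accounts (the real work is done by LCAS/LCMAPS and glexec from the site configuration):

```python
# Illustrative sketch of mapping a VOMS FQAN to a local account.
# FQANs, groups and pool accounts below are hypothetical, not a real site setup.

FQAN_MAP = {
    "/myvo/production/Role=lcgadmin": "myvosgm",       # dedicated software-manager account
    "/myvo/production":               ("myvoprd", 20), # pool of 20 production accounts
    "/myvo":                          ("myvo", 100),   # generic pool for plain VO members
}
_pool_counters = {}

def map_fqan(fqan):
    """Return the local account for the most specific matching FQAN prefix."""
    for prefix in sorted(FQAN_MAP, key=len, reverse=True):
        if fqan == prefix or fqan.startswith(prefix + "/"):
            entry = FQAN_MAP[prefix]
            if isinstance(entry, str):
                return entry
            base, size = entry
            n = _pool_counters.get(base, 0) % size + 1
            _pool_counters[base] = n
            return f"{base}{n:03d}"
    raise PermissionError(f"no mapping for {fqan}")

print(map_fqan("/myvo/production/Role=lcgadmin"))   # -> myvosgm
print(map_fqan("/myvo/users"))                      # -> myvo001
```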
Common AuthZ interface
(Diagram: a common SAML-XACML interface and shared library connect the EGEE site-central authorization services (LCAS + LCMAPS with plug-ins, GPBox) and the OSG ones (GUMS, SAZ, Prima + gPlazma) to the enforcement points: glexec for pilot jobs on worker nodes (both EGEE and OSG), the pre-WS and GT4 gatekeeper/gridftp/opensshd, dCache and CREAM. The SAML-XACML query returns an obligation such as the pool account and group to map the user to.)
Information Schema
• gLite uses the GLUE schema (version 1.3)
  – Abstract modelling of Grid resources and mapping to concrete schemas that can be used in Grid information services
  – The definition of this schema started in April 2002 as a collaborative effort between the EU-DataTAG and US-iVDGL projects
• The GLUE schema is now an official activity of OGF
  – Starting points are the GLUE schema 1.3, the NorduGrid schema and CIM (used by NAREGI)
  – Will produce the GLUE 2.0 specification
Information System Architecture
(Diagram: information providers feed resource-level BDIIs, which are aggregated by site BDIIs and then by top-level BDIIs; clients query the top BDIIs through a DNS round-robin alias, with FCR (Freedom of Choice for Resources) applied at the top level.)
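Since every BDII is an OpenLDAP server publishing GLUE data (port 2170 and the `o=grid` base are the usual conventions), the hierarchy can be queried with ordinary LDAP tools. A small sketch using the python-ldap package; the host name is a placeholder for a real top-level BDII:

```python
# Illustrative sketch: ask a top-level BDII for the computing elements it
# publishes in the GLUE 1.3 schema. Requires the python-ldap package;
# the BDII host name below is a placeholder.
import ldap

BDII_URI = "ldap://top-bdii.example.org:2170"   # hypothetical top-level BDII
BASE_DN = "o=grid"

conn = ldap.initialize(BDII_URI)
results = conn.search_s(
    BASE_DN,
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",                      # all computing elements
    ["GlueCEUniqueID", "GlueCEStateWaitingJobs"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    waiting = attrs.get("GlueCEStateWaitingJobs", [b"?"])[0].decode()
    print(f"{ce}  waiting jobs: {waiting}")
```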
Performance Improvements
(Charts: BDII response time versus number of parallel requests for SLC3, SLC4 and quad-core SLC4 hosts, and, on a log scale, for an indexed versus a non-indexed database backend.)
EGEE Data Management
(Diagram: VO frameworks and user tools sit on top of lcg_utils and FTS; these use GFAL for cataloguing (LFC, formerly RLS), storage (SRM, classic SE) and data transfer (gridftp, RFIO), configured via the information system and environment variables; vendor-specific APIs remain available underneath.)
Storage Element
• Storage Resource Manager (SRM)
  – A standard that hides the storage system implementation (disk or active tape)
  – Handles authorization
  – Web service based on gSOAP
  – Translates SURLs (Storage URLs) into TURLs (Transfer URLs)
  – Disk-based: DPM, dCache, StoRM, BeStMan; tape-based: CASTOR, dCache
  – SRM 2.2
    • Space tokens (manage space by VO/user), advanced authorization
    • Better handling of directories, lifetimes, and more
• File I/O: POSIX-like access from local nodes or the grid via GFAL (Grid File Access Layer)
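The chain a client walks through when reading a grid file can be sketched as follows. The functions are stand-ins for what the catalogue, the SRM and GFAL do underneath, not the real gLite APIs; lcg_utils and GFAL hide exactly these steps:

```python
# Conceptual sketch of the resolution chain behind a grid file read:
# LFN -> replica SURLs (catalogue) -> TURL (SRM) -> posix-like open (GFAL).
# All functions are illustrative stand-ins, not the real LFC/SRM/GFAL APIs.

def lookup_replicas(lfn):
    """Catalogue step: resolve a logical file name to the SURLs of its replicas."""
    return ["srm://se.example.org/dpm/example.org/home/myvo/data/run1.root"]

def srm_get_turl(surl, protocol="rfio"):
    """SRM step: ask the storage element to prepare the file and return a TURL."""
    return surl.replace("srm://", protocol + "://", 1)

def open_grid_file(lfn):
    """GFAL-like step: pick a replica, obtain a TURL, open it with the local protocol."""
    for surl in lookup_replicas(lfn):
        turl = srm_get_turl(surl)
        print("would open", turl, "with posix-like I/O")
        return turl
    raise FileNotFoundError(lfn)

open_grid_file("lfn:/grid/myvo/data/run1.root")
```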
SRM basic and use-case tests
Storage Element – DPM
• Disk Pool Manager (DPM)
  – Manages storage on one or more disk servers
  – Uses the LFC as its local catalogue, with the same features for role-based ACLs, etc.
  – Namespace of the form /dpm/<domain>/home/<vo>/…
• Direct data transfer from/to the disk servers (no bottleneck through the head node)
• External transfers via gridFTP (the de-facto standard)
• Target: small to medium sites
(Diagram: a client contacts the DPM head node, which authorizes the request (uid, gid) and redirects the data transfer directly to the DPM disk servers.)
LCG File Catalog
• The LFC maps LFNs to SURLs
  – Logical File Name (LFN): user-level file name in the VO namespace; aliases are supported
  – Globally Unique IDentifier (GUID): unique string assigned by the system to the file
  – Site URL (SURL): identifies a replica – a Storage Element and the logical name of the file inside it
• GSI security: ACLs (based on VOMS)
  – Each VOMS group/role corresponds to a virtual group identifier
  – Support for secondary groups
• Web service query interface: Data Location Interface (DLI)
• Hierarchical namespace (e.g. lfc-ls -l /grid/vo/, lfc-getacl /grid/vo/data)
• Supports sessions and bulk operations
(Diagram: two LFNs alias one GUID, which points to SURL 1 and SURL 2 and carries an ACL.)
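The underlying data model can be sketched with a few records; the field names and ACL encoding below are hypothetical illustrations, not the actual LFC schema:

```python
# Illustrative sketch of the LFC data model: one GUID per file, several LFN
# aliases, several replicas (SURLs), and ACLs expressed against VOMS groups.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    guid: str                                   # system-assigned unique identifier
    lfns: list = field(default_factory=list)    # user-visible names / aliases
    surls: list = field(default_factory=list)   # one SURL per replica
    acl: dict = field(default_factory=dict)     # VOMS group/role -> permissions

entry = CatalogEntry(
    guid="3f2a6c1e-0000-4000-8000-123456789abc",
    lfns=["/grid/myvo/data/run1.root", "/grid/myvo/aliases/best-run.root"],
    surls=[
        "srm://se1.example.org/dpm/example.org/home/myvo/run1.root",
        "srm://se2.example.org/castor/example.org/myvo/run1.root",
    ],
    acl={"/myvo/analysis": "r", "/myvo/production/Role=prd": "rw"},
)

# Resolving either alias yields the same GUID and the same set of replicas.
print(entry.guid, "->", len(entry.surls), "replicas")
```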
Encrypted Data Storage
• Intended for VOs with very strong security requirements, e.g. the medical community
  – Anonymity (patient data is kept separate)
  – Fine-grained access control (only selected individuals)
  – Privacy (even the storage administrator cannot read the data)
• Interface to DICOM (Digital Imaging and COmmunications in Medicine)
• Hydra keystore
  – Stores the keys for data encryption
  – N instances; decryption works with a subset of the stores
(Diagram: a DICOM trigger retrieves an image, stores the encrypted image and ACL on an SE (DPM via SRM/gridftp), the keys and ACL in Hydra, and the patient data and ACL in AMGA; a client then (1) looks up the patient, (2) retrieves the keys, (3) gets a TURL and (4) reads the image.)
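The "N instances, a subset suffices for decryption" property is what a threshold secret-sharing scheme provides. A toy sketch of Shamir-style sharing over a prime field illustrates the idea; this is only an illustration of the principle, not the Hydra implementation:

```python
# Toy illustration of threshold key splitting: the key is split into n shares
# and any k of them reconstruct it (Shamir secret sharing over a prime field).
import random

P = 2**127 - 1   # a Mersenne prime, large enough for a 16-byte key

def split(secret, n, k):
    """Split 'secret' into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x=0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

key = random.randrange(P)        # stand-in for a data-encryption key
shares = split(key, n=3, k=2)    # three keystores, any two suffice
assert reconstruct(shares[:2]) == key
assert reconstruct(shares[1:]) == key
```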
File Transfer Service
• FTS: reliable, scalable and customizable file transfer
  – Multi-VO service, used to balance usage of site resources according to the SLAs agreed between a site and the VOs it supports
  – WS interface, support for different user and administrative roles (VOMS)
  – Manages transfers through channels: mono-directional network pipes between two sites
  – File transfers are handled as jobs
    • Prioritization
    • Retries in case of failures
  – Automatic discovery of services
• Designed to scale up to the transfer needs of very data-intensive applications
  – Demonstrated about 1 GB/s sustained
  – Over 9 petabytes transferred in the last 6 months (> 10 million files)
FTS server architecture
• All components are decoupled from each other
  – Each interacts only with the (Oracle) database
• Experiments interact via a web service
• VO agents do VO-specific operations (one per VO)
• Channel agents do channel-specific operations (e.g. the transfers)
• Monitoring and statistics can be collected via the DB
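The design point that everything is decoupled and only meets in the database can be sketched in a few lines; the table and state names here are invented for illustration, not the real FTS schema:

```python
# Illustrative sketch of the FTS architecture: a front end inserts transfer
# jobs into the database, a channel agent polls for pending ones and marks
# them done. Table and state names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transfers (id INTEGER PRIMARY KEY, src TEXT, dst TEXT, state TEXT)")

def submit(src, dst):
    """What the web-service front end would do."""
    db.execute("INSERT INTO transfers (src, dst, state) VALUES (?, ?, 'Pending')", (src, dst))

def channel_agent_pass():
    """What a channel agent would do on each polling cycle."""
    rows = db.execute("SELECT id, src, dst FROM transfers WHERE state = 'Pending'").fetchall()
    for tid, src, dst in rows:
        print(f"transferring {src} -> {dst}")
        db.execute("UPDATE transfers SET state = 'Done' WHERE id = ?", (tid,))

submit("srm://t0.example.org/file1", "srm://t1.example.org/file1")
channel_agent_pass()
print(db.execute("SELECT id, state FROM transfers").fetchall())
```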
High-Level Services: Catalogues
• File catalogues
  – LFC from LCG
    • Fully VOMS-aware
    • Read-only replicas
    • MySQL and ORACLE backends
  – Hydra: stores keys for data encryption
    • Being interfaced to GFAL (done by December 2007)
    • Currently only one instance, but in the future there will be 3 instances: at least 2 need to be available for decryption
    • Not yet certified in gLite 3.0; certification will start soon
  – AMGA metadata catalogue: generic metadata catalogue
    • Joint JRA1-NA4 (ARDA) development; used mainly by Biomed
    • Released for Postgres and ORACLE
Job Management Services
(Diagram: the user interface submits jobs to the Workload Management System with Logging & Bookkeeping, which queries the Information System and the File and Replica Catalogs, checks the Authorization Service, updates the user credential, discovers services, and submits to a Computing Element at site X; the CE and the Storage Element publish their state to the Information System, and the user retrieves the output.)
Workload Management System
• Workload Management System (WMS)
  – Assigns jobs to resources according to user requirements, possibly including data location and user-defined ranking of resources
  – Handles I/O data (input and output sandboxes)
  – Supports compound jobs and workflows (Directed Acyclic Graphs)
    • One-shot submission of a group of jobs with a shared input sandbox
  – Has a web service interface: WMProxy
    • UI → WMS decoupled from WMS → CE
  – Supports automatic re-submission
• Logging & Bookkeeping
  – Tracks jobs while they are running
• Job Provenance
  – Stores and retains data on finished jobs
  – Provides data-mining capabilities
  – Allows job re-execution
  – Prototype!
(Chart: throughput of 15,000-20,000 jobs/day.)
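Jobs are described to the WMS in JDL. A minimal example of the kind of description a user submits, written out from Python; the file names and the requirement expression are illustrative values:

```python
# Illustrative sketch: a minimal JDL (Job Description Language) document,
# composed as a string. Attribute values are example data.
jdl = """\
Executable          = "run_analysis.sh";
Arguments           = "run1.root";
StdOutput           = "std.out";
StdError            = "std.err";
InputSandbox        = {"run_analysis.sh"};
OutputSandbox       = {"std.out", "std.err", "histos.root"};
VirtualOrganisation = "myvo";
Requirements        = other.GlueCEStateStatus == "Production";
Rank                = -other.GlueCEStateEstimatedResponseTime;
"""

with open("analysis.jdl", "w") as f:
    f.write(jdl)
print(jdl)   # this file would then be handed to the WMProxy submission tools
```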
Resource Access
• Currently the most critical element for interoperation
  – Many implementations
  – The standard is new and underspecified: the Basic Execution Service (BES) document was published on 28/8/2007
• EGEE needs to support the applications running on the infrastructure
  – Support of the legacy Globus pre-WS service: LCG-CE
    • Now being certified on SL4
• To improve interoperability with other infrastructures, a new WS-I Compute Element has been developed
  – CREAM offers direct access to the resource via WSDL and a CLI
  – Supports JSDL
  – Collaboration with OMII-EU to implement a BES-compliant interface
    • See demos at SC06 (Tampa) and a future demo at SC07 (Reno)
Compute Element(s)
• LCG-CE (GT4 pre-WS GRAM)
  – Accessed via the Globus client or Condor-G
  – Globus code not modified
• Condor-C (was the gLite CE)
  – Accessed via Condor-G
  – Maintained by Condor; VDT includes BLAH & glexec
• CREAM (WS-I)
  – Accessed via the CREAM and BES clients, ICE, and Condor-G (prototype)
(Diagram: users and the gLite WMS reach the batch system at a site either through the LCG-CE (GT4 plus add-ons, in production), through Condor-C/BLAH, or through CREAM/CEMon (existing prototype); all CEs plug into the EGEE authorization, information system (GIP) and accounting.)
CREAM status
• CREAM passed the EGEE acceptance tests
  – > 90,000 jobs in 8 days, submitted by 50 simultaneous users
  – 111 failures (all LSF errors)
  – Load phase: ~1k jobs/hour; run phase: ~10k jobs/day; ~6k jobs on the CE at any time
• All missing features (needed to work on the EGEE infrastructure) are being implemented now
• Certification started in April 2008
Coming: support for pilot jobs
• Several VOs submit pilot jobs with a single identity for the whole VO
  – The pilot job fetches the user job when it arrives on the WN and executes it
    • Just-in-time scheduling; VO policies implemented at the central queue
• Use the same identity-changing mechanism as on the Computing Element also on the worker nodes (glexec)
  – The site may then know the identity of the real user
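The pilot pattern itself is a small loop; the sketch below is purely illustrative (the task queue and job format are invented), with the identity switch that glexec performs on a real worker node reduced to a print statement:

```python
# Illustrative sketch of a pilot job: started under the VO's pilot identity,
# it pulls real user jobs from a VO-central queue and runs each one.
# On a real WN the identity switch is performed by glexec; here it is only noted.
import subprocess

def fetch_next_user_job():
    """Stand-in for pulling a task from the VO's central task queue."""
    return {"user_dn": "/DC=org/DC=example/CN=Some User",
            "command": ["echo", "hello from the user job"]}

def run_as_user(job):
    print("identity switch for", job["user_dn"], "(done by glexec on a real WN)")
    subprocess.run(job["command"], check=True)

job = fetch_next_user_job()
run_as_user(job)
```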
Why Interoperability Matters
A large number of batch systems and CEs:
(Diagram: batch systems such as PBS/Torque, LSF, Condor, LoadLeveler and Sun Grid Engine are fronted by different CE interfaces – GRAM v2, GRAM v4, ARC, CREAM, NAREGI, Unicore – across the grids EGEE, OSG, NorduGrid, NAREGI, DEISA and TeraGrid.)
Why?
• We required common interfaces
  – We now have multiple "common" interfaces
• We tried to solve one problem, but created another
• Reasons:
  – The infrastructures were developed independently
  – Initially there were no standards
  – Standards take time to mature, but we need to build the infrastructures now
  – Good standards require experience and experimentation with different approaches
Select Strategy
• Long-term solution
  – Common interfaces
  – Standards
• Medium-term solutions
  – Gateways
  – Adaptors and translators
• Short-term solutions
  – Parallel infrastructures (user driven or site driven)
Interoperability models
(Diagram of the interoperability models: adaptors and translators (an API with plug-ins), user-driven parallel infrastructures, site-driven parallel infrastructures, and gateways.)
Example Interoperations Activity
• November 2004
  – Initial meeting with OSG to discuss interoperation
  – Use a common schema: GLUE v1.2
• January 2005
  – Proof of concept demonstrated
    • Modifications to the software releases
    • Interoperability achieved
• August 2005
  – Month of focussed activity on operations issues
  – First OSG site available
• November 2005
  – First user jobs from GEANT4 arrived on OSG sites
• March 2006
  – Operations progress: information system bootstrapping, trouble tickets, operations VO, …
• Summer 2006
  – CMS successfully taking advantage of interoperations – without being aware of it!
• Summer 2007
  – Joining software certification testbeds, to ensure interoperability is maintained
Grid Interoperability Now
• Building upon the many bi-lateral activities
• Started at GGF-16 (now OGF) in February 2006
• Demonstrate what we can for SC 2006
  – Applications, security, job management
  – Information systems, data management
GIN Information System
(Diagram: a generic information provider with per-grid providers for EGEE, OSG, NDGF, NAREGI, TeraGrid and PRAGMA feeds translators that convert each grid's information into the GLUE schema, which is then published through the GIN BDII (and an ARC BDII) for the participating sites and grids.)
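A translator in this setup just rewrites a record from one information schema into GLUE attributes before it is loaded into the GIN BDII. A toy sketch; the source field names are invented, the GLUE attribute names follow the GLUE 1.x schema:

```python
# Illustrative sketch of an information-system translator: a record published
# in some other schema is rewritten into GLUE attributes. The source fields
# are invented; the GLUE attribute names follow the GLUE 1.x schema.
def to_glue(native_record):
    return {
        "GlueCEUniqueID": "{}:2119/jobmanager-{}".format(
            native_record["head_node"], native_record["lrms"]),
        "GlueCEInfoTotalCPUs": str(native_record["cpus"]),
        "GlueCEStateWaitingJobs": str(native_record["queued"]),
    }

native = {"head_node": "ce.othergrid.example.org", "lrms": "pbs",
          "cpus": 512, "queued": 42}
print(to_glue(native))
```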
The EGEE Infrastructure
• Test-beds & services: production service; pre-production service; certification test-beds (SA3)
• Support structures & processes: Operations Coordination Centre; Regional Operations Centres; Global Grid User Support; EGEE Network Operations Centre (SA2); Operations Advisory Group (+NA4)
• Security & policy groups: Operational Security Coordination Team; Joint Security Policy Group; EuGridPMA (& IGTF); Grid Security Vulnerability Group
• Training: training infrastructure (NA4); training activities (NA3)
User Support
• GGUS – now well established
  – Used more and more
  – 10 of the 11 ROCs provide dedicated effort to manage the process, similar to the operator-on-duty teams
  – Development plan (DSA1.1) and assessment of progress (MSA1.8) delivered
(Chart: number of tickets processed by GGUS per month, Apr-06 to Mar-07, split into operations (CoD), network (ENOC), others and total.)
Grid Operations
• Grid Operator on Duty ("CoD")
  – Teams from 10 of the 11 ROCs participate
  – 5-weekly rotations: each week one team is primary and one team is backup
  – Critical activity in maintaining usability and stability of sites
  – Important tools:
    • Site Availability Tests (SAM)
    • Information system monitoring
    • GGUS system for trouble-ticket management
    • Portal for operations: https://cic.gridops.org
• Significant work on operations procedures
  – Evolved throughout EGEE and EGEE-II
  – Contribute to the establishment of regional grid infrastructures through related projects – well beyond Europe now
Accounting
• Accounting system set up by UK/I – now well established, all sites reporting into it
  – Now starting to deploy a version that reports by user
  – The user DN is encrypted for privacy
  – A policy (in draft) defines who can access what information and for what purpose
• Storage accounting – prototype available now
  – Schema has been defined
  – Uses the information system to publish available and used storage space, for different classes of storage
  – A sensor queries the BDII and stores the data into R-GMA and the APEL system
  – The portal to query the data is based on the CPU accounting portal
Security & Policy
Collaborative policy development
• Many policy aspects are collaborative works, e.g.:
  – Joint Security Policy Group
  – Certification Authorities: EUGridPMA, IGTF, etc.
  – Grid Acceptable Use Policy (AUP)
    • A common, general and simple AUP for all VO members using many Grid infrastructures: EGEE, OSG, SEE-GRID, DEISA, national Grids, …
  – Incident Handling and Response
    • Defines basic communication paths
    • Defines requirements (MUSTs) for incident response
    • Not intended to replace or interfere with local response plans
(Diagram: the Security & Availability Policy ties together Usage Rules, Certification Authorities, Audit Requirements, Incident Response, User Registration & VO Management, the Application Development & Network Admin Guide, and VO Security.)
Security groups
• Joint Security Policy Group
  – Joint with WLCG, OSG, and others
  – Focus on policy issues
  – Strong input to e-IRG
• EUGridPMA
  – Pan-European trust federation of CAs
  – Included in IGTF (and was the model for it)
  – Success: most grid projects now subscribe to the IGTF
• Grid Security Vulnerability Group
  – New group in EGEE-II
  – Looking at how to manage vulnerabilities
  – Risk analysis is fundamental
  – Hard to balance openness against giving away insider information
• Operational Security Coordination Team
  – Main day-to-day operational security work
  – Incident response and follow-up
  – Members in all ROCs and sites
  – A recent security incident (not grid-related) was a good shakedown
(IGTF regional PMAs: EUGridPMA (Europe), TAGPMA (the Americas), APGridPMA (Asia-Pacific).)
Grid Monitoring
• Becoming a critical activity to achieve reliability and stability
• Working areas:
  – System management: fabric management, best practices, security, …
  – Grid services: grid sensors, transport, repositories, views, …
  – System analysis: application monitoring, …
• Goals:
  – "… to help improve the reliability of the grid infrastructure …"
  – "… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service …"
  – "… to gain understanding of application failures in the grid environment and to provide an application view of the state of the infrastructure …"
  – "… improving system management practices …"
• Provide site-manager input to requirements on grid monitoring and management tools
• Propose existing tools to the grid monitoring working group
• Produce a Grid Site Fabric Management cook-book
• Identify training needs
Prototype site implementation
(Screenshot: per-service checks.)
GridMap Prototype Visualization (http://gridmap.cern.ch)
(Screenshot: the GridMap shows the grid topology as nested rectangles; the metrics used for rectangle size and colour can be selected, SAM status and GridView availability data can be overlaid, individual VOs and sites or site services can be selected, a colour key and a description of the current view are shown, context-sensitive information pops up, and clicking a region title drills down into it.)
Availability metrics - GridView