
EGEE Infrastructure, Services, & Operations

Date post: 02-Feb-2016
Category:
Upload: jera
Transcript
Page 1: EGEE Infrastructure, Services, & Operations

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE Infrastructure, Services, & Operations

Ian Bird, CERN IT
SA1 Activity Leader

1st EGEE User Forum, 2nd March 2006

Page 2: EGEE Infrastructure, Services, & Operations

Outline

• Introduction – history
• Middleware and Services
• Middleware distributions
• Operations
• User Support
• Access to resources & Introducing new VOs
• What can you get from EGEE?
– And what does it cost?
• From EGEE to EGEE-II
• Outlook

SA1 – Operations & Management 97%

SA2 – Network Services 3%

Page 3: EGEE Infrastructure, Services, & Operations


Introduction

Page 4: EGEE Infrastructure, Services, & Operations

History

• EGEE infrastructure (middleware distribution and operations) was built up during 18 months prior to the start of EGEE by the LCG project
– The LCG work formed the basic infrastructure of EGEE
– The middleware distribution retained this name (LCG-2.x) as it was expected to be replaced by gLite
– Now the middleware distribution will evolve with additional or replacement services coming from gLite or elsewhere
• EGEE started in April 2004 with a running grid infrastructure
– 40 sites, 3000 CPU
– Basic operations
– Developed certification and deployment process
• Now expanded to:
– 200 sites, >20 000 CPU, 40 countries
– Managed operations – stability of sites
– >10 000 jobs / day sustained over the last year

[Charts: Sites, CPU, Jobs/day]

Page 5: EGEE Infrastructure, Services, & Operations


Page 6: EGEE Infrastructure, Services, & Operations
Page 7: EGEE Infrastructure, Services, & Operations


Middleware & Services

Page 8: EGEE Infrastructure, Services, & Operations

Grid middleware

• Middleware is software and services that sit between the user application and the underlying computing and storage resources, to provide uniform access to those resources.
• The Grid middleware services should:
– Find convenient places for the application to be run
– Optimise use of resources
– Organise efficient access to data
– Deal with authentication to the different sites that are used
– Run the job & monitor progress
– Recover from problems
– Transfer the result back to the scientist

Page 9: EGEE Infrastructure, Services, & Operations

Middleware Distributions and Stacks

• Terminology:
– EGEE deploys a middleware distribution
    Drawn from various middleware products, stacks, etc.
    Do not confuse the distribution with development projects or with software packages
    Count on 6 months from software developer “release” to production deployment
– The EGEE distribution:
    Current production version labelled: LCG-2.7.0
    Next version labelled: gLite-3.0
    Name change to hopefully reduce confusion
• EGEE distribution contents:

LCG-2.7.0:
– VDT – packaging Globus 2.4, Condor, MyProxy
– EDG workload management
– LCG components:
    BDII (info sys), catalogue (LFC), DPM, data management libraries and CLI tools, monitoring tools
– gLite: R-GMA, VOMS, FTS

gLite-3.0:
– Based on LCG-2.7.0, and
– gLite workload management
– Other gLite components (not in the distribution but provided as services):
    AMGA, Hydra, Fireman, gLite-IO

[Diagram label between the LCG-2.7.0 and gLite-3.0 columns: evolution]

Page 10: EGEE Infrastructure, Services, & Operations

CAs, Authentication, Authorization

Authentication
• Use of GSI, X.509 certificates
– Generally issued by national certification authorities
• Agreed network of trust:
– International Grid Trust Federation (IGTF)
    EUGridPMA
    APGridPMA
    TAGPMA
– All EGEE sites will usually trust all IGTF root CAs

Authorization
• Until LCG-2.7.0 via grid-map files only
• From LCG-2.7.0 using VOMS extended proxies
– Call-outs to local authorization services
– Integration with grid services under way – compute elements, storage systems
– For some time the authorization will be a mixture of call-outs and grid-map files until all services understand extended proxies

[Figure: IGTF regions. TAGPMA (The Americas Grid PMA), EUGridPMA (European Grid PMA), APGridPMA (Asia-Pacific Grid PMA)]
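The mixed authorization mode described on this slide (VOMS attributes where a service understands them, grid-map files otherwise) can be illustrated with a small sketch. This is not the EGEE or gLite implementation: the function names, the `role_accounts` table, and the sample DN and account names are all hypothetical, and only the basic `"DN" account` line format of a grid-map file is assumed.

```python
import shlex


def parse_gridmap(text):
    """Parse grid-map style lines of the form: "certificate DN" local_account."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # shlex handles the quoted DN
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def authorize(dn, voms_attributes, role_accounts, gridmap):
    """Return a local account name, or None if the user is not authorized.

    Hypothetical policy: prefer a VOMS attribute the service understands,
    then fall back to the grid-map entry for the certificate DN.
    """
    for attribute in voms_attributes:
        if attribute in role_accounts:
            return role_accounts[attribute]
    return gridmap.get(dn)


# Illustrative data only; the DN, attribute, and accounts are made up.
GRIDMAP_TEXT = '# sample\n"/C=CH/O=CERN/CN=Jane Doe" dteam001\n'
gridmap = parse_gridmap(GRIDMAP_TEXT)
roles = {"/dteam/Role=production": "dteamprd"}
```

With this sketch, a proxy carrying the `/dteam/Role=production` attribute maps to `dteamprd`, while a plain proxy for the same DN falls back to `dteam001` from the grid-map data.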

Page 11: EGEE Infrastructure, Services, & Operations

Basic Services

Job Management:
• Workload Management
– Resource Broker
– DLI/SI interface to catalogues for data-based scheduling
– Bulk job submission (gLite-3.0)
– DAGs (gLite-3.0)
– Push/pull mode (pull untested – gLite-3.0)
• Compute Element (CE):
– Globus/EDG/LCG Condor_C (VO-based scheduling) in gLite-3.0
• Logging & Bookkeeping
• Local Batch systems:
– LSF, PBS, Condor, (Sun Grid Engine)
• Additional tools:
– Ability to “peek” at stdout/stderr of running jobs
– User job monitoring – look at the status (state, cpu time, etc) of running jobs

Data Management
• File and replica catalogues (LFC)
– Central or local (not distributed)
– Replication via Oracle, or squid caches tested by LCG
– Secure
• File Transfer Service (FTS)
– Reliable data transfer
– Uses gridftp or srmcopy as transport
• Storage Elements based on SRM interface
– DPM: implements Posix ACLs, VOMS roles/groups (gLite-3.0)
– Other available SEs: dCache, Castor
– Deprecated: “Classic SE” – basically just gridftp
• Metadata catalogue:
– AMGA (gLite-3.0 – partial support)
• Secure Keystore:
– Hydra (gLite-3.0 – partial support)
• Utilities and IO libraries:
– Lcg-utils
– GFAL – this is the SRM client library
– gLiteIO – expect functionality to be replaced

Page 12: EGEE Infrastructure, Services, & Operations

Other services

Information system
• BDII (implementation of Globus MDS)
• GLUE schema
• Several tools to access information
• FCR site selection tool (see next slide)

Monitoring & Accounting
• R-GMA used as monitoring framework
• Aggregation for various sources of monitoring data
• Accounting: APEL package:
– After-the-fact accounting
– Uses GGF Usage Record as schema
– Does not provide user-level data – but this is a legal/privacy issue not technical!

Page 13: EGEE Infrastructure, Services, & Operations

Selecting resources

• Selecting resources:
– Tool that uses dynamically updated data about sites
    Site functional tests
– VO can:
    Select critical tests
    White/black list sites
– VO gets a customised set of “good” sites – a view in the information system
– VO can add VO-specific tests
• Can be used by RB or other workload management system to run on good/stable sites
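The selection logic on this slide can be sketched roughly as follows. This is only an illustration, not the FCR tool itself: the function name, the data layout, the site and test names, and the rule that a whitelisted site is kept even when a critical test fails are all assumptions made for the example.

```python
def good_sites(sft_results, critical_tests, whitelist=frozenset(), blacklist=frozenset()):
    """Return the sorted list of sites a VO would consider "good".

    sft_results maps site name -> {test name: passed?} (hypothetical layout).
    A site is kept if it passes every test the VO marked as critical.
    Blacklisted sites are always dropped; whitelisted sites are always kept
    (an assumed interpretation of white/black listing).
    """
    selected = []
    for site, results in sorted(sft_results.items()):
        if site in blacklist:
            continue
        passes_critical = all(results.get(test, False) for test in critical_tests)
        if passes_critical or site in whitelist:
            selected.append(site)
    return selected


# Illustrative data only.
SFT = {
    "site-a": {"job-submit": True, "replica-mgmt": True},
    "site-b": {"job-submit": True, "replica-mgmt": False},
    "site-c": {"job-submit": False, "replica-mgmt": True},
}
view = good_sites(SFT, ["job-submit", "replica-mgmt"], whitelist={"site-c"})
```

Here `view` contains `site-a` (all critical tests pass) and `site-c` (whitelisted), while `site-b` is left out. A VO that only marks `job-submit` as critical would also get `site-b`.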

Page 14: EGEE Infrastructure, Services, & Operations


Selecting resources

Page 15: EGEE Infrastructure, Services, & Operations


Selecting resources

Page 16: EGEE Infrastructure, Services, & Operations


Middleware distributions Deployment

Page 17: EGEE Infrastructure, Services, & Operations

Process to deployment

[Diagram. Boxes: Middleware providers (JRA1, VDT/OSG, OMII-Europe); Integration; Testing & Certification; Pre-production service; Production service. Activity labels: SA3, SA1; Certification activities SA3+SA1; Support, analysis, debugging.]

Page 18: EGEE Infrastructure, Services, & Operations

Release Process (simplified)

[Flow diagram with numbered steps 1 to 7. Recoverable labels: Bugs/Patches/Tasks in Savannah (assign and update cost); prioritization & selection, with Head of Deployment, Developers, Applications, EIS, GIS, GDB, RC, CICs; list for next release (can be empty); components ready at cutoff; integration & first tests (C&T); Internal Releases; user-level install of client tools (EIS); full deployment on test clusters, functional/stress tests, ~1 week (C&T); Internal Client Release; Client Release; Service Release; Updates Release; Core Service Release.]

Page 19: EGEE Infrastructure, Services, & Operations

Deployment process

[Flow diagram. Recoverable labels: Release(s); Certification is run daily; Update User Guides (EIS); Update Release Notes (GIS); Release Notes, Installation Guides, User Guides; Re-Certify (CIC); Client Release: deploy client releases (user space) (GIS); deploy service releases (optional) (CICs, RCs); deploy major releases (mandatory) (ROCs, RCs); YAIM. Timing labels: every month (twice); every 3 months, on fixed dates; at own pace.]

Page 20: EGEE Infrastructure, Services, & Operations

Certification test bed

[Diagram: CertTB layout with six clusters (Cluster_1 to Cluster_6). Node labels include RB, BDII, MDS, CE, SE, WN (worker nodes), UI (UI_1, UI_4), Proxy, RLS_MySQL and RLS_oracle. Other labels: LSF, Condor, LCFGng, Lite install.]

Page 21: EGEE Infrastructure, Services, & Operations

Time to upgrade

• Time to upgrade ~constant (~2.5 sites/day)
• Takes a long time to upgrade entire infrastructure
• Better now than it was – site functional tests and operational oversight
• Need to move away from the need to do full upgrades more than 1-2 times / year
– But need to be able to deploy updates, new tools, security patches, etc.

[Chart label: LCG-2.6.0]
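To make "takes a long time" concrete, the quoted rate can be combined with the site count given on the History slide. The short calculation below is a back-of-the-envelope sketch: it assumes the ~2.5 sites/day rate stays constant and applies to all ~200 sites, which the slides do not state explicitly.

```python
def days_to_upgrade(num_sites, sites_per_day):
    """Days needed to upgrade every site at a constant upgrade rate."""
    return num_sites / sites_per_day


# ~200 sites (History slide) at ~2.5 sites/day (this slide).
full_upgrade_days = days_to_upgrade(200, 2.5)
print(f"Full infrastructure upgrade: about {full_upgrade_days:.0f} days")
```

Under these assumptions a full upgrade takes about 80 days, so more than a few full upgrades per year would leave parts of the infrastructure almost permanently mid-upgrade.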

Page 22: EGEE Infrastructure, Services, & Operations

Desired scenario

• Steady-state with:
– Components delivered (as far as possible) independent of each other
– Developed according to realistic schedules – not constrained by artificial release deadlines
– Production service running stable, tested (certified) versions of services and tools
    Major upgrades only 1 or 2 times per year
    Potential for upgrading individual services
    Client tools: new versions deployed as needed
    Emphasis on reliability, stability, performance, backward compatibility, …
– Pre-production service running new, but certified versions of services
    Anticipated as upgrades to production services (beta releases of next versions or new services)
    Allowing reasonable scale application testing and integration with new versions
– Certification testbed running full regression, stress, and functional tests
    Pre-requisite before moving to pre-production and production
• Software can be rejected (not working, not ready, … )
– During testing/certification
– During pre-production
• Net result must be that the production service is stable and as reliable as possible; and evolves incrementally and in a controlled way
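The gating described here (certification first, then pre-production, with rejection possible at either stage) can be sketched as a simple promotion function. The function and the component names are hypothetical and only illustrate the ordering of the stages, not any actual EGEE tooling.

```python
GATING_STAGES = ("certification", "pre-production")


def promote(component, passes):
    """Walk a component through the gating stages in order.

    passes maps stage name -> True/False (did the component pass that stage?).
    A component reaches production only after passing every gating stage;
    a missing entry is treated as "not passed".
    """
    for stage in GATING_STAGES:
        if not passes.get(stage, False):
            return f"{component}: rejected during {stage}"
    return f"{component}: deployed to production"


# Illustrative component names only.
print(promote("service-x", {"certification": True, "pre-production": True}))
print(promote("service-y", {"certification": True, "pre-production": False}))
```

The point of the sketch is that production never sees a version that skipped a stage: a component failing certification is not even tried in pre-production.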

Page 23: EGEE Infrastructure, Services, & Operations

Checklist for a new service

• User support procedures (GGUS)
– Troubleshooting guides + FAQs
– User guides
• Operations Team Training
– Site admins
– CIC personnel
– GGUS personnel
• Monitoring
– Service status reporting
– Performance data
• Accounting
– Usage data
• Service Parameters
– Scope - Global/Local/Regional
– SLAs
– Impact of service outage
– Security implications
• Contact Info
– Developers
– Support Contact
– Escalation procedure to developers
• Interoperation
– Documented issues
• First level support procedures
– How to start/stop/restart service
– How to check it’s up
– Which logs are useful to send to CIC/Developers and where they are
• SFT Tests
– Client validation
– Server validation
– Procedure to analyse these: error messages and likely causes
• Tools for CIC to spot problems
– GIIS monitor validation rules (e.g. only one “global” component)
– Definition of normal behaviour
    Metrics
• CIC Dashboard
– Alarms
• Deployment Info
– RPM list
– Configuration details
– Security audit

This is what it takes to make a reliable production service from a middleware component

Not much middleware is delivered with all this … yet

Page 24: EGEE Infrastructure, Services, & Operations


Operations

Page 25: EGEE Infrastructure, Services, & Operations

Grid Operations

• Services:
– Production service
– Pre-production service
– Operational security – incident response
• Operation process, includes:
– Problem detection
– Reporting
– Problem solving
– Escalation procedures

Page 26: EGEE Infrastructure, Services, & Operations

EGEE Operations Structure

• Operations Management Centre (OMC)
• Core Infrastructure Centres (CIC)
– Manage daily grid operations – oversight, troubleshooting
    “Operator on Duty”
– Run infrastructure services
– UK/I, Fr, It, CERN, Ru, Taipei
• Regional Operations Centres (ROC)
– Front-line support for user and operations issues
– Provide local knowledge and adaptations
– One in each region – many distributed
• User Support Centre (GGUS)
– In FZK: provide single point of contact (service desk) + portal.

Page 27: EGEE Infrastructure, Services, & Operations

EGEE Operations Process

• Grid operator on duty
– 6 teams working in weekly rotation
    CERN, IN2P3, INFN, UK/I, Ru, Taipei
– Crucial in improving site stability and management
• Operations coordination
– Weekly operations meetings
– Regular ROC, CIC managers meetings
– Series of EGEE Operations Workshops
    Nov 04, May 05, Sep 05, (June 06?)
• Geographically distributed responsibility for operations:
– There is no “central” operation
– Tools are developed/hosted at different sites:
    GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
• Procedures described in Operations Manual
– Introducing new sites
– Site downtime scheduling
– Suspending a site
– Escalation procedures
– etc
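The weekly operator-on-duty rotation among the six teams can be pictured with a small scheduling sketch. The team list comes from the slide; the order of the rotation, the start date, and the function name are assumptions chosen only for illustration.

```python
from datetime import date

# Teams as listed on the slide; the rotation order is assumed.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]

# Hypothetical start of the first rotation week (a Monday).
ROTATION_START = date(2006, 1, 2)


def team_on_duty(day, start=ROTATION_START):
    """Return the team on duty for the week containing the given day."""
    weeks_elapsed = (day - start).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]


print(team_on_duty(date(2006, 1, 2)))   # first week of the assumed rotation
print(team_on_duty(date(2006, 1, 11)))  # second week
```

With six teams, each team is on duty one week in six, which spreads the operational load and the expertise across the participating centres.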

Page 28: EGEE Infrastructure, Services, & Operations

Operations tools: Dashboard

• Dashboard provides top level view of problems:
– Integrated view of monitoring tools (SFT, GStat) shows only failures and assigned tickets
– Single tool for ticket creation and notification emails with detailed problem categorisation and templates
– Detailed site view with table of open tickets and links to monitoring results
– Ticket browser highlighting expired tickets

[Screenshot callouts: Test summary (SFT, GStat); GGUS ticket status; Problem categories; Sites list (reporting new problems)]

Developed and operated by CC-IN2P3: http://cic.in2p3.fr/

Page 29: EGEE Infrastructure, Services, & Operations

Operations/deployment support

[Diagram. Labels: Operations Coordination Centre; OSCT; JSPG; Regional Operations Centres; Resource Centres; Coordination, Middleware deployment; Operational security coordination; 1st Level support; 2nd Level support.]

JSPG: Joint Security Policy Group
OSCT: Operational Security Coordination Team

Page 30: EGEE Infrastructure, Services, & Operations

Operations support workflows

• Monitoring shows a problem
• Operator submits a GGUS ticket against the ROC and cc’s the site. The ticket is followed until it is solved
• ROC and Site work to resolve the problem

[Diagram. Labels: Grid Operator on-duty; OSCT; Regional Operations Centres; Resource Centres; 1st Level support; 2nd Level support.]

Page 31: EGEE Infrastructure, Services, & Operations

Evolution of SFT metric

[Chart, daily values from July to November: Available sites and Available CPU; one interval is marked “Missing log data”.]

Page 32: EGEE Infrastructure, Services, & Operations

Security Policy

• Joint Security Policy Group
– EGEE with strong input from OSG
– Policy Set:
    [Diagram: Security & Availability Policy, surrounded by Usage Rules, Certification Authorities, Audit Requirements, Incident Response, User Registration & VO Management, Application Development & Network Admin Guide, VO Security]
• Policy Revisions
– Grid Acceptable Use Policy (AUP)
    https://edms.cern.ch/document/428036/
    common, general and simple AUP for all VO members using many Grid infrastructures
    • EGEE, OSG, SEE-GRID, DEISA, national Grids…
– VO Security
    https://edms.cern.ch/document/573348/
    responsibilities for VO managers and members
    VO AUP to tie members to Grid AUP accepted at registration
– Incident Handling and Response
    https://edms.cern.ch/document/428035/
    defines basic communications paths
    defines requirements (MUSTs) for IR
    • reporting
    • response
    • protection of data
    • analysis
    not to replace or interfere with local response plans

Page 33: EGEE Infrastructure, Services, & Operations

Operational Security Coordination Team (OSCT)

– What it is not:
    Not focused on middleware security architecture
    Not focused on vulnerabilities (see Vulnerabilities Group)
– Focus on Incident Response Coordination
    Assume it’s broken, how do we respond?
    Planning and Tracking
– Focus on ‘Best Practice’
    Advice
    Monitoring
    Analysis
– Coordinators for each EGEE ROC plus OSG LCG Tier 1 + Taipei
• OSCT membership: ROC security contacts

[Diagram, “3 strategies”. Labels: Incident Response; Security Service Challenge (SSC1 - Job Trace, SSC2 - Storage Audit); Monitoring Tools; Handbook; Playbook; Policy; Procedures; Resources; Reference; Infrastructure; Agents; Deployment.]

Page 34: EGEE Infrastructure, Services, & Operations

Vulnerability Group

• Was set up last summer (CCLRC lead)
• Purpose: inform developers, operations, site managers of vulnerabilities as they are identified and encourage them to produce fixes or to reduce their impact
• Set up (private!) database of vulnerabilities
– To inform sites and developers
• Urgent action: OSCT to manage
• After reaction time (45 days)
– Vulnerability and risk analysis given to OSCT to define action – publication?
– Will not publish vulnerabilities with no solution
• Intend to report progress and statistics on vulnerabilities by middleware component and response of developers
• Balance between open responsible public disclosure and creating security issues with precipitous publication
• Following first experience in implementing this process, review of procedures under way, including need for appropriate risk analyses
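The 45-day reaction time and the "no publication without a solution" rule can be expressed as a small decision function. This is a sketch of the process as summarised on the slide, not the group's actual procedure; the function name, the status strings, and the example dates are made up for illustration.

```python
from datetime import date, timedelta

REACTION_TIME = timedelta(days=45)  # reaction time quoted on the slide


def disclosure_status(reported_on, today, has_solution):
    """Classify a vulnerability entry according to the rules on the slide."""
    if today < reported_on + REACTION_TIME:
        # Still within the reaction time: kept in the private database.
        return "private: reaction time running"
    if not has_solution:
        # Vulnerabilities with no solution are not published.
        return "with OSCT: not published (no solution)"
    return "with OSCT: publication to be decided"


# Hypothetical report date, used only to exercise the function.
reported = date(2006, 1, 1)
print(disclosure_status(reported, date(2006, 2, 1), has_solution=False))
print(disclosure_status(reported, date(2006, 3, 1), has_solution=True))
```

Note that even after the reaction time the sketch only hands the case to the OSCT; whether anything is published remains an OSCT decision, as the slide's "publication?" suggests.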

Page 35: EGEE Infrastructure, Services, & Operations


User Support

Page 36: EGEE Infrastructure, Services, & Operations

Goals

• A single access point for support
• A portal with well structured information and updated documentation
• Knowledgeable experts
• Correct, complete and responsive support
• Tools to help resolve problems
– search engines
– monitoring applications
– resources status
• Examples, templates, specific distributions for software of interest
• Interface with other Grid support systems
• Connection with developers, deployment, operation teams
• Assistance during production use of the grid infrastructure

Page 37: EGEE Infrastructure, Services, & Operations

The Support Model: "Regional Support with Central Coordination"

The ROCs, VOs and other project-wide groups such as the Core Infrastructure Center (CIC), middleware groups (JRA), network groups (NA), service groups (SA) are connected via a central integration platform provided by GGUS.

[Diagram: Central Application (GGUS) with web portal interface, connected to TPM, VO Support, Deployment Support, Middleware Support, Network Support, Operations Support, and ROC 1 … ROC 10. Group labels: Regional Support units, User Support units, Technical Support units.]

Page 38: EGEE Infrastructure, Services, & Operations


The GGUS System

Page 39: EGEE Infrastructure, Services, & Operations

GGUS Portal: user services

• Browseable tickets
• Search through solved tickets
• Useful links (Wiki FAQ)
• Broadcast tools
• Latest News
• GGUS Search Engine
• Updated documentation (Wiki FAQ)

Page 40: EGEE Infrastructure, Services, & Operations

[Diagram: support lines as seen by the User.
First line support: TPM (Grid experts), VO-TPM (VO experts), GGUS Supporters.
Second line support: VO Support Units, Middleware Support Units, Deployment Support Units, Operations Support, ROC Support Units, Network Support.]

Page 41: EGEE Infrastructure, Services, & Operations

Performance statistics

Tickets per Submitter (labels recovered from the chart): CIC 144; GGUS user 137

Average processing times, hh:mm:ss. Each chart shows two bars: average time from ticket creation to ticket assignment, and average time from ticket assignment to ticket solution. Values are listed in the order they appear:
• cms tickets, September: 00:00:00 and 140:01:52
• cms tickets, October: 0:18:35 and 21:03:45
• TPM, October: 0:01:34 and 2:38:37
• all ROCs, October: 1:35:16 and 41:59:13

A peak of 80 tickets per day has been reached.

[Bar chart: amount of tickets per support unit. Unit labels: Castor, Generic, Deployment, GlobalGridUserSupport, NetworkOperations, ROC_Asia/Pacific, ROC_CE, ROC_CERN, ROC_DE/CH, ROC_France, ROC_Italy, ROC_North, ROC_Russia, ROC_SE, ROC_SW, ROC_UK/Ireland, Security, Management, TPM, VO Support, Workload Management. Bar values as extracted, not matched to labels: 15, 22, 1, 11, 18, 23, 34, 19, 46, 16, 13, 15, 26, 42, 4, 16, 2, 1.]

November 2005: 315 tickets
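The processing times on this slide are given as hh:mm:ss with hour counts that can exceed 24, which makes them awkward to compare by eye. The sketch below converts such strings into durations; the helper name is hypothetical, and the sample values are the two largest figures quoted on the slide (140:01:52 for cms in September and 41:59:13 for all ROCs in October).

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


cms_september = parse_hms("140:01:52")
all_rocs_october = parse_hms("41:59:13")

# Express the durations in days for easier comparison.
print(f"cms, September: {cms_september.total_seconds() / 86400:.1f} days")
print(f"all ROCs, October: {all_rocs_october.total_seconds() / 86400:.1f} days")
```

Read this way, the 140-hour figure corresponds to just under six days, and the 42-hour figure to a little under two days.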

Page 42: EGEE Infrastructure, Services, & Operations


New VOs; Access to Resources; Benefits & Costs

Page 43: EGEE Infrastructure, Services, & Operations

How new VOs find resources

Various possibilities:
1. Pilot applications:
– Expectation that they have access to resources provided by many partners
    For EGEE-II this is specified in TA
2. Applications reviewed and approved by EGAAP:
– Negotiation via OAG to understand which ROCs/sites are willing to
    Run services on behalf of the VO
    Provide compute and/or storage resources
3. Other (self supporting) applications
    Own their own resources
    Use EGEE infrastructure, operations, support
    Many successful examples of such VOs

• 1 & 2:
– Formal agreements (TA or MoU)
– Should expect support via NA4 – but should also build up internal support teams
– Expected to collaborate on improving the service – not just “users”
• 1, 2 & 3:
– Full user and operations support
– VOs need to provide support teams – some problems are application problems!

Page 44: EGEE Infrastructure, Services, & Operations

Negotiation

Operations Advisory Group (OAG)
• Brings together VOs and resource providers (ROCs)
• Negotiate for services and resources
• Should not always be an expectation of “free” resources
– In future applications should bring some resources with them
– Computational and storage resources are not funded (!) by the project

Page 45: EGEE Infrastructure, Services, & Operations

EGEE – What can it deliver?

• A managed operation – providing a service:
– A large number of sites of different sizes and capabilities
– Developed operational procedures
    Monitoring of the grid services providing access to resources
– Operational security support; incident response coordination
– Support services: user support, training, etc.
– Building up considerable experience in grid-enabling a variety of different applications
– Tools for monitoring of resources at a site … if required
• A new VO joining EGEE with a few sites:
– Benefits from the operations and support – the VO sites can be monitored and supported as part of the infrastructure
– Potentially access to other resources
– It is a significant effort to set up a grid infrastructure from scratch

Page 46: EGEE Infrastructure, Services, & Operations

… and what does it cost?

• “The application VO buys into the EGEE model”
– Actually not so restrictive now – supports many Linux flavours, IA64 (other teams have worked on AIX, SGI ports)
– Simple installation of client software now (can be done on the fly)
– Basic grid services are quite general, nothing really application-specific
• Some unresolved issues:
– Commercial licensed software used by an application
– Levels of privacy/security needed in some life-science applications
– True interactivity
• … and of course, this is all new and rapidly evolving, with many problems still to be overcome
• VOs should:
– Provide application support effort to help other VO users
– Invest effort into helping improve the infrastructure and services – should not be simple “client – server” – rather a collaboration

Page 47: EGEE Infrastructure, Services, & Operations


Future

Page 48: EGEE Infrastructure, Services, & Operations

From EGEE to EGEE-II

• Simplify operations structure
– ROCs absorb CIC roles – spread of expertise
• Introduce SA3
– Integration, certification, distribution preparation
– Emphasises focus on stability, reliability, performance rather than new features
– Mechanism for integrating non-EGEE software – according to need
• Increased emphasis on
– Platform support (OS, 64-bit, etc)
– Interoperability with other grids (international, regional, national, local, campus) and other middleware stacks (Unicore, ARC, …)

SA: 54% of total
• SA1 (operations): 86%
• SA2 (network): 3%
• SA3 (certification): 11%
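The three activity percentages sum to 100, which suggests they are shares of the SA portion rather than of the whole project. Under that reading (an assumption, since the slide does not say so explicitly), each activity's share of the project total follows by multiplying with the 54% figure:

```python
SA_PERCENT_OF_TOTAL = 54  # "SA: 54% of total"

# Shares within SA, in percent, as listed on the slide.
SHARE_WITHIN_SA = {"SA1": 86, "SA2": 3, "SA3": 11}

# Share of the whole project, in percent (assumed interpretation).
share_of_total = {
    activity: SA_PERCENT_OF_TOTAL * percent / 100
    for activity, percent in SHARE_WITHIN_SA.items()
}

for activity, percent in share_of_total.items():
    print(f"{activity}: {percent:.2f}% of the project total")
```

On this reading, operations (SA1) alone would account for a little under half of the whole project, with certification (SA3) at roughly 6% and network services (SA2) under 2%.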

Page 49: EGEE Infrastructure, Services, & Operations

Outlook

• LHC VOs must achieve reliable production and analysis in 2006
– Will be making significant use of resources
• Consolidate and improve existing services: Focus on
– Reliability, robustness
– Manageability
– Performance, scalability
– Evolution or replacement of services driven by needs of application (or security/manageability)
• Expand grid operations
– Spread expertise to ROCs
– Collaboration with OSG, A-P
– Start to negotiate SLAs
• New applications
– Must bring resources – show commitment
– Resource sharing and negotiation – must become streamlined
    Will need a mechanism for cost/credit for use of resources

Page 50: EGEE Infrastructure, Services, & Operations

Summary

• EGEE Infrastructure – world’s largest multi-science production grid service
– But does not exist in isolation: interoperability and interoperation are essential
• Significant improvements in reliability and stability over the last year
• Is in constant use for significant production work
– Many VOs now use it as their primary resource
• Middleware distribution is
– Consolidating existing and new services
– Basis for evolution according to needs
• Shift from EGEE to EGEE-II
– No major changes, but adjustments based on experience and anticipated evolution
– Refine and improve processes

