
The High Energy Physics Community Grid Project

Inside D-Grid

ACAT 07 - Torsten Harenberg, University of Wuppertal

harenberg@physik.uni-wuppertal.de

2/27

D-Grid organisational structure

3/27

Technical infrastructure

[Diagram: the D-Grid technical infrastructure. Users and communities (data/software) reach the D-Grid resources through a user API, a GridSphere-based portal and the GAT API. The D-Grid services layer comprises the middleware (UNICORE, Globus Toolkit V4, LCG/gLite), scheduling and workflow management, data management, accounting and billing, monitoring, I/O, and security and VO management, sitting on core services, distributed data services, distributed computing resources and the network.]

4/27

HEP Grid efforts since 2001

[Timeline 2000-2010: EDG, followed by EGEE, EGEE 2 and possibly EGEE 3; LCG R&D followed by the WLCG ramp-up; GridKa / GGUS throughout; the D-Grid Initiative with DGI, DGI 2 and HEP CG; planned LHC pp run March-September and heavy-ion run in October; "today" marks 2007.]

5/27

LHC groups in Germany

Alice: Darmstadt, Frankfurt, Heidelberg, Münster

ATLAS: Berlin, Bonn, Dortmund, Dresden, Freiburg, Gießen, Heidelberg, Mainz, Mannheim, München, Siegen, Wuppertal

CMS: Aachen, Hamburg, Karlsruhe

LHCb: Heidelberg, Dortmund

6/27

German HEP institutes participating in WLCG

WLCG: Karlsruhe (GridKa & Uni), DESY, GSI, München, Aachen, Wuppertal, Münster, Dortmund, Freiburg

7/27

HEP CG participants:

Participants: Uni Dortmund, TU Dresden, LMU München, Uni Siegen, Uni Wuppertal, DESY (Hamburg & Zeuthen), GSI

Associated partners: Uni Mainz, HU Berlin, MPI f. Physik München, LRZ München, Uni Karlsruhe, MPI Heidelberg, RZ Garching, John von Neumann Institut für Computing, FZ Karlsruhe, Uni Freiburg, Konrad-Zuse-Zentrum Berlin

8/27

HEP Community Grid

WP 1: Data management (dCache)

WP 2: Job Monitoring and user support

WP 3: Distributed data analysis (Ganga)

==> Joint venture between physics and computer science

9/27

WP 1: Data management. Coordination: Patrick Fuhrmann

An extensible metadata catalogue for semantic data access:

Central service for gauge theory

DESY, Humboldt Uni, NIC, ZIB

A scalable storage element:

Using dCache on multi-scale installations.

DESY, Uni Dortmund E5, FZK, Uni Freiburg

Optimized job scheduling in data-intensive applications:

Data and CPU Co-scheduling

Uni Dortmund CEI & E5

10/27

WP 1: Highlights

Establishing a metadata catalogue for gauge theory

Production service of a metadata catalogue with > 80,000 documents.

Tools to be used in conjunction with LCG data grid

Well established in international collaboration

http://www-zeuthen.desy.de/latfor/ldg/

Advancements in data management with new functionality

dCache could become a quasi-standard in WLCG

Good documentation and an automatic installation procedure provide usability from small Tier-3 installations up to Tier-1 sites.

High throughput for large data streams, optimization based on quality and load of the disk storage systems, and high-performance access to tape systems

11/27

dCache-based scalable storage element

dCache project well established

New since HEP CG:

Professional product management, i.e. code versioning, packaging, user support and test suites.

Scales from a single-host installation (~10 terabytes, zero maintenance) up to thousands of pools with well over a petabyte of disk storage and more than 100 file transfers per second, operated with less than 2 FTEs.

12/27

dCache: principle

[Diagram: dCache architecture. Protocol engines handle streaming data ((gsi)FTP, http(g)) and POSIX I/O (xRootd, dCap); storage control goes through SRM and an information protocol. The dCache controller manages the disk storage and is connected to backend tape storage via an HSM adapter.]
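To make the POSIX I/O path concrete: from a worker node, an analysis job can read a file held in dCache through the dCap door with ordinary ROOT calls. The PyROOT lines below are only a sketch; the door host, port, /pnfs path and tree name are invented placeholders, and a ROOT build with the dCap plugin is assumed.

# Sketch: reading a ROOT file from dCache via the dCap door (PyROOT).
# Host, port, path and tree name are illustrative placeholders.
import ROOT

url = "dcap://dcap-door.example.org:22125/pnfs/example.org/data/user/events.root"
f = ROOT.TFile.Open(url)          # dispatches to the dCap plugin for dcap:// URLs
if f and not f.IsZombie():
    tree = f.Get("events")        # assumed tree name
    print("entries:", tree.GetEntries())
    f.Close()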

13/27

dCache: connection to the Grid world

[Diagram: the storage element sits inside the site, behind the firewall, next to the compute element. Within the site, jobs access data via dCap/rfio/root and the storage element publishes itself to the information system. From outside the site, data arrive and leave via gsiFTP, negotiated through SRM (the Storage Resource Manager protocol), with wide-area transfers organized in FTS (File Transfer Service) channels.]
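As a rough sketch of the out-of-site path, a client lets SRM negotiate the transfer with the storage element and then pulls the data over gsiFTP. The snippet below simply shells out to the srmcp client from the dCache SRM tools; the endpoint and paths are placeholders, and a valid Grid proxy (e.g. from voms-proxy-init) is assumed.

# Sketch: copying a file out of a dCache storage element via SRM/gsiFTP.
# The SRM endpoint and file paths are placeholders; a Grid proxy is assumed.
import subprocess

src = "srm://dcache-se.example.org:8443/pnfs/example.org/data/user/events.root"
dst = "file:////tmp/events.root"

# srmcp asks the SRM door for a transfer URL and then moves the data
# over a supported transfer protocol such as gsiFTP.
subprocess.run(["srmcp", src, dst], check=True)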

14/27

dCache: achieved goals

Development of the xRoot protocol for distributed analysis

Small sites: automatic installation and configuration ("dCache in 10 minutes")

Large sites (> 1 Petabyte):

Partitioning of large systems.

Transfer optimization from / to tape systems

Automatic file replication (freely configurable)

15/27

dCache: Outlook

Current usage

7 Tier-1 centres with up to 900 terabytes on disk per centre, plus tape systems (Karlsruhe, Lyon, RAL, Amsterdam, Fermilab, Brookhaven, NorduGrid)

~30 Tier-2 centres, including all US CMS sites; planned for US ATLAS.

Planned usage

dCache is going to be included in the Virtual Data Toolkit (VDT) of the Open Science Grid as the proposed storage element in the USA.

The planned US Tier-1 will break the 2 PB boundary by the end of the year.

16/27

HEP Community Grid

WP 1: Data management (dCache)

WP 2: Job Monitoring and user support

WP 3: Distributed data analysis (Ganga)

==> Joint venture between physics and computer science

17/27

WP 2: Job monitoring and user support. Coordination: Peter Mättig (Wuppertal)

Job monitoring and resource usage visualizer:

TU Dresden

Expert system classifying job failures:

Uni Wuppertal, FZK, FH Köln, FH Niederrhein

Online job steering:

Uni Siegen

18/27

Job monitoring and resource usage visualizer

[Diagram: on each worker node, monitoring sensors and a stepwise Job Execution Monitoring component run alongside the user's physics application. Their data are published through a monitoring box into R-GMA. An analysis web service (e.g. an R-GMA consumer) interfaces to the monitoring systems, and a GridSphere portal server with a monitoring portlet serves a visualisation applet to the user's browser, offering interactivity, overviews, details, timelines and histograms.]

19/27

Integration into GridSphere

20/27

Job Execution Monitor in LCG

[Diagram: LCG job states - submitted, waiting, ready, scheduled, running, then done (ok) or done (failed), cleared, cancelled, aborted. While a job is in the running state, the user cannot tell what is going on.]

Motivation

1000s of jobs each day in LCG

Job status unknown while running

Manual error detection: slow and difficult

GridICE, ...: service/hardware-based monitoring

Conclusion

Monitor the job while it is running ==> JEM

Automatic error detection needed ==> expert system

21/27

JEM: Job Execution Monitor

gLite/LCG worker node: pre-execution test

Script monitoring (Bash, Python)

Information exchange: R-GMA

Visualization: e.g. GridSphere

Expert system for classification

Integration into ATLAS

Integration into GGUS

post D-Grid I: ... ?
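The stepwise script monitoring can be pictured as a wrapper that executes the user's job command by command and reports each step. The toy Python sketch below is not JEM code and does not use R-GMA; it only illustrates the idea under that assumption.

# Toy illustration of stepwise script monitoring (not actual JEM code).
import subprocess
import time

def run_stepwise(commands):
    """Run shell commands one at a time and report each step's outcome."""
    for number, cmd in enumerate(commands, start=1):
        started = time.time()
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        status = "ok" if result.returncode == 0 else "failed (%d)" % result.returncode
        # In JEM the report would be published to the monitoring system;
        # here it is simply printed.
        print("step %d: %r -> %s after %.1fs" % (number, cmd, status, time.time() - started))
        if result.returncode != 0:
            break  # stop at the first failing step

if __name__ == "__main__":
    run_stepwise(["echo setup", "ls /nonexistent/path", "echo analysis"])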

22/27

JEM - status

Monitoring part ready for use

Integration into GANGA (ATLAS/LHCb distributed analysis tool) ongoing

Connection to GGUS planned

http://www.grid.uni-wuppertal.de/jem/

23/27

HEP Community Grid

WP 1: Data management (dCache)

WP 2: Job Monitoring and user support

WP 3: Distributed data analysis (Ganga)

==> Joint venture between physics and computer science

24/27

WP 3: Distributed data analysis. Coordination: Peter Malzacher (GSI Darmstadt)

GANGA: distributed analysis @ ATLAS and LHCb

Ganga is an easy-to-use frontend for job definition and management

Python, IPython or GUI interface

Analysis jobs are automatically split into subjobs, which are sent to multiple sites in the Grid

Data management for input and output; distributed output is collected.

Allows simple switching between testing on a local batch system and large-scale data processing on distributed resources (Grid)

Developed in the context of ATLAS and LHCb

Implemented in Python
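A minimal sketch of what such a Ganga session can look like is given below. It assumes a Ganga installation where Job, Executable, ArgSplitter and the Local and LCG backends are available in the interactive shell; the executable and arguments are placeholders, and exact class names may vary between Ganga releases.

# Sketch of a Ganga session (run inside the ganga shell, where Job,
# Executable, ArgSplitter, Local and LCG are predefined objects).
j = Job(name="toy-analysis")
j.application = Executable(exe="/bin/echo", args=["run"])

# One subjob per argument list; Ganga submits them and collects the outputs.
j.splitter = ArgSplitter(args=[["part-1"], ["part-2"]])

# Switching between a local test and the Grid only changes the backend:
j.backend = Local()      # quick test on the local machine
# j.backend = LCG()      # same job, sent to the LCG/gLite Grid

j.submit()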

25/27

GANGA schema

[Diagram: the user's analysis code (myAna.C) and a catalog query define the input; the data files are split into subjobs, which the job manager submits to the queues; the subjobs read their files from storage, and their outputs are collected and merged for the final analysis.]

26/27

PROOF schema

[Diagram: a PROOF query (data file list plus myAna.C) is sent to the master, which uses the catalog and scheduler to distribute the work; the workers read the files from storage, send feedback while running, and the final outputs come back merged.]
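For comparison with the Ganga picture, a PROOF query can be driven from PyROOT roughly as sketched below; the master address, file URLs and tree name are placeholders, and myAna.C is assumed to be written as a TSelector.

# Sketch: running myAna.C over a file list with PROOF (PyROOT).
# Master address, file URLs and tree name are illustrative placeholders.
import ROOT

proof = ROOT.TProof.Open("proof-master.example.org")   # connect to the master

chain = ROOT.TChain("events")                           # assumed tree name
chain.Add("root://xrootd.example.org//data/user/events_1.root")
chain.Add("root://xrootd.example.org//data/user/events_2.root")

chain.SetProof()            # route processing through the PROOF master
chain.Process("myAna.C+")   # workers run the TSelector; outputs come back merged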

27/27

HEPCG: summary

Physics departments: DESY, Dortmund, Dresden, Freiburg, GSI, München, Siegen, Wuppertal

Computer science: Dortmund, Dresden, Siegen, Wuppertal, ZIB, FH Köln, FH Niederrhein

D-Grid is Germany's contribution to HEP computing: dCache, monitoring, distributed analysis.

The effort will continue.

2008: start of LHC data taking, a challenge for the Grid concept

==> new tools and developments needed