Union of Light Ion Centres in Europe
Shared services and tools based on the Ganga job definition and management framework.
EGI User Forum 2011 – Vilnius, Lithuania.
U. Egede (Imperial College London), M. Kenyon (CERN), J. Moscicki (CERN), I.A. Dzhunov (University of Sofia), L. Kokoszkiewicz (CERN), M. Cinquilli (CERN), L. Sargsyan (CERN/Yerevan Physics Institute), E. Karavakis (CERN), J. Andreeva (CERN)
The ULICE project is co-funded by the European Commission under FP7 Grant Agreement Number 228436.
Overview
• The EGI Introductory Package.
  – Tools that were initially developed by and for specific user communities.
• Tool adoption case studies.
  – Grid job submission & management.
  – Error-reporting infrastructure for users.
The EGI Introductory Package
• “Simple, complete solution for running and monitoring computing tasks on the grid.”
  – Audience: small to medium-sized user communities.
• Comprises the following components.
  – Ganga: user interface for job submission and management.
  – DIANE: automatic control and scheduling of Ganga jobs.
  – Mini-dashboard: monitoring of Ganga and/or DIANE tasks & error reporting.
Ganga overview
• Ganga project was started by ATLAS & LHCb in 2001.
  – Entirely written in Python from the outset.
  – Extensible framework; simple and well documented procedure.
• Underwent a major redesign in 2005.
  – Core codebase is stable, though new features are in active development.
  – Ownership of sub-packages is assigned.
• Release procedure.
  – Rotating release manager.
  – Rigorous testing prior to each release.
• Registered in the EGI application database.
Ganga overview
• Easy to use frontend for the configuration, execution and management of computational tasks.
  – Interactive Python prompt, script submission, GUI.
• Submission to a range of platforms in a consistent manner (localhost, PBS, LSF, SGE, Condor, gLite, ARC, Globus).
• Key concept: “The Job”.
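The idea of one job object submitted to interchangeable backends can be sketched in plain Python. This is an illustration only: the class names `Job`, `InProcessBackend` and `ThreadBackend` and the status strings are hypothetical stand-ins chosen for this sketch, not Ganga's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable


class InProcessBackend:
    """Hypothetical backend: runs the task directly in the current process."""

    def run(self, task, args):
        return task(*args)


class ThreadBackend:
    """Hypothetical backend: runs the task in a separate worker thread."""

    def run(self, task, args):
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(task, *args).result()


@dataclass
class Job:
    """A unit of work plus the backend that should execute it."""

    task: Callable[..., Any]
    args: tuple = ()
    backend: Any = field(default_factory=InProcessBackend)
    status: str = "new"
    result: Any = None

    def submit(self):
        # The same call works whichever backend is attached.
        self.status = "running"
        try:
            self.result = self.backend.run(self.task, self.args)
            self.status = "completed"
        except Exception:
            self.status = "failed"
        return self.status


def square(x):
    return x * x


# Same job definition, two different backends, identical user-facing call.
jobs = [Job(square, (7,), backend=b) for b in (InProcessBackend(), ThreadBackend())]
statuses = [job.submit() for job in jobs]
```

The point of the design is that the user describes the work once and swaps only the backend object; the submission call and the status bookkeeping stay the same.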
DIANE – Distributed Analysis Environment
• DIANE development started within CERN IT in 2000.
  – Registered in the EGI application database.
• Job execution control framework.
  – Automatic load balancing, scheduling and failure recovery.
• Utilises a pilot-job mechanism (also known as “late binding”).
  – Pilots controlled by Ganga.
  – Workload fed directly to pilots.
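Late binding can be illustrated with a small, sequential simulation: a task is bound to a pilot only at the moment the pilot asks for work, a failed task goes back into the queue, and a failing pilot is dropped. The function `run_with_pilots`, its parameters and the pilot names are assumptions made for this sketch; it does not reproduce DIANE's implementation.

```python
from collections import deque


def run_with_pilots(tasks, pilots, max_failures_per_pilot=1):
    """Feed queued tasks to pilots as they become free.

    tasks:  dict mapping task id -> payload
    pilots: dict mapping pilot name -> callable(payload) that may raise
    Returns (results, excluded_pilot_names).
    """
    queue = deque(tasks.items())
    results = {}
    failures = {name: 0 for name in pilots}
    active = deque(pilots)  # pilots still allowed to receive work

    while queue and active:
        name = active.popleft()             # next free pilot asks for work
        task_id, payload = queue.popleft()  # the binding happens only now
        try:
            results[task_id] = pilots[name](payload)
        except Exception:
            failures[name] += 1
            queue.append((task_id, payload))  # failure recovery: retry elsewhere
        if failures[name] < max_failures_per_pilot:
            active.append(name)             # healthy pilot rejoins the pool

    excluded = sorted(n for n in pilots if failures[n] >= max_failures_per_pilot)
    return results, excluded


def double(payload):
    return payload * 2


def broken(payload):
    raise RuntimeError("misconfigured node")


tasks = {i: i for i in range(6)}
pilots = {"good-a": double, "broken": broken, "good-b": double}
results, excluded = run_with_pilots(tasks, pilots)
```

In this run every task still completes, because the task taken by the broken pilot is re-queued and picked up by a healthy one.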
Mini-dashboard: task monitoring
• A service to monitor the state of submitted Ganga and/or DIANE tasks.
Mini-dashboard: task monitoring
• Based around the hBrowse framework.
  – HTML/Javascript client that consumes JSON data.
  – Highly configurable.
  – Plugin architecture (dynamic tables, user selection, field filters).
  – Support for Google charts.
Mini-dashboard: error reporting
• Support teams need the full picture to help users.
• Grid jobs are complex beasts.
  – Range of software versions.
  – Configuration settings.
  – Environment variables.
• Error reporting tool.
  – Repository for diagnostic data.
  – Traditional “expert-user” model.
  – Or, community support model.
Mini-dashboard: error-reporting tool
• Ganga function to POST a job error report to a central repository.
• Report available to community support team.
• Command history, detailed logs, environment...
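A minimal sketch of what such a report upload might involve, using only the Python standard library. The repository URL, the field names and the helper functions `build_error_report` and `make_post_request` are placeholders invented for illustration; the real Ganga function and report format may differ. The request is built but deliberately not sent.

```python
import json
import os
import platform
import urllib.request

# Placeholder address; the real repository URL is not given in the slides.
REPORT_URL = "https://reports.example.org/upload"


def build_error_report(job_id, log_text, history):
    """Collect diagnostic data a support team would typically ask for."""
    return {
        "job_id": job_id,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        # Only a small, explicit selection of environment variables.
        "environment": {k: os.environ[k] for k in ("PATH", "HOME") if k in os.environ},
        "command_history": list(history),
        "log_tail": log_text.splitlines()[-50:],  # last 50 log lines
    }


def make_post_request(report, url=REPORT_URL):
    """Wrap the report in an HTTP POST request with a JSON body."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


report = build_error_report(
    job_id=42,
    log_text="starting job\nTraceback: boom",
    history=["j = Job()", "j.submit()"],
)
request = make_post_request(report)
# urllib.request.urlopen(request) would perform the upload; omitted here
# because the URL above is only a placeholder.
```

Bundling everything into one upload means the support team receives the command history, logs and environment together, rather than collecting them over several emails.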
Case study 1: Geant4 code validation
Community case study: Geant4
• Geant4 is a toolkit for simulating the trajectories of particles through matter.
• Finds application within HEP, nuclear and accelerator physics, medical and space science.
• Vast, object-oriented suite, with over 600,000 lines of complex source code.
  – Developers are distributed around the world.
Geant4: Code validation
• Statistical regression tests compare simulation results between the previous and pending releases.
  – Wide range of (physics) input parameters are used.
  – 1000 independent tasks generated, each one simulating 5000 events.
  – Total CPU time required per release = a few CPU years.
• Intensive validation period prior to release (every 6 months).
  – Quick succession of candidate releases.
  – Tests partially or wholly repeated multiple times.
• The Ganga/DIANE framework has been used to test releases since June 2007.
• Validation metrics: performance of code (time/event) and stability (application crashes).
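One way to picture a statistical regression test is a two-sample comparison of an observable produced by two releases. The slides do not say which test Geant4 uses; the two-sample Kolmogorov-Smirnov statistic below is swapped in purely as a familiar example, and the function names and the 5% significance level are assumptions of this sketch.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def distributions_differ(sample_a, sample_b, alpha=0.05):
    """Large-sample KS decision: True if the samples look incompatible."""
    n, m = len(sample_a), len(sample_b)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Toy observable values standing in for results from two releases.
previous_release = [float(v) for v in range(100)]
unchanged_candidate = list(previous_release)
shifted_candidate = [v + 1000.0 for v in previous_release]

same = distributions_differ(previous_release, unchanged_candidate)
changed = distributions_differ(previous_release, shifted_candidate)
```

Each of the 1000 tasks would contribute results for one set of input parameters; a comparison of this kind is then repeated per observable and per parameter set.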
Geant4: Code validation
• Last validation Dec. 2010; Geant4 v9.4 (report)
  – Ganga/DIANE used to run code validation on 18 grid sites.
  – 80 million events produced in 2-3 days (cf. 1 week for previous releases).
  – Exposed rare Geant4 software crashes (1 crash per 10,000 generated events).
Geant4: Value added by EGI IP
• Geant4 validation team observations.
  – New DIANE/Ganga infrastructure [...] allows for robust and reliable logging and monitoring of Ganga jobs.
  – DIANE’s error detection allows the automatic exclusion of misconfigured nodes (“the main problem in previous validation periods”).
  – Grid experience substantially improved since last campaign; increase of 10x in number of jobs executed for the same allocated resources.
• “GRID usage has improved substantially... mainly due to the improved software and to the increased ability to monitor and debug jobs.”
Case study 2: OpenGate Virtual Laboratory Environment
Community case study: OpenGate
• GATE software; nuclear medicine simulations for imaging.
  – Based on Geant4 toolkit.
  – Simulate particle tracks through matter.
    • Initial set of properties (type, location, direction).
    • Large number of particles (simulations).
• Typical simulations take hours to days.
• Tasks well suited to being parallelized.
  – Split simulation into sub-simulations (static partitioning).
  – But: all MC sub-tasks must complete to get the final result.
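The drawback of static partitioning can be shown with a few lines of arithmetic: the run finishes only when the slowest sub-simulation does. The helper names and the example speeds (events per hour) are illustrative assumptions, not measurements from the OpenGate study.

```python
def static_partition(total_events, n_tasks):
    """Split the events as evenly as possible across a fixed number of tasks."""
    base, extra = divmod(total_events, n_tasks)
    return [base + (1 if i < extra else 0) for i in range(n_tasks)]


def makespan(shares, speeds):
    """Hours until the last task finishes; speeds are in events per hour."""
    return max(share / speed for share, speed in zip(shares, speeds))


shares = static_partition(1000, 4)  # four equal sub-simulations

# Hypothetical node speeds: one node is four times slower than the rest.
uniform_hours = makespan(shares, [100, 100, 100, 100])
skewed_hours = makespan(shares, [100, 100, 100, 25])
```

With equal shares, the single slow node stretches the whole run from 2.5 to 10 hours in this toy case, and a failed sub-task would leave the final result incomplete.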
Community case study: OpenGate
• OpenGate collaboration developed a dynamic splitting method for GATE.
  – Camarasu-Pop et al. J Grid Computing (2010) 8:241–259
Workflow diagram labels:
• Monitors the number of simulated events; generates new tasks if required; passes them to DIANE.
• Task processing engine; pilot jobs on grid nodes; STDOUT and STDERR returned periodically.
Community case study: OpenGate
• N tasks are dispatched by DIANE to the Grid.
  – Each task will continue to run until the desired number of particles (summed across all tasks) is reached.
• On average, a grid node will simulate a fraction (1/N) of the total number of particles.
• DIANE performs a periodic check of total number of particles simulated.
• Run is terminated when all particles have been simulated.
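The stopping rule described above can be sketched as a small simulation: every task keeps producing events, a periodic check sums the totals, and the run ends once the target is reached. The function `run_dynamic`, the check interval and the node speeds are assumptions for illustration; this is not the published dynamic splitting algorithm itself.

```python
def run_dynamic(target_events, speeds, check_interval=1.0):
    """Simulate tasks that run until the summed event count reaches the target.

    speeds: events per hour for each task (hypothetical values).
    Returns (elapsed_hours, events_done_per_task).
    """
    if not speeds or sum(speeds) <= 0:
        raise ValueError("at least one task must make progress")
    done = [0.0] * len(speeds)
    elapsed = 0.0
    # Periodic check of the total, as performed by the master.
    while sum(done) < target_events:
        elapsed += check_interval
        done = [d + s * check_interval for d, s in zip(done, speeds)]
    return elapsed, done


# Four tasks, one of them four times slower than the others.
elapsed_hours, per_task = run_dynamic(1000, [100, 100, 100, 25])
```

Here the fast tasks absorb the work the slow one cannot do, so the run ends at the first check after the target is met. The overshoot beyond the target depends on how often the total is checked.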
OpenGate: Value added by EGI IP
• Makespan reduced for simulating 20 × 10^6 events.
  – 8.5 hours on dual-core PC (2008).
  – ~24 hours on the grid with 100 classical jobs. 78% success rate.
  – 1.75 hours on the grid with 100 DIANE worker agents. 100% of simulation complete.
• In period July 2009 – Aug 2010:
  – 360 DIANE RunMaster instances were activated in the backbone, handling 58,000 worker agent jobs.
• Generic solution; any application managed by MOTEUR can now interface with DIANE.
• Solution has entered production environment for non-clinical radiation therapy researchers at Creatis.
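The quoted figures can be turned into ratios with a few lines of arithmetic. Only the numbers from the slide are used; the dictionary keys and variable names are labels chosen for this sketch, and the "~24 hours" figure is treated as exactly 24.

```python
# Makespans quoted on the slide, in hours.
makespans_hours = {
    "dual-core PC (2008)": 8.5,
    "grid, 100 classical jobs": 24.0,  # quoted as approximately 24 hours
    "grid, 100 DIANE worker agents": 1.75,
}

reference = makespans_hours["grid, 100 DIANE worker agents"]

# How many times longer each setup took than the DIANE run.
slowdown_vs_diane = {
    name: hours / reference for name, hours in makespans_hours.items()
}

# Average worker agent jobs handled per RunMaster instance, July 2009 to Aug 2010.
jobs_per_runmaster = 58_000 / 360
```

This gives roughly 4.9 times the DIANE makespan for the dual-core PC and about 13.7 times for the classical grid jobs, with an average of about 161 worker agent jobs per RunMaster. The classical run is not a like-for-like comparison, since only 78% of its jobs succeeded.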
Case study 3: CMS Error Reporting Tool
CMS error reporting tool
• Users perform analysis via CMS Remote Analysis Builder (CRAB).
• Jobs running on heterogeneous services.
  – CMS services/middleware/batch systems.
  – Differing use cases.
• Dedicated analysis operation debugging team.
  – Proactively find and debug problems.
  – Handle support requests from CMS user community mailing list.
  – 3,500 users from 40 countries across many time zones.
CMS error reporting tool
• In order to optimise the support procedure, CMS adopted the error-reporting tool.
• A CRAB plugin was developed that uploads debugging information to a central repository.
CMS error-reporting: early feedback
• Positive; the system works well, and is considered very useful.
• Streamlines a user-support request.
  – Reduces email traffic to support lists.
• Has allowed CMS to centralise and formalise their user-support mechanism.
Summary
• EGI Introductory Package has been adopted and adapted by a wide range of user communities.
• Based around mature, stable and well documented tools.
• Startup overhead is surprisingly low (see demo later today).
• DIANE allows researchers to use resources more efficiently than direct job submission alone.
• Mini-dashboard provides customisable tools for monitoring jobs and optimising user-support.
Further information
• EGI introductory package
  – https://twiki.cern.ch/twiki/bin/view/ArdaGrid/EGIIntroductoryPackage
• Ganga and DIANE are registered in the EGI application database, and are part of the EGEE Respect tool suite.
• Demonstration: Lambda room 4pm today.