Grid Testbed Activities in US-CMS
Rick Cavanaugh
University of Florida
1. Infrastructure
2. Highlights of Current Activities
3. Future Directions
NSF/DOE Review
LBNL, Berkeley
14 January, 2003
US-CMS Development Grid Testbed

– Fermilab: 1+5 PIII dual 0.700 GHz processor machines
– Caltech: 1+3 AMD dual 1.6 GHz processor machines
– San Diego: 1+3 PIV single 1.7 GHz processor machines
– Florida: 1+5 PIII dual 1 GHz processor machines
– Wisconsin: 5 PIII single 1 GHz processor machines
Total: ~41 1 GHz dedicated processors

Operating System: Red Hat 6 – required for Objectivity

[Figure: map of the five Development Grid Testbed sites – Fermilab, Caltech, UCSD, Florida, Wisconsin]
US-CMS Integration Grid Testbed

– Fermilab: 40 PIII dual 0.750 GHz processor machines
– Caltech: 20 dual 0.800 GHz processor machines, 20 dual 2.4 GHz processor machines
– San Diego: 20 dual 0.800 GHz processor machines, 20 dual 2.4 GHz processor machines
– Florida: 40 PIII dual 1 GHz processor machines
– CERN (LCG site): 72 dual 2.4 GHz processor machines
Total: 240 0.85 GHz processors (Red Hat 6) and 152 2.4 GHz processors (Red Hat 7)

[Figure: map of the five Integration Grid Testbed sites – Fermilab, Caltech, UCSD, Florida, CERN]
DGT Participation by other CMS Institutes Encouraged!

• Expression of interest from: MIT, Rice, Minnesota, Belgium, Brazil, South Korea

[Figure: map of current DGT sites (Fermilab, Caltech, UCSD, Florida, Wisconsin) and prospective participants]
Grid Middleware

Testbed based on the Virtual Data Toolkit 1.1.3
– VDT Client:
  > Globus Toolkit 2.0
  > Condor-G 6.4.3
– VDT Server:
  > Globus Toolkit 2.0
  > mkgridmap
  > Condor 6.4.3
  > ftsh
  > GDMP 3.0.7

Virtual Organisation Management
– LDAP server deployed at Fermilab; contains the DNs of all US-CMS Grid users
– GroupMAN (from PPDG and adapted from EDG) used to manage the VO
– Investigating/evaluating the use of VOMS from the EDG
– Use D.O.E. Science Grid certificates
– Accept EDG and Globus certificates (the mapping onto local accounts is sketched below)
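The end product of this VO management chain on each gatekeeper is a standard Globus grid-mapfile that maps certificate subject DNs onto local accounts; mkgridmap/GroupMAN regenerate it from the DNs held in the VO LDAP server. A minimal sketch of such entries (the DNs and account names below are hypothetical, for illustration only):

  "/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 123456" uscms01
  "/O=Grid/O=CERN/OU=cern.ch/CN=John Doe" uscms02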
Non-VDT Software Distribution

DAR (can be installed "on the fly"):
– CMKIN
– CMSIM
– ORCA/COBRA
  > Represents a crucial step forward in CMS distributed computing!

Working to deploy US-CMS Pacman caches for (see the sketch below):
– CMS software (DAR, etc.)
– All other non-VDT software required for the Testbed
– GAE/CAIGEE (Clarens, etc.), GroupMAN, etc.
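Pacman lets a site pull a package, with its dependencies, from a named cache in a single command. A sketch of the intended usage, assuming hypothetical cache and package names (the actual US-CMS cache and package names may differ):

  pacman -get USCMS:DAR-ORCA
  pacman -get USCMS:Clarens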
Monitoring and Information Services

MonaLisa (Caltech)
– Currently deployed on the Testbed
– Dynamic information/resource discovery mechanism using agents
– Implemented in:
  > Java / Jini, with interfaces to SNMP, MDS, Ganglia, and Hawkeye
  > WSDL / SOAP with UDDI
– Aim to incorporate into a "Grid Control Room" service for the Testbed
Other Monitoring and Information Services

Information Service and Configuration Monitoring: MDS (Globus)
– Currently deployed on the Testbed in a hierarchical fashion (queried as sketched below)
– Aim to deploy the GLUE Schema when released by iVDGL/DataTAG
– Developing APIs to and from MonaLisa
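Since MDS is LDAP underneath, the hierarchical GIIS/GRIS servers can be queried with the standard Globus client tool. A minimal sketch with a hypothetical host name and VO name (the exact information tree on the testbed may differ):

  grid-info-search -x -h giis.uscms.example.org -p 2135 \
      -b "mds-vo-name=uscms, o=grid" "(objectclass=*)"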
Health Monitoring: Hawkeye (Condor)
– Leverages the ClassAd system for collecting dynamic information on large pools (see the sketch below)
– Will soon incorporate Heart Beat Monitoring of Grid Services
– Currently deployed at Wisconsin and Florida
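In the ClassAd model each resource periodically advertises itself as a set of attribute = expression pairs, which monitoring and matchmaking then query. A schematic ClassAd (host name and values invented for illustration):

  MyType  = "Machine"
  Name    = "node01.grid.example.edu"
  Arch    = "INTEL"
  OpSys   = "LINUX"
  Memory  = 512
  LoadAvg = 0.25

Hawkeye attaches the output of site-defined probe modules as further attributes of the same ad, so "health" questions can be asked with ordinary ClassAd expressions such as (Memory >= 256) && (LoadAvg < 1.0).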
Existing US-CMS Grid Testbed: Client-Server Scheme

[Figure series, built up over several slides: the user works through a VDT Client that talks to VDT Servers, with a monitoring layer watching both.
– VDT Client: MOP (driven by mop_submitter), a Virtual Data System (Virtual Data Catalogue, Abstract Planner, Concrete Planner) and an Executor (DAGMan / Condor-G / Globus).
– VDT Server: Compute Resource (Globus GRAM / Condor pool), local Grid Storage, Reliable Transfer (ftsh-wrapped GridFTP) and Replica Management (GDMP plus Replica Catalogue).
– Monitoring: Performance (MonaLisa), Information & Configuration (MDS), Health (Hawkeye).
The final slide in the series shows the analogous analysis scheme: a ROOT/Clarens data-analysis client using a Clarens-based server for data movement of ROOT files, backed by a storage resource and a relational database, with MonaLisa performance monitoring.]
Commissioning the Development Grid Testbed with "Real Production"

MOP (from PPDG) interfaces the following into a complete prototype:
– IMPALA/MCRunJob CMS production scripts
– Condor-G / DAGMan
– GridFTP
– (mop_submitter is generic)

Using MOP to "commission" the Testbed
– Require large-scale, production-quality results!
  > Run until the Testbed "breaks"
  > Fix the Testbed with middleware patches
  > Repeat the procedure until the entire production run finishes!
– Discovered/fixed many fundamental grid software problems in Globus and Condor-G (close cooperation with Condor/Wisconsin)
  > A huge success from this point of view alone
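For a feel of the machinery involved: the jobs MOP hands to the Grid are ordinary Condor-G submissions chained together by DAGMan. A schematic sketch (host names, scripts and file names are hypothetical; the real descriptions are generated by mop_submitter and MCRunJob):

  # cmsim.sub -- Condor-G submit description for one production stage
  universe        = globus
  globusscheduler = gatekeeper.site.example.org/jobmanager-condor
  executable      = run_cmsim.sh
  arguments       = run0001
  output          = run0001.out
  error           = run0001.err
  log             = run0001.log
  queue

  # production.dag -- DAGMan file ordering the stages
  JOB cmkin    cmkin.sub
  JOB cmsim    cmsim.sub
  JOB stageout stageout.sub
  PARENT cmkin CHILD cmsim
  PARENT cmsim CHILD stageout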
[Figure: MOP architecture. MCRunJob feeds mop_submitter (Master, Linker, ScriptGen, Config, Req., Self Desc.) on the VDT Client, which submits via DAGMan/Condor-G and Globus to VDT Servers 1..N, each running Condor and GridFTP; files move between sites with GridFTP.]
Integration Grid Testbed Success Story

Production Run Status for the IGT MOP Production
– Assigned 1.5 million events for "eGamma Bigjets"
  > ~500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple
– 2 months of continuous running across 5 testbed sites

Demonstrated at Supercomputing 2002
1.5 Million Events Produced! (nearly 30 CPU years)
Interoperability work with EDG/DataTAG

• MOP Worker Site Configuration File for Padova (WorldGrid):
(1-1) Stage-in/out jobmanager: grid015.pd.infn.it/jobmanager-fork (SE) or grid011.pd.infn.it/jobmanager-lsf-datatag (CE)
(1-2) GLOBUS_LOCATION = /opt/globus
(1-3) Shared directory for MOP files: /shared/cms/MOP (on the SE and NFS-exported to the CE)
(2-1) Run jobmanager: grid011.pd.infn.it/jobmanager-lsf-datatag
(2-2) Location of CMS DAR installation: /shared/cms/MOP/DAR
(3-1) GDMP install directory = /opt/edg
(3-2) GDMP flat file directory = /shared/cms
(3-3) GDMP Objectivity file directory (not needed for CMSIM production)
(4-1) GDMP job manager: grid015.pd.infn.it/jobmanager-fork

• MOP jobs successfully sent from a U.S. VDT WorldGrid site to the Padova EDG site
• EU CMS production jobs successfully sent from an EDG site to a U.S. VDT WorldGrid site
• ATLAS Grappa jobs successfully sent from the US to an EU Resource Broker and run on a US-CMS VDT WorldGrid site
Chimera: The GriPhyN Virtual Data System

[Figure: Chimera data flow. VDL descriptions are stored in the Virtual Data Catalogue (VDC); the Abstract Planner produces a logical DAG in XML (the DAX); the Concrete Planner consults the Replica Catalogue (RC) and produces a physical DAG for DAGMan.]
Chimera currently provides the following prototypes:
– Virtual Data Language (VDL)
  > describes virtual data products
– Virtual Data Catalogue (VDC)
  > used to store VDL
– Abstract Job Flow Planner
  > creates a logical DAG (in XML) called a DAX
– Concrete Job Flow Planner
  > interfaces with a Replica Catalogue
  > provides a physical DAG submission file to Condor-G/DAGMan

Generic and flexible: multiple ways to use Chimera
– as a toolkit and/or a framework
– in a Grid environment or just locally
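As a toy illustration of what the abstract-to-concrete planning step amounts to (plain Python for exposition only, not Chimera code; all names and URLs are invented), the sketch below resolves the logical file names of an abstract job against a replica catalogue and produces a site-specific physical description:

  # Toy sketch of abstract -> concrete planning; illustration only, not Chimera code.

  # Replica catalogue: logical file name -> known physical replicas.
  replica_catalogue = {
      "run0001.kine": ["gsiftp://se.ufl.example.org/store/run0001.kine"],
  }

  # Abstract job from the "abstract planner": logical names only, no sites.
  abstract_job = {
      "transformation": "cmsim",
      "inputs":  ["run0001.kine"],
      "outputs": ["run0001.fz"],
  }

  def make_concrete(job, catalogue, site):
      """Bind an abstract job to one site by resolving logical -> physical files."""
      concrete = {"executable": job["transformation"], "site": site,
                  "stage_in": [], "stage_out": []}
      for lfn in job["inputs"]:
          replicas = catalogue.get(lfn, [])
          if not replicas:
              raise LookupError("no replica of %s: it must be derived first" % lfn)
          concrete["stage_in"].append(replicas[0])   # any existing replica will do
      for lfn in job["outputs"]:
          # Decide where the output will live; it is registered after the job runs.
          concrete["stage_out"].append("gsiftp://%s/store/%s" % (site, lfn))
      return concrete

  print(make_concrete(abstract_job, replica_catalogue, "se.caltech.example.org"))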
Direction of US-CMS Chimera Work

Monte Carlo Production Integration
– RefDB/MCRunJob
– Already able to perform all production steps
– "Chimera Regional Centre"
  > for quality-assurance and scalability testing
  > to be used with low-priority actual production assignments

User Analysis Integration
– GAE/CAIGEE work (Web Services, Clarens)
– Other generic data analysis packages

Two equal motivations:
– test a generic product which CMS (and ATLAS, etc.) will find useful!
– experiment with Virtual Data and Data Provenance: CMS is an excellent use case!

Encouraging and inviting more CMS input
– Ensure that the Chimera effort fits within CMS efforts and solves real (current and future) CMS needs!
[Figure: the CMS production/analysis chain as a virtual-data use case – Generator -> Simulator -> Formator -> Reconstructor -> ESD -> AOD -> Analysis, with production and analysis parameters, executables and data tracked at each step.]
Building a Grid-enabled Physics Analysis Desktop

Many promising alternatives: currently in the process of prototyping and choosing.
[Figure (from Koen Holtman and Conrad Steenberg; see Julian Bunn's talk): candidate analysis architectures. Physics queries from a web browser or local analysis tool (PAW/ROOT/...) flow through query and data-extraction web services and a Clarens-based plugin module to ORCA analysis farms (or a distributed "farm" using grid queues), PIAF/Proof-type analysis farms and RDBMS-based data warehouses, fed by the production system and data repositories; TAG and AOD extraction/conversion/transport is handled by Clarens.]
Data Processing Tools
– interactive visualisation and data analysis (ROOT, etc.)
Data Catalog Browser
– allows a physicist to find collections of data at the object level
Data Mover
– embedded window allowing a physicist to customise data movement
Network Performance Monitor
– allows a physicist to optimise data movement by dynamically monitoring network conditions
Computation Resource Browser, Selector and Monitor
– allows a physicist to view available resources (primarily for development stages of the Grid)
Storage Resource Browser
– enables a physicist to ensure that enough disk space is available
Log Browser
– enables a physicist to get direct feedback from jobs indicating success/failure, etc.
How CAIGEE plans to use the Testbed

[Figure: web clients talk to a Grid Services Web Server, which hosts a Catalog, an Execution Priority Manager, Abstract and Concrete Planners, a Virtual Data Catalogue, a Materialised Data Catalogue, GDMP, a Grid-Wide Execution Service and Grid-process Monitoring.]
Based on a client-server scheme
– one or more inter-communicating servers
– a small set of clients logically associated with each server

Scalable tiered architecture:
– servers can delegate execution to another server (same or higher level) on the Grid

Servers offer "web-based services" (see the sketch below)
– ability to dynamically add or improve them
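Because the services are exposed over ordinary web protocols (XML-RPC/SOAP via Clarens), a thin client needs nothing beyond the standard library to talk to a server. A minimal sketch with a hypothetical server URL, using only generic XML-RPC introspection and omitting Clarens' certificate-based authentication (actual Clarens method names may differ):

  # Minimal sketch of a thin client for a Clarens-style XML-RPC service.
  # (In the Python of the day this module was called xmlrpclib.)
  from xmlrpc.client import ServerProxy

  server = ServerProxy("http://clarens.example.edu:8080/clarens/")

  # Standard XML-RPC introspection: ask which methods the server exposes,
  # assuming the server supports it.
  print(server.system.listMethods())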
High Speed Data Transport
R&D work from Caltech, SLAC and DataTAG on data transport is approaching ~1 Gbit/sec per GbE port over long distance networks
Expect to deploy (including disk to disk) on the US-CMS Testbed in 4-6 months
Anticipate progressing from 10 to 100 MByte/sec and eventually 1 GByte/sec over long distance networks (RTT=60 msec across the US)
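For scale: 1 Gbit/sec is ~125 MByte/sec, so over a 60 msec round trip the bandwidth-delay product is roughly 125 MByte/sec x 0.06 sec ≈ 7.5 MByte that must be "in flight" per stream; sustaining such rates therefore depends on large TCP windows and/or parallel streams, which is the kind of tuning this transport R&D addresses.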
Future R&D Directions

– Workflow generator/planning (DISPRO)
– Grid-wide scheduling
– Strengthen monitoring infrastructure
– VO policy definition and enforcement
– Data analysis framework (CAIGEE)
– Data derivation and data provenance (Chimera)
– Peer-to-peer collaborative environments
– High speed data transport
– Operations (what does it mean to operate a Grid?)
– Interoperability tests between E.U. and U.S. solutions
Conclusions

US-CMS Grid activities are reaching a healthy "critical mass" in several areas:
– Testbed infrastructure (VDT, VO, monitoring, etc.)
– MOP has been (and continues to be) enormously successful
– US/EU interoperability is beginning to be tested
– Virtual Data is beginning to be seriously implemented/explored
– Data analysis efforts are rapidly progressing and being prototyped

Interaction with computer scientists has been excellent!

Much of this work is being done in preparation for the LCG milestone of a 24x7 production Grid.

We have a lot of work to do, but we feel we are making excellent progress and we are learning a lot!