The SAM-Grid and the use of Condor-G as a grid job management middleware
Gabriele Garzoglio for the SAM-Grid Team
Fermilab, Computing Division
Apr 16, 2004 Gabriele Garzoglio
Overview
Computation in High Energy Physics
The SAM-Grid computing infrastructure
The Job Management and Condor-G
Real life experience
Future work
High Energy Physics Challenges
High Energy Physics studies the fundamental interactions of Nature.
A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field: the collaborations are geographically distributed.
Experiments become more challenging and expensive every decade: the collaborations are large groups of people.
The phenomena studied are statistical in nature and the events of interest are very rare: a lot of data (statistics) is needed.
A HEP laboratory: Fermilab
FNAL Run II detectors
FNAL Run II detectors: DZero
The Size of the D0 Collaboration
~500 physicists
72 institutions
18 countries
DZero and CDF Institutions
Data size for the D0 Experiment
Detector data
  1,000,000 channels
  Event size: 250 KB
  Event rate: ~50 Hz
  On-line data rate: 12 MBps
  100 TB/year
Total data (detector, reconstructed, simulated)
  400 TB/year
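The detector figures above can be cross-checked with a short back-of-envelope calculation. The sketch below multiplies the quoted event size by the quoted event rate and then asks how much data-taking time per year would produce the quoted yearly volume. Decimal units and a constant event rate are assumptions made here, not statements from the slides, so the "live fraction" is only an implied estimate.

```python
# Back-of-envelope check of the detector data figures quoted above.
EVENT_SIZE_KB = 250       # event size, from the slide
EVENT_RATE_HZ = 50        # approximate event rate, from the slide
YEARLY_VOLUME_TB = 100    # detector data per year, from the slide

# Assumption: decimal units (1 MB = 1000 KB, 1 TB = 1e6 MB).
rate_mb_per_s = EVENT_SIZE_KB * EVENT_RATE_HZ / 1000

# Assumption: the rate is constant whenever the detector is taking data.
seconds_of_data_taking = YEARLY_VOLUME_TB * 1e6 / rate_mb_per_s
live_fraction = seconds_of_data_taking / (365 * 24 * 3600)

print(f"on-line rate: {rate_mb_per_s:.1f} MB/s")
print(f"implied data-taking time: {seconds_of_data_taking / 86400:.0f} days/year")
print(f"implied live fraction: {live_fraction:.0%}")
```

The computed rate of 12.5 MB/s is consistent with the quoted on-line data rate of about 12 MBps.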
Typical DZero activities
Activity        Description      Community  Load       Time/job
Reconstruction  data filtering   small      CPU & I/O  10 hours
Montecarlo      data simulation  small      CPU        10 hours
Analysis        data mining      large      CPU & I/O  hours

Activity        Input/Job  Output/Job  Input/Year  Output/Year
Reconstruction  GB         GB          100s TB     100s TB
Montecarlo      None       10 GB       None        TB
Analysis        100 GB     GB          varies      varies
The SAM-Grid Project
Mission: enable fully distributed computing for DZero and CDF
Strategy: enhance the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing (JIM)
History: SAM from 1997, JIM from end of 2001
Funds: the Particle Physics Data Grid (US) and GridPP (UK)
People: computer scientists and physicists from Fermilab and the collaborating institutions
Job Management: Requirements
Foster site autonomy
Operate in batch mode: submit and disconnect
Reliability: handle the job request persistently; execute it and retrieve output and/or errors
Flexible automatic resource selection: optimization of various metrics/policies
Fault tolerance: transient service disruption; automatic rematching and resubmitting capabilities
Automatic execution of complex interdependent job structures
Service Architecture
[Figure: SAM-Grid service architecture diagram, showing the flow of jobs, data, and meta-data. Labels shown: User Interface, Submission, Global Job Queue, Grid Client, Resource Selector, Match Making, Info Collector, Info Gatherer, Global DH Services, SAM Naming Server, SAM Log Server, Resource Optimizer, SAM DB Server, RC MetaData Catalog, Bookkeeping Service, Data Handling, SAM Stager(s), SAM Station (+ other services), MSS, Cache, Site, Grid Gateway, Local Job Handling, Local Job Handler (CAF, D0MC, BS, ...), JIM Advertise, Cluster, Worker Nodes, AAA, Dist. FS, Info Manager, XML DB server, Site Conf., Glob/Loc JID map, Info Providers, MDS, Web Serv, Grid Monitoring, User Tools.]
Technological choices (2001)
Low level resource management: Globus GRAM. Clearly not enough...
Condor-G: right components and functionalities, but not enough in 2001...
DZero and the Condor Team have been collaborating since then, under the auspices of PPDG, to address the requirements of a large distributed system with distributively owned and shared resources.
Condor-G: added functionalities I
Use of the Condor Match Making Service (MMS) as Grid Resource Selector
  Advertisement of grid site capabilities to the MMS
  Dynamic $$(gatekeeper) selection for jobs specifying requirements on grid sites
Concurrent submission of multiple jobs to the same grid resource
  At any given moment, a grid site is capable of accepting up to N jobs
  The MMS was modified to push up to N jobs to the same site in the same negotiation cycle
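The idea of pushing up to N jobs to one site within a single negotiation cycle can be illustrated with a toy matcher. This is a minimal sketch and not the Condor negotiator: the `negotiate` function, the `max_jobs` field, and the site and job records are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

def negotiate(jobs, sites):
    """One toy negotiation cycle.

    Each job goes to the first site that satisfies its requirements and
    has accepted fewer than its `max_jobs` limit (the "N" above) so far
    in this cycle. Jobs that fit nowhere wait for the next cycle.
    """
    matched = defaultdict(list)
    unmatched = []
    for job in jobs:
        for site in sites:
            has_room = len(matched[site["name"]]) < site["max_jobs"]
            if has_room and job["requirements"](site):
                matched[site["name"]].append(job["id"])
                break
        else:
            unmatched.append(job["id"])
    return dict(matched), unmatched

# Hypothetical sites and jobs.
sites = [
    {"name": "site-a", "max_jobs": 2, "experiment": "d0"},
    {"name": "site-b", "max_jobs": 1, "experiment": "d0"},
]
jobs = [
    {"id": i, "requirements": lambda s: s["experiment"] == "d0"}
    for i in range(4)
]

matched, unmatched = negotiate(jobs, sites)
print(matched, unmatched)
```

Without the per-site limit, a matcher that hands out one job per site per cycle would need several cycles to fill a site that could have accepted all of them at once.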
Condor-G: added functionalities II
Flexible Match Making logic
  The job/resource match criteria should be arbitrarily complex (based on more information than fits in the classad), stateful (remembering match history), and “pluggable” (by administrators and users)
  Example: send the job where most of the data are. The MMS contacts the site data handling service to rank a job/site match
This leads to a very thin and flexible “grid broker”
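The "send the job where most of the data are" example can be sketched as a ranking function that is plugged into the matcher from outside. Everything named here is hypothetical: `cached_fraction` stands in for a query to a site's data handling service, and the `CACHE` table replaces the real service with fixed numbers so that the sketch is self-contained.

```python
# Stand-in for the site data handling services: the fraction of a
# dataset already cached at each station (hypothetical values).
CACHE = {
    ("station-a", "dataset-x"): 0.2,
    ("station-b", "dataset-x"): 0.9,
}

def cached_fraction(station, dataset):
    """Hypothetical query: how much of `dataset` is already at `station`?"""
    return CACHE.get((station, dataset), 0.0)

def rank_by_data_location(job, site):
    """External ranking logic: prefer sites that already hold the data."""
    return cached_fraction(site["station"], job["dataset"])

def best_site(job, sites, rank):
    """Filter sites by the job's requirements, then pick the top-ranked one."""
    candidates = [s for s in sites if job["requirements"](s)]
    if not candidates:
        return None
    return max(candidates, key=lambda s: rank(job, s))

sites = [
    {"name": "site-a", "station": "station-a", "experiment": "d0"},
    {"name": "site-b", "station": "station-b", "experiment": "d0"},
]
job = {"dataset": "dataset-x",
       "requirements": lambda s: s["experiment"] == "d0"}

chosen = best_site(job, sites, rank_by_data_location)
print(chosen["name"])
```

Because the ranking function is passed in as an argument, the broker itself stays thin: a different policy only requires a different `rank` callable.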
Condor-G: added functionalities III
Light clients
  A user should be able to submit a job from a laptop and turn it off
  Client software (condor_submit, etc.) and queuing service (condor_schedd) should be on different machines
This leads to a 3-tier architecture for Condor-G: client, queuing, execution sites. Security was implemented via X509.
Condor-G: added functionalities IV
Resubmission/Rematching logic
  If the MMS matched a job to a site that cannot accept it after N submission attempts, the job should be rematched to a different site
  Flexible penalization of already failed matches
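The resubmission and rematching behaviour can be sketched as a loop that retries a site up to N times and then lowers that site's rank before matching again. This is an illustrative sketch under assumptions, not the Condor-G implementation: the function name, the additive penalty, and the `max_matches` cut-off are all hypothetical choices.

```python
def submit_with_rematching(sites, try_submit, max_attempts=3,
                           penalty=2.0, max_matches=5):
    """Match, submit, and rematch on repeated failure.

    `sites` maps a site name to its base rank. After `max_attempts`
    failed submissions to the matched site, that site is penalized and
    the job is matched again. Returns (accepting_site_or_None, history).
    """
    penalties = dict.fromkeys(sites, 0.0)
    history = []
    for _ in range(max_matches):
        # Match: highest base rank after subtracting accumulated penalties.
        site = max(sites, key=lambda name: sites[name] - penalties[name])
        history.append(site)
        for _attempt in range(max_attempts):
            if try_submit(site):
                return site, history
        # All attempts failed: penalize this match and try another site.
        penalties[site] += penalty
    return None, history

# Hypothetical scenario: only site-c is currently accepting jobs.
sites = {"site-a": 2.0, "site-b": 1.5, "site-c": 1.0}
accepted_by, history = submit_with_rematching(
    sites, try_submit=lambda site: site == "site-c")
print(accepted_by, history)
```

A penalty that lowers the rank, rather than banning the site outright, lets a site that failed transiently become eligible again once other candidates have also been tried.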
[Figure: job management architecture diagram. Labels shown: JOB, User Interface, Submission Client, Job Management, Queuing System, Broker, Match Making Service, ext. logic, Information Collector, Grid Sensors, Execution Site #1 ... Execution Site #n, Computing Element, Storage Element, Data Handling System.]
Machine classad:
  MyType                 "Machine"
  TargetType             "Job"
  Name                   "ccin2p3-analysis.d0.prd.jobmanager-runjob"
  gatekeeper_url_        "ccd0.in2p3.fr:2119/jobmanager-runjob"
  DbURL                  "http://ccd0.in2p3.fr:7080/Xindice"
  sam_nameservice_       "IOR:000000000000002a49444c3........."
  station_name_          "ccin2p3-analysis"
  station_experiment_    "d0"
  station_universe_      "prd"
  cluster_architecture_  "Linux+2.4"
  cluster_name_          "LyonsGrid"
  local_storage_path_    "/samgrid/disk"
  local_storage_node_    "ccd0.in2p3.fr"
  schema_version_        "1_1"
  site_name_             "ccin2p3"
  ...

Job classad:
  MyType          "Job"
  TargetType      "Machine"
  ClusterId       304
  JobType         "montecarlo"
  GlobusResource  "$$(gatekeeper_url_)"
  Requirements    (TARGET.station_name_ == "ccin2p3-analysis" && ...)
  Rank            0.000000
  station_univ    "prd"
  station_ex      "d0"
  RequestId       "11866"
  ProjectId       "sam_ccd0_012457_25321_0"
  DbURL           "$$(DbURL)"
  cert_subject    "/DC=org/DC=doegrids/OU=People/CN=Aditya Nishandar ..."
  Env             "MATCH_RESOURCE_NAME=$$(name);\ SAM_STATION=$$(station_name_);\ SAM_USER_NAME=aditya;..."
  Args            "--requestId=11866" "--gridId=sam_ccd0_012457" ...
  ...

Job description:
  job_type = montecarlo
  station_name = ccin2p3-analysis
  runjob_requestid = 11866
  runjob_numevts = 10000
  d0_release_version = p14.05.01
  jobfiles_dataset = san_jobset2
  minbias_dataset = ccin2p3_minbias_dataset
  sam_experiment = d0
  sam_universe = prd
  group = test
  instances = 1
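In the job classad above, the `$$(...)` entries are placeholders that are filled in from the machine classad once a match has been made. The sketch below imitates that substitution with a regular expression. It is a simplified illustration and not Condor's implementation: `expand_match_macros` is a hypothetical helper, and the two dictionaries contain only a few of the attributes listed above.

```python
import re

MATCH_MACRO = re.compile(r"\$\$\((\w+)\)")

def expand_match_macros(job_ad, machine_ad):
    """Replace each $$(attr) in the job's string values with the value
    of `attr` from the matched machine ad.

    Attribute names are compared case-insensitively, as classad
    attribute names are. Unknown attributes are left untouched.
    """
    lookup = {key.lower(): str(value) for key, value in machine_ad.items()}

    def substitute(match):
        return lookup.get(match.group(1).lower(), match.group(0))

    return {
        key: MATCH_MACRO.sub(substitute, value) if isinstance(value, str) else value
        for key, value in job_ad.items()
    }

# A few attributes copied from the classads above.
machine_ad = {
    "Name": "ccin2p3-analysis.d0.prd.jobmanager-runjob",
    "gatekeeper_url_": "ccd0.in2p3.fr:2119/jobmanager-runjob",
    "station_name_": "ccin2p3-analysis",
}
job_ad = {
    "ClusterId": 304,
    "GlobusResource": "$$(gatekeeper_url_)",
    "Env": "MATCH_RESOURCE_NAME=$$(name);SAM_STATION=$$(station_name_)",
}

expanded = expand_match_macros(job_ad, machine_ad)
print(expanded["GlobusResource"])
print(expanded["Env"])
```

This is what lets the job leave its gatekeeper unspecified at submission time: the destination is only known after the matchmaker has picked a site.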
Montecarlo Production Statistics
Started at the beginning of 2004.
Ramped up in March.
3 sites: Wisconsin (...via Miron), Manchester, Lyon. New sites are joining (UTA, LU, OU, LTU, ...)
Inefficiency due to the Grid infrastructure << 5%
30 GB/week = 80,000 events/week (about 1/4 of total production)
Future work of DZero with Condor
Use of DAGMan to automate the management of interdependent grid job structures.
Address potential scalability limits.
Investigate non-central brokering service via grid flocking.
Integrate/Implement a proxy management infrastructure (e.g. MyProxy).
All the rest (...fix bugs, improve error reporting, hand holding, sailing...)
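An interdependent job structure is a directed acyclic graph in which a job may start only after the jobs it depends on have finished. The sketch below shows that ordering constraint with Python's standard `graphlib` module (Python 3.9 or later). It does not use DAGMan or its input format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job structure: each job maps to the set of jobs that
# must complete before it can start.
dependencies = {
    "generate": set(),
    "simulate_1": {"generate"},
    "simulate_2": {"generate"},
    "merge": {"simulate_1", "simulate_2"},
}

# A valid execution order: every job appears after all of its parents.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two simulation jobs do not depend on each other, so a scheduler is free to run them concurrently; only their relative position with respect to `generate` and `merge` is fixed.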
Conclusions
The collaboration between DZero and the Condor team has been very fruitful since 2001.
DZero has worked together with Condor to enhance the Condor-G framework, in order to address the requirements on distributed computing of a large HEP experiment.
DZero is running “production” jobs on the Grid.
Acknowledgments
Condor Team
PPDG
DZero
CDF
More info at…
http://www-d0.fnal.gov/computing/grid/
http://samgrid.fnal.gov:8080/
http://d0db.fnal.gov/sam/