ExperimentSupport
Introduction to HammerCloud for The LHCb Experiment
Dan van der Ster
CERN IT Experiment Support
3 June 2010
Outline
• Introduction to HammerCloud
  – Motivation, History, Use-Cases
• How HammerCloud works
  – Design and Implementation Details
• Interface Tour for Users and Admins
• Possibilities for an LHCb Plugin
HammerCloud Introduction for LHCb – 2
Introduction to HammerCloud
• HammerCloud (HC) is a Distributed Analysis testing system serving two use-cases:
  – Robot-like Functional Testing: frequent “ping” jobs to all sites to perform basic site validation
  – DA Stress Testing: on-demand large-scale stress tests using real analysis jobs to test one or many sites simultaneously to:
    • Help commission new sites
    • Evaluate changes to site infrastructure
    • Evaluate SW changes
    • Compare site performances…
HammerCloud and Job Robots
• HammerCloud is part of an evolution of job robots:
  – CMS Job Robot inspired the ATLAS GangaRobot (functional testing)
  – In ~Sept 2008, a form of the ATLAS GangaRobot was used to manually stress test the Italian ATLAS Tier-2s:
    • 5 users manually submitting hundreds of instrumented jobs simultaneously (SIMD)
    • Manual results collection and summarization
    • Early results were shown to be very useful:
      – One early test showed a bimodal performance plot that was later traced to a faulty network switch which degraded the performance of some worker nodes (WNs). The need for an automated DA stress testing system was clear.
  – HammerCloud was born in November 2008 to deliver on-demand stress tests to ATLAS sites:
    • Since then HC has run >1300 “Tests” using more than 4 million jobs.
    • ATLAS has invested >200k CPU-days in HC tests
  – CMS has also agreed to use HC: in April a prototype was delivered, and now scale tests are about to begin.
HC and ATLAS during STEP’09
[Figure: plot labelled “STEP’09”]
HammerCloud Use-Cases
• Provides On-Demand and Automated Testing
• HC Operators define test templates: FUNCTIONAL and STRESS
• Functional Tests are automatically scheduled
  – Results are published on the HC website and can be pushed to other systems (e.g. SAM)
• Stress tests are generally scheduled on demand as needed by:
  – Central VO managers
  – Cloud/Regional managers
  – Site managers
• For all tests, a detailed report summarizing the job success rates and performance is produced.
HammerCloud Components
• The HC UI is implemented as a Django web app:
  – View test results
  – View cloud/site evolution
  – DB Admin
• State is maintained in a MySQL DB
• HC Logic (job submission, monitoring, resubmission) is implemented on top of the Ganga Grid Programming Interface (GPI)
HammerCloud Logic
• An HC Test is described by:
  – The analysis code to run (typically a real analysis from the user community)
  – The dataset pattern (which can be resolved to a set of datasets appropriate for the analysis code)
  – The list of sites to be tested, and the target number of jobs to run concurrently per site
  – A start time and an end time
• Test execution proceeds in 4 steps:
  – Generate: the test description is converted to a set of submittable jobs (e.g. Ganga job objects, one for each site under test)
  – Submit: the job objects are submitted
  – Run: jobs are monitored, outputs are recorded to the HC DB, and jobs are resubmitted to achieve the target number of running jobs per site
  – Exit: at the test end time, leftover jobs are killed
• Concurrently, the HC Web shows real-time test results
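The four execution steps can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names HCTest, Job, generate, submit and run_test are hypothetical, job outcomes are drawn at random instead of coming from a grid, and a Python list stands in for the HC DB. It does not use the Ganga GPI and is not the actual HammerCloud code.

```python
import random
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass
class HCTest:
    """Hypothetical container for the four parts of a test description."""
    analysis_code: str    # name of the analysis to run
    dataset_pattern: str  # pattern that would be resolved to datasets
    sites: dict           # site name -> target number of concurrent jobs
    start: datetime
    end: datetime


@dataclass
class Job:
    site: str
    status: str = "new"   # new, running, completed, failed or killed


def generate(test):
    """Generate: turn the description into one job template per site."""
    return {site: Job(site) for site in test.sites}


def submit(template, count):
    """Submit: stand-in for real submission; copies start out as running."""
    return [replace(template, status="running") for _ in range(count)]


def run_test(test, tick=timedelta(minutes=10), seed=0):
    rng = random.Random(seed)  # simulated job outcomes, not real grid data
    templates = generate(test)
    jobs = []
    for site, target in test.sites.items():
        jobs += submit(templates[site], target)

    recorded = []              # stand-in for rows written to the HC DB
    now = test.start
    while now < test.end:
        # Run: record the jobs that finished during this tick.
        for job in jobs:
            if job.status == "running" and rng.random() < 0.3:
                job.status = rng.choice(["completed", "completed", "failed"])
                recorded.append((job.site, job.status))
        # Run: resubmit to get back to the per-site target.
        for site, target in test.sites.items():
            running = sum(j.site == site and j.status == "running" for j in jobs)
            jobs += submit(templates[site], target - running)
        now += tick

    # Exit: kill whatever is still running at the end time.
    leftover = [j for j in jobs if j.status == "running"]
    for job in leftover:
        job.status = "killed"
    return recorded, len(leftover)


test = HCTest("example_analysis", "example.dataset.*",
              {"SITE_A": 3, "SITE_B": 2},
              datetime(2010, 6, 3, 9, 0), datetime(2010, 6, 3, 11, 0))
recorded, killed = run_test(test)
print(len(recorded), "jobs finished;", killed, "killed at exit")
```

Because the loop tops each site back up to its target after every tick, the number of jobs killed at exit equals the sum of the per-site targets in this simulation.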
An HC-LHCb Plugin
• What customizations would be needed for an HC-LHCb plugin?
• HC is built upon Ganga and exploits its job management features:
  – job repository, job configuration via Python, job submission, job monitoring in background thread(s)
• Given the existing GangaLHCb plugins, modifications to HC itself would be relatively minor, e.g.
  – HC Test Generation:
    • Query a data discovery service to form a job processing random input data
  – HC Test Running:
    • Changes to extract LHCb-specific job metrics from Ganga
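One way to picture the two customization points is a small plugin interface with one hook for test generation and one for metric extraction. The sketch below is hypothetical throughout: ExperimentPlugin, LHCbPlugin, choose_input, extract_metrics and the discover callable are illustrative names, the metric keys are made up, and no real GangaLHCb or data discovery API is called.

```python
import random
from abc import ABC, abstractmethod


class ExperimentPlugin(ABC):
    """Hypothetical interface for the experiment-specific parts of HC."""

    @abstractmethod
    def choose_input(self, site):
        """Test generation hook: pick the input data for a job at `site`."""

    @abstractmethod
    def extract_metrics(self, job_info):
        """Test running hook: pull experiment-specific metrics from a job."""


class LHCbPlugin(ExperimentPlugin):
    def __init__(self, discover, rng=None):
        # `discover` is an assumed callable: site name -> list of dataset names.
        self.discover = discover
        self.rng = rng or random.Random()

    def choose_input(self, site):
        candidates = self.discover(site)
        if not candidates:
            raise LookupError(f"no input data known for {site}")
        return self.rng.choice(candidates)  # random input data, as on the slide

    def extract_metrics(self, job_info):
        # Keep only the (illustrative) keys this plugin cares about.
        wanted = ("cpu_time", "wall_time", "events")
        return {key: job_info[key] for key in wanted if key in job_info}


plugin = LHCbPlugin(lambda site: ["dataset_1", "dataset_2"], random.Random(1))
chosen = plugin.choose_input("SITE_A")
metrics = plugin.extract_metrics({"cpu_time": 12.5, "hostname": "wn01"})
```

Passing the discovery function in as an argument keeps the sketch independent of any particular service and makes the hook easy to exercise without a grid.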
Interface Tour
1. The Public User Interface
HC Home
• The HC Homepage lists the running and scheduled tests.
Viewing a Test
• The test overview gives a quick summary of overall job efficiency, CPU/Walltime, and Events/WrapperTime
• It also shows a summary of the jobs running at each site involved in the test.
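The three overview numbers can be computed from per-job records roughly as follows. The slide does not define the metrics precisely, so this sketch assumes that efficiency is the fraction of jobs that succeeded and that the two ratios are taken over the summed values of successful jobs; JobRecord, summarize and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    succeeded: bool
    cpu_s: float = 0.0      # CPU seconds used
    wall_s: float = 0.0     # wall-clock seconds
    events: int = 0         # events processed
    wrapper_s: float = 0.0  # seconds spent inside the job wrapper


def summarize(records):
    """Return the three overview metrics, or None where undefined."""
    ok = [r for r in records if r.succeeded]
    wall = sum(r.wall_s for r in ok)
    wrapper = sum(r.wrapper_s for r in ok)
    return {
        "efficiency": len(ok) / len(records) if records else None,
        "cpu_over_wall": sum(r.cpu_s for r in ok) / wall if wall else None,
        "events_per_wrapper_s": sum(r.events for r in ok) / wrapper if wrapper else None,
    }


records = [
    JobRecord(True, cpu_s=30.0, wall_s=60.0, events=600, wrapper_s=120.0),
    JobRecord(True, cpu_s=20.0, wall_s=40.0, events=400, wrapper_s=80.0),
    JobRecord(False),
    JobRecord(True, cpu_s=50.0, wall_s=100.0, events=1000, wrapper_s=200.0),
]
summary = summarize(records)
```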
Viewing a Test: Summary Stats
• The Test Overview page also gives summary statistics by site
• Here you can see some example metrics (for CMS)
Viewing a Test: Per-Site Plots
• View plots of the recorded metrics for each site
Viewing a Test: Metric Comparisons
• View the plots for all sites for a specific metric
• Used to compare sites side by side
Modify a Running Test
• Authorized users can modify the parameters of a test at run time
  – E.g. change the end time, or the number of running jobs per site
Clone a Previous Test
• Cloning a previous test is simple
  – Useful to repeat the test or to run an identical test at a different set of sites
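Cloning can be thought of as copying a test's parameters and overriding only what changes. The following is a minimal sketch, assuming a hypothetical HCTestConfig record and clone helper; it is not how the HC web interface stores or copies tests.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class HCTestConfig:
    template: str
    sites: tuple
    jobs_per_site: int
    start: datetime
    end: datetime


def clone(test, new_start, **overrides):
    """Copy a previous test, keeping its duration unless overridden."""
    fields = {"start": new_start, "end": new_start + (test.end - test.start)}
    fields.update(overrides)
    return replace(test, **fields)


original = HCTestConfig("stress_example", ("SITE_A", "SITE_B"), 50,
                        datetime(2010, 5, 1, 8, 0), datetime(2010, 5, 2, 8, 0))

# Repeat the identical test at a different set of sites.
copy = clone(original, datetime(2010, 6, 3, 8, 0), sites=("SITE_C",))
```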
Overall HC Plots
• Historical plots show previous test statistics
• Currently they show the number of running jobs per site. Plots showing the evolution of the performance metrics are in development.
HC Robot View
• The “Robot” view shows the success rates of functional test jobs over the past 24 hours (similar to SSB)
• Clicking a site takes you to the list of Robot jobs executed at that site
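The per-site success rate over a 24-hour window is a simple aggregation. A minimal sketch, assuming each functional test job is available as a hypothetical (site, finish time, succeeded) tuple; success_rates is an illustrative name, not an HC function.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def success_rates(jobs, now, window=timedelta(hours=24)):
    """Return {site: fraction of jobs that succeeded} within the window."""
    counts = defaultdict(lambda: [0, 0])  # site -> [succeeded, total]
    for site, finished_at, ok in jobs:
        if now - window <= finished_at <= now:
            counts[site][0] += int(ok)
            counts[site][1] += 1
    return {site: good / total for site, (good, total) in counts.items()}


now = datetime(2010, 6, 3, 12, 0)
jobs = [
    ("SITE_A", now - timedelta(hours=1), True),
    ("SITE_A", now - timedelta(hours=2), False),
    ("SITE_A", now - timedelta(hours=30), False),  # outside the window
    ("SITE_B", now - timedelta(hours=23), True),
]
rates = success_rates(jobs, now)
```

Sites with no jobs in the window are simply absent from the result, which avoids dividing by zero.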
Interface Tour
2. Admin Interface
HC Admin: Operator and User Views
• HC Operators can administer all tables in the HC DB via a web interface
• HC Users have more limited access
HC Admin: Tests and Templates
Above: List all Test Templates
Below: List all Tests
HC Admin: Edit a Test Template
• Test templates are defined via the Admin UI
• All of the parameters of a test are here, plus:
  – An active flag indicating that a template should be auto-scheduled
  – A default lifetime: auto-scheduled test instances of this template will run for this time period
• Normally, functional test templates include the list of sites to be tested, whereas stress test templates do not include a list of sites.
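The interplay of the active flag and the default lifetime can be sketched as a scheduling pass over the templates. Everything here is an assumption made for illustration: Template, ScheduledTest and schedule_active are hypothetical names, and the rule "one running instance per active template" is a guess at the policy, not something the slides state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Template:
    name: str
    active: bool         # should this template be auto-scheduled?
    lifetime: timedelta  # how long an auto-scheduled instance runs
    sites: tuple = ()    # typically empty for stress test templates


@dataclass
class ScheduledTest:
    template: str
    start: datetime
    end: datetime
    sites: tuple


def schedule_active(templates, running, now):
    """Create an instance for each active template with none running."""
    return [
        ScheduledTest(t.name, now, now + t.lifetime, t.sites)
        for t in templates
        if t.active and t.name not in running
    ]


now = datetime(2010, 6, 3, 0, 0)
templates = [
    Template("functional_a", True, timedelta(days=1), ("SITE_A", "SITE_B")),
    Template("functional_b", True, timedelta(hours=12), ("SITE_C",)),
    Template("stress_example", False, timedelta(days=2)),
]
new_tests = schedule_active(templates, running={"functional_b"}, now=now)
```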
HC Admin: Adding a new Test
• Adding a new test on-demand is simple. Select the test template of interest, a start time, and an end time.
• If needed, Tests can be further customized after the template is copied over.
Summary
• HammerCloud is a DA functional and stress testing system used widely by ATLAS and coming soon for CMS
• Two basic use-cases:
  – Continuous stream of test jobs to measure site availability
  – Enable central managers to define standardized (stress) tests, and empower site managers to invoke those tests on-demand.
• An HC-LHCb plugin would leverage the existing GangaLHCb work
  – A prototype plugin would not take significant effort