1
A lightweight Monitoring and Accounting system for LHCb DC'04 production
V. Garonne, R. Graciani Díaz, J. J. Saborido Silva, M. Sánchez García, R. Vizcaya Carrillo
2
Outline
- Manifesto
- Monitoring
  - Web interface
  - Internals
- Accounting
  - Web interface
  - Internals
- Outlook
- URLs
3
Manifesto
- Monitoring and Accounting are tasks in DIRAC
  - DIRAC is a production grid for LHCb
- The Monitoring reports the status of jobs while they are in the WMS (Workload Management System)
  - Instantaneous snapshot of the system
  - No historic records
- The Accounting records the status of jobs after they leave the WMS
  - Provides a historic record, accumulated statistics and the evolution of recorded variables with time
- Main users: production and site managers
4
Design choices
- Monitoring
  - Job information is stored centrally in the WMS
  - Information is provided directly by the job and the WMS
  - Passive services: no pushing of information
  - No need for a common consumer API
  - Job and application state are stored together
- Accounting
  - Separate infrastructure from the Monitoring
  - A job is never in the Accounting and in the Monitoring at the same time
  - Domain specific: LHCb production jobs
5
Information Flow
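[Diagram: information flow inside DIRAC. Labelled elements: Job (with Job Heart-beat), WMS, Monitoring (Read / Write), Job Database, Cleaner Agent, Accounting (Write / Read), Accounting Database, one Web interface for the Monitoring and one for the Accounting, and Users, Backend Services & Agents.]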
6
Monitoring Web Interface 1
- Interface to query the monitoring service
- Clicking a JobId pops up a window with the job details
7
Monitoring Web Interface 2
- The overview shows predefined plots on the production
  - Generated every few minutes
- PyChart used as graphics engine
  - 100% Python
  - Supports SVG
[Plot: Running jobs by site]
8
Monitoring Web Interface 3
[Plot: Job status by site and production id]
9
Monitoring Internals
- It consists of an XML-RPC service exposing whatever parameters are known to DIRAC
- Job parameters are stored internally by DIRAC
- Primary parameters
  - Execution site, job status, job owner, etc.
  - Fixed, centrally defined: fast access
  - Can be used in queries
- Secondary parameters
  - Number of steps, internal job state, etc.
  - Defined by the production job itself
  - Stored as key-value pairs
  - Slower access; cannot be used in queries
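The split between primary and secondary parameters can be illustrated with a minimal sketch. It is not DIRAC's actual schema: the table names, column names, the `add_job`, `get_jobs` and `get_job_parameter` helpers and the sample values are all assumptions made for illustration. Primary parameters live in fixed, indexed columns and can be queried; secondary ones are stored as key-value pairs and can only be looked up per job.

```python
import sqlite3

# Hypothetical schema (names are illustrative, not DIRAC's).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (            -- primary parameters: fixed columns
        job_id INTEGER PRIMARY KEY,
        site   TEXT,
        status TEXT,
        owner  TEXT
    );
    CREATE INDEX idx_jobs_site_status ON jobs (site, status);
    CREATE TABLE job_parameters (  -- secondary parameters: key-value pairs
        job_id INTEGER REFERENCES jobs (job_id),
        name   TEXT,
        value  TEXT,
        PRIMARY KEY (job_id, name)
    );
""")

# Map the query keys used on the slides to the assumed column names.
PRIMARY = {"Site": "site", "Status": "status", "Owner": "owner"}


def add_job(job_id, site, status, owner, **secondary):
    """Store primary parameters in columns, the rest as key-value rows."""
    conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                 (job_id, site, status, owner))
    conn.executemany("INSERT INTO job_parameters VALUES (?, ?, ?)",
                     [(job_id, k, str(v)) for k, v in secondary.items()])


def get_jobs(conditions):
    """Select job ids; only primary parameters are accepted as conditions."""
    unknown = set(conditions) - set(PRIMARY)
    if unknown:
        raise ValueError("cannot query on secondary parameters: %s" % sorted(unknown))
    where = " AND ".join("%s = ?" % PRIMARY[k] for k in conditions) or "1"
    rows = conn.execute("SELECT job_id FROM jobs WHERE %s ORDER BY job_id" % where,
                        list(conditions.values()))
    return [row[0] for row in rows]


def get_job_parameter(job_id, name):
    """Look up one secondary parameter of one job (no selection by value)."""
    row = conn.execute(
        "SELECT value FROM job_parameters WHERE job_id = ? AND name = ?",
        (job_id, name)).fetchone()
    return row[0] if row else None


add_job(1, "DIRAC.CERN.ch", "running", "prod", LocalBatchId="b-1", Steps=3)
add_job(2, "DIRAC.CERN.ch", "done", "prod", LocalBatchId="b-2", Steps=3)
running_at_cern = get_jobs({"Status": "running", "Site": "DIRAC.CERN.ch"})
```

In this sketch, a query on a secondary parameter such as `LocalBatchId` is rejected outright, which mirrors the "cannot query on them" restriction stated on the slide.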
10
JMS basic API example
    from xmlrpclib import ServerProxy  # Python 2 standard library

    server = ServerProxy(monitoring_url)

    # Retrieve the list of jobs satisfying some conditions
    conditions = {'Status': 'running', 'Site': 'DIRAC.CERN.ch'}
    jobreq = server.getJobs(conditions)

    # Print some parameters for each job
    if jobreq['Status']:
        for jobid in jobreq['Value']:
            print server.getJobSite(jobid)
            print server.getJobParameter(jobid, 'LocalBatchId')

    # Bulk operations
    summary = server.getJobsPrimarySummary(jobreq['Value'])

~3 s to select 95 out of 50k jobs
~0.7 s
~40 s
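The client calls above can be tried without access to the real service. The following Python 3 sketch starts a small mock XML-RPC server on localhost and runs the same call pattern against it. Only the four method names and the `Status`/`Value` reply keys come from the slide; the mock implementations, the in-memory `JOBS` table, the second site name and the shape of the bulk summary are assumptions.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Mock job table (illustrative data, not real DC'04 jobs).
JOBS = {
    101: {"Status": "running", "Site": "DIRAC.CERN.ch", "LocalBatchId": "b-1"},
    102: {"Status": "done", "Site": "DIRAC.CERN.ch", "LocalBatchId": "b-2"},
    103: {"Status": "running", "Site": "DIRAC.Other.org", "LocalBatchId": "b-3"},
}


def getJobs(conditions):
    """Return ids of jobs whose parameters match every condition."""
    ids = sorted(j for j, p in JOBS.items()
                 if all(p.get(k) == v for k, v in conditions.items()))
    return {"Status": True, "Value": ids}


def getJobSite(jobid):
    return JOBS[jobid]["Site"]


def getJobParameter(jobid, name):
    return JOBS[jobid][name]


def getJobsPrimarySummary(jobids):
    """One call for many jobs; XML-RPC struct keys must be strings."""
    return {str(j): {"Site": JOBS[j]["Site"], "Status": JOBS[j]["Status"]}
            for j in jobids}


# Port 0 lets the operating system pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
for fn in (getJobs, getJobSite, getJobParameter, getJobsPrimarySummary):
    server.register_function(fn)
threading.Thread(target=server.serve_forever, daemon=True).start()
monitoring_url = "http://127.0.0.1:%d" % server.server_address[1]

proxy = ServerProxy(monitoring_url)
jobreq = proxy.getJobs({"Status": "running", "Site": "DIRAC.CERN.ch"})

# Per-job access: one round trip per call and per job.
sites = [proxy.getJobSite(jobid) for jobid in jobreq["Value"]]
batch_ids = [proxy.getJobParameter(jobid, "LocalBatchId") for jobid in jobreq["Value"]]

# Bulk access: a single round trip for the whole selection.
summary = proxy.getJobsPrimarySummary(jobreq["Value"])

server.shutdown()
server.server_close()
```

The per-job loop issues two requests per selected job, while the bulk call issues one request in total, which is the motivation for bulk operations in a service of this kind.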
11
Accounting Web Interface 1
- GUI for querying the Accounting
- Shows results
  - As graphics
  - As table
  - As Excel sheet
- Several types of report
  - Only a few shown here
12
Accounting Web Interface 2
Used resources by site
13
Accounting Web Interface 3
- Used resources by event type
  - Mb/job
  - CPU/job
  - Failed jobs
  - CPU vs. Exec time
  - Input and Output data vs. CPU
14
Accounting Web Interface 4
- Produced data by production ID
  - Rates
  - Cumulative
  - Number of events
  - Gb of output
15
Accounting Web Interface 5
- WMS statistics on DIRAC's performance
- Plots
  - Job execution time vs. WMS waiting time
  - Job execution time vs. WMS matching time
- Granularity
  - Per site
  - Per production
  - Integral
- Allows assessment of DIRAC's performance
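The three granularities can be sketched as one grouping function applied to the same records. The record fields (`site`, `production`, `exec_s`, `wait_s`), the `mean_times` helper, the site names and the numbers are hypothetical and only illustrate the idea of aggregating execution and waiting times per site, per production, or over everything.

```python
from collections import defaultdict

# Hypothetical accounting records (times in seconds).
records = [
    {"site": "SiteA", "production": 1, "exec_s": 100, "wait_s": 10},
    {"site": "SiteA", "production": 2, "exec_s": 200, "wait_s": 30},
    {"site": "SiteB", "production": 1, "exec_s": 300, "wait_s": 20},
]


def mean_times(rows, key=None):
    """Mean (execution, waiting) time per group; key=None gives the integral."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key] if key else "all"].append(row)
    return {
        name: (sum(r["exec_s"] for r in members) / len(members),
               sum(r["wait_s"] for r in members) / len(members))
        for name, members in groups.items()
    }


per_site = mean_times(records, "site")
per_production = mean_times(records, "production")
integral = mean_times(records)
```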
16
Accounting Internals
- Job and DIRAC statistics are kept in a database
  - Site contribution
  - Data produced and used by jobs and steps
  - Timing for jobs, steps and DIRAC internals
- Separate XML-RPC interfaces to populate and query the accounting tables
  - Both interfaces have restricted access
- Jobs are moved to the accounting system by a cleaner agent after being validated
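A minimal sketch of the cleaner step, using plain dictionaries in place of the two databases. The slides do not say what validation involves, so the `validate` rule below (the job has reached a final status) is an assumption, as are the function names and the sample data. The point illustrated is that a job is moved, not copied, so it never sits in both stores.

```python
FINAL_STATES = ("done", "failed")  # assumed set of final job states


def validate(job):
    """Hypothetical check: only jobs in a final state may be moved."""
    return job["status"] in FINAL_STATES


def run_cleaner(job_db, accounting_db):
    """Move every validated job from the job store to the accounting store."""
    moved = []
    for job_id in sorted(job_db):  # sorted() copies the keys, so pop() is safe
        if validate(job_db[job_id]):
            accounting_db[job_id] = job_db.pop(job_id)
            moved.append(job_id)
    return moved


job_db = {
    1: {"status": "running", "site": "DIRAC.CERN.ch"},
    2: {"status": "done", "site": "DIRAC.CERN.ch"},
    3: {"status": "failed", "site": "DIRAC.CERN.ch"},
}
accounting_db = {}
moved = run_cleaner(job_db, accounting_db)
```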
17
Accounting Usage
- About 10 hits per day
- Time to generate daily static reports: 8 min
  - 60-70% of the time querying the database
  - 30-40% of the time in the drawing package
- Server load < 0.2
[Plot label: Total: 169 kjobs]
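The quoted percentages translate into minutes as follows. The 8 min total and the two percentage ranges come from the slide; the factor-of-two speed-up of the drawing package is a purely hypothetical what-if, added to show how much a faster rendering package could save.

```python
total_min = 8.0
db_share = (0.60, 0.70)    # fraction of time spent querying the database
draw_share = (0.30, 0.40)  # fraction of time spent in the drawing package

db_min = tuple(round(total_min * s, 1) for s in db_share)
draw_min = tuple(round(total_min * s, 1) for s in draw_share)

# Hypothetical what-if: the drawing step becomes twice as fast.
speedup = 2.0
new_total_min = tuple(round(total_min - d * (1 - 1 / speedup), 1) for d in draw_min)
```

Under these assumptions the database queries take roughly 4.8 to 5.6 min and the drawing 2.4 to 3.2 min, so halving the drawing time would bring the total down to about 6.4 to 6.8 min.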
18
Outlook
- Monitoring page
  - Transactions in monitoring updates
  - Further optimisation (bulk operations...)
  - Search for a faster rendering package
  - Make the web page dynamic: fewer reloads
- Accounting
  - New report types
    - Normalized CPU
    - Contribution by country
    - Rate by site, country, etc.
19
URLs
- Monitoring page: http://fpegaes1.usc.es/dmon/DC04/joblist.html
  - Mirror on: http://lhcb02.usc.cesga.es/dmon/DC04/joblist.html
  - Direct link to overview pages: http://lhcb.ecm.ub.es/DC04/Monitoring
- Accounting page: http://lhcb.ecm.ub.es/DC04/Accounting/