Monitoring and Accounting
Dave Kant, CCLRC e-Science Centre, UK
GridPP 12 Jan 31st - Feb 1st 2005
Overview
1. GOC Database
2. Monitoring Tools
3. Accounting
4. Issues
5. Future Plans
GOC Database
– What features?
  • Configuration of monitoring tools
  • Security
  • Organisations
  • Administrative Roles
  • Replication
– What role will it play in the future?
  • New site registration procedure
  • BDII generation
GRID Configuration Database
GOCDB
[Diagram labels: GridSite, MySQL; Resource Centre resources & site information; EDG, LCG-1, LCG-2, …; ce, se, bdii, rb]
Monitoring Services
• Operations Maps
• Configure other Tools
• Resource Provider
• Organisation Structures
• Secure services
- Site News
- Self Certification
- Accounting
Secure Database Management via HTTPS / X.509
Store a Subset of the Grid Information system
People, Contact Information, Resources
Maintenance Bit
[Diagram labels: RC, SQL, https, SERVER]
GOC DB can also contain information that is not present in the IS, such as: scheduled maintenance; news; organisational structures; geographic coordinates for maps.
EGEE ROC Structure
• EGEE is made up of regions.
• Each region contains many computing centres.
• Regional Operational Centres are a focus for operational activities.
• Developed a tool to manage organisational structures, modelled on the GridPP Tier1/2 structure
• Materialised path encoding
• Provide ROCs with a package to monitor the resources in their region
  – Tailored monitoring
  – Administrative roles for the coordinators in GOCDB
Organisational Structures
EGEE (1)
France (1.1) UK/I (1.2) S.E.E (1.3)
GridPP (1.2.1)
LondonT2
ScotGrid
IMPERIAL
QMUL
Edinburgh
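The materialised path encoding in the tree above can be sketched in a few lines of Python. This is only an illustration, not the GOCDB implementation: the paths for EGEE, France, UK/I, S.E.E and GridPP are the ones shown on the slide, while the paths for the Tier2s and sites (and their placement under LondonT2 and ScotGrid) are assumptions made for the example.

```python
# Each node stores the full path from the root, so a subtree is just a
# string prefix. Only the first five paths appear on the slide; the
# others are assumed for illustration.
ORG_PATHS = {
    "EGEE": "1",
    "France": "1.1",
    "UK/I": "1.2",
    "S.E.E": "1.3",
    "GridPP": "1.2.1",
    "LondonT2": "1.2.1.1",     # assumed
    "ScotGrid": "1.2.1.2",     # assumed
    "IMPERIAL": "1.2.1.1.1",   # assumed
    "QMUL": "1.2.1.1.2",       # assumed
    "Edinburgh": "1.2.1.2.1",  # assumed
}


def descendants(name, paths=ORG_PATHS):
    """Return every node strictly below `name`, using a prefix match."""
    # The trailing dot stops "1.1" from matching a sibling such as "1.10".
    prefix = paths[name] + "."
    return sorted(n for n, p in paths.items() if p.startswith(prefix))


def ancestors(name, paths=ORG_PATHS):
    """Return the chain of parents of `name`, from the root downwards."""
    by_path = {p: n for n, p in paths.items()}
    parts = paths[name].split(".")
    return [by_path[".".join(parts[:i])] for i in range(1, len(parts))]
```

With this layout, `descendants("GridPP")` gives the Tier2s and sites a GridPP coordinator would monitor, and `ancestors("QMUL")` gives the organisations whose administrative roles could apply to that site.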
• Total list of all sites is derived from GOCDB (via RGMA)
• GOC bit: sites which have opted out, e.g. for scheduled maintenance
• White list: sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site
• Core tests are a subset of the site functional tests run by CERN every day
• Black list: sites that are not trusted
100’s of Sites
Monitoring Services
Total List of all sites
Sites pass core tests
Trusted Sites
Black List
White List BDII
RGMA
GOC Bit
• GOC DB Site info
• Gstat Data
• Site Functional Tests
• GOC Hourly Tests
Generation of BDII configuration file via feedback into IS
Adaptive Job Brokering Based on the Monitoring System
Environments Production, VO, GridPP, …
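The filtering chain on this slide can be read as plain set arithmetic. The sketch below is one possible reading, not the actual BDII generator: the function name, the argument names and the order of precedence (opt-out beats white list, black list beats everything) are assumptions.

```python
def trusted_sites(all_sites, goc_bit, passed_core, white_list, black_list):
    """Derive the site list for a BDII configuration file.

    all_sites   -- total list of sites from GOCDB
    goc_bit     -- sites that have opted out (e.g. scheduled maintenance)
    passed_core -- sites that passed the core tests
    white_list  -- well-supported sites put back in despite a failure
    black_list  -- sites that are not trusted
    """
    # Opted-out sites are dropped first (assumed to take precedence).
    candidates = set(all_sites) - set(goc_bit)
    passing = candidates & set(passed_core)
    restored = candidates & set(white_list)
    # The black list is applied last, so it overrides the white list.
    return sorted((passing | restored) - set(black_list))
```

For example, with hypothetical sites A to E, where B has opted out, A and E pass the core tests, C is white-listed and E is black-listed, the result is `["A", "C"]`.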
How Are New Sites Added?
Site
ROC
GOCDB
[1] Site and ROC liaise
EGEE
1. JSPG have written a “Site Registration Policy & Procedure” document
2. https://edms.cern.ch/document/503198/
3. New GOCDB portal to streamline the site registration process.
[2] “candidate” site
[3] Site installs middleware
[4] “uncertified” site
[5] Certification testing
[6] “certified” site
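The numbered steps describe a small state machine: a site moves from candidate to uncertified to certified. A minimal sketch follows; the event names and the `advance` helper are hypothetical and are not taken from the GOCDB portal.

```python
from enum import Enum


class SiteState(Enum):
    CANDIDATE = "candidate"      # step [2]
    UNCERTIFIED = "uncertified"  # step [4]
    CERTIFIED = "certified"      # step [6]


# event -> (state the site must be in, state it moves to)
# Event names are assumptions that stand for steps [3] and [5].
TRANSITIONS = {
    "middleware_installed": (SiteState.CANDIDATE, SiteState.UNCERTIFIED),
    "certification_passed": (SiteState.UNCERTIFIED, SiteState.CERTIFIED),
}


def advance(state, event):
    """Apply one event, rejecting steps taken out of order."""
    required, next_state = TRANSITIONS[event]
    if state is not required:
        raise ValueError(
            f"{event!r} needs state {required.value!r}, got {state.value!r}"
        )
    return next_state
```

Encoding the order explicitly means a site cannot be marked certified before it has installed the middleware and passed certification testing.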
Replication
Two replicas, each with different security considerations:
• “Services” replica managed by Taipei
  – Direct connections to the database by the monitoring tools from known hosts
• “Users” replica to be set up at IN2P3
  – Web portal based on X.509 certificates
  – CIC on duty
Monitoring Tools
• What are the main tools that are used in the day-to-day operations of the LCG Grid?
  – GPPMON
  – GSTAT
  – Site Functional Tests
• Other monitoring tools exist, but I won’t discuss them here
  – GridIce
Operations Map – Job Submission Tests
GPPMON
Displays the results of tests against sites.
Test: Job Submission
The job is a simple test of the grid middleware components, e.g. the Gatekeeper service, the RB service, and the Information System via JDL requirements.
This kind of test checks the functional behaviour of core grid services: do simple jobs run? These are lightweight tests which run hourly. However, they have certain limitations, e.g. Dteam VO; WN reach (specialised monitoring queues).
Operations Map – Certificate Lifetime
Test: Certificate Lifetime
Many grid services require a valid certificate for security.
By probing the host certificates on CEs and SEs at sites with a simple SSL client, we can identify certificates which are due to expire and send the site an early warning. A predictive tool!
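A certificate lifetime probe of this kind can be sketched with Python's standard `ssl` module. This is not the GPPMON code: the function names and the 30-day threshold are assumptions, and `probe_host` assumes the host certificate chains to a CA in the local trust store (grid CAs would need to be installed there), since `getpeercert()` returns an empty dict for unverified certificates.

```python
import socket
import ssl
import time


def days_remaining(not_after, now=None):
    """Whole days until a certificate's notAfter time (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires - now) // 86400)


def needs_warning(not_after, threshold_days=30, now=None):
    """True when the certificate expires within `threshold_days` (assumed 30)."""
    return days_remaining(not_after, now) < threshold_days


def probe_host(host, port, timeout=10.0):
    """Open a TLS connection and return the peer certificate's notAfter string."""
    context = ssl.create_default_context()  # verifies against local trust store
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitoring loop would call `probe_host` for each CE and SE listed in GOCDB and raise a warning for every host where `needs_warning` is true.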
GIIS Monitor
• Developed by Min Tsai (GOC Taipei)
• Tool to display and check information published by the site GIIS (sanity checks, fault detection)
• http://goc.grid.sinica.edu.tw/gstat/
Regional Plot:
http://map.gridpp.ac.uk
Site Certification Service
• In terms of middleware, the installation and configuration of a site is quite a complicated procedure.
  – When there is a new release, sites don’t upgrade at the same time
  – Some upgrades don’t always go smoothly
  – Unexpected things happen (who turned off the power?)
  – Day-to-day problems; robustness of service under load?
• It’s necessary to actively hunt for problems
• Site certification testing is run by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around and delete them, and perform 3rd party copies from a remote SE.
• Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.
Certification Test Results
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
Aggregator RSSReader (Windows Client)
GOC generates RSS feeds which clients can pull using an RSS aggregator.
How can we integrate feeds and ticketing systems?
Syndication of Monitoring Information
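To make the syndication idea concrete, here is a minimal sketch of how monitoring results could be turned into an RSS 2.0 document that any aggregator can pull. It is an illustration only: the function name, the item fields and the example URL are assumptions, not the GOC feed format.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, items):
    """Build a minimal RSS 2.0 feed from a list of {title, description} dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "description").text = item["description"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical usage: one item per failed test.
feed_xml = build_feed(
    "GOC monitoring (example)",
    "http://example.org/goc/feed",  # placeholder URL, not a real GOC endpoint
    [{"title": "Job submission failed", "description": "Hypothetical site X"}],
)
```

Because each item is a small structured record, the same data could in principle be posted to a ticketing system as well as read in an aggregator.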
Real Time Grid Monitor
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
A Visualisation tool to track jobs currently running on the grid.
Applet queries the logging and bookkeeping service to get information about grid jobs.
Why are jobs failing?
Why are jobs queued at sites while others are empty?
Problems with Existing Tools
• Lots of monitoring tools around which have things in common:
  – all the information which they generate is hidden away or difficult to access
  – limited interfaces: the data can only be accessed in specific ways
• Therefore, it’s difficult to build “on-demand” services that allow communities (“players”) to interact with the data.
• The idea is for the services to collect information and put it into a common repository such as an RGMA Archiver. In this way, the information can be shared and made accessible to all.
• Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community.
• How much CPU in the UKI ROC?
  – How much in GridPP?
    • How much in each Tier2?
=> Integrate data from different sources to provide this information
Monitoring Paradigm
A Better way to unify monitoring information.
GOC Services collect information and publish into an archiver.
ROC/CIC Services provide a means for the community to interact with this information on-demand. GOC provides services tailored to the requirements of the community.
Information Repository (RGMA)
Accounting
Monitoring
GSTAT
Testing
ROC Services
Self Certification
CIC Services
Communities
VOs
ROCs
EGEE
Sites
Organisations
GOC Services
Use Cases
• Monitoring services which use RGMA as the backbone for data transport and data location via the registry service.
  – Grid Event Monitoring System
  – “Site Functional Test” Reporting Tool
  – Accounting
Use Cases - GEMS
• Grid Event Monitoring System
• List of resources to monitor is provided by GOCDB
Alert system that uses RGMA
Looks for changes of state in the monitoring data tables
Generates an alert and displays on the GEMS console.
Notification features
Event filtering
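The core of such an alert system is a comparison between two snapshots of the monitoring data. The sketch below shows that idea only; it does not use the R-GMA API, and the snapshot format, function names and state labels are assumptions.

```python
def detect_changes(previous, current):
    """Compare two {resource: state} snapshots and return one alert per change."""
    alerts = []
    for resource in sorted(current):
        old, new = previous.get(resource), current[resource]
        # A resource seen for the first time has no earlier state to compare.
        if old is not None and old != new:
            alerts.append({"resource": resource, "old": old, "new": new})
    return alerts


def filter_alerts(alerts, to_states):
    """Event filtering: keep only alerts whose new state is of interest."""
    return [a for a in alerts if a["new"] in to_states]
```

A console could call `detect_changes` on every polling cycle and pass the result through `filter_alerts(alerts, {"FAILED"})` so that only transitions into a failed state trigger a notification.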
Reporting Tool Prototype
Organisational Identities taken from GOCDB
Accounting
• Information collected at each site from batch logs, gatekeeper logs etc.
• Information joined at site level to select grid jobs and stored in a database on the R-GMA MON box at the site.
• Information published through R-GMA and collected centrally in an R-GMA archive at GOC
• Web site presents various views of this data
• Information schema based on GGF Usage Group
• Structure of Grid taken from GOC DB, the grid configuration database.
• Only normalised CPU time collected (at the moment)
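The site-level join can be pictured as matching batch records against gatekeeper records and keeping only the jobs that appear in both. The sketch below is not the APEL implementation: the record fields, the join key `local_job_id` and the `cpu_scale` normalisation factor are assumptions, since the slides do not give the log formats or the normalisation formula.

```python
def select_grid_jobs(batch_records, gatekeeper_records, cpu_scale=1.0):
    """Join batch and gatekeeper records to keep grid jobs only.

    batch_records      -- dicts with 'local_job_id' and 'cpu_seconds'
    gatekeeper_records -- dicts with 'local_job_id' and 'vo'
    cpu_scale          -- assumed site-specific normalisation factor
    """
    vo_by_job = {g["local_job_id"]: g["vo"] for g in gatekeeper_records}
    joined = []
    for b in batch_records:
        vo = vo_by_job.get(b["local_job_id"])
        if vo is None:
            continue  # no gatekeeper entry: a local job, not a grid job
        joined.append({
            "local_job_id": b["local_job_id"],
            "vo": vo,
            "normalised_cpu_seconds": b["cpu_seconds"] * cpu_scale,
        })
    return joined
```

Only the joined records would then be published, which keeps purely local batch usage out of the central archive.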
GOC Accounting Services
http://goc.grid-support.ac.uk/gridsite/accounting/index.html
BaseCpuSeconds Aggregated across EGEE
Each Site, per VO, per Month
Simple interface to customise views of data: VO, time frame and Region (default = EGEE)
Each Region, per VO, per Month
On Demand Services to EGEE Community
Other Distributions
Normalised CPU
# Jobs
Web form to apply selection criteria on the data
Aggregate data across an organisation structure
(Default= All ROCs)
Select VOs (Default = All)
Select date range
VO Index
Summed CPU (Seconds) consumed by resources in selected Region
Selected Date Range
List of Sites Belonging to the Selected ROC
A breakdown of the resource usage per Site, per VO, per Month
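The per-site, per-VO, per-month breakdown, with the selection criteria from the web form, amounts to a filtered group-by. A minimal sketch under assumed record fields (`site`, `vo`, `month` as a `YYYY-MM` string, `cpu_seconds`) follows; it is not the query the accounting web site runs.

```python
from collections import defaultdict


def summarise(records, sites=None, vos=None, start=None, end=None):
    """Sum CPU seconds per (site, VO, month), applying optional filters.

    sites/vos  -- collections to keep, or None for all (the form's default)
    start/end  -- inclusive 'YYYY-MM' bounds, or None for no bound
    """
    totals = defaultdict(int)
    for r in records:
        if sites is not None and r["site"] not in sites:
            continue
        if vos is not None and r["vo"] not in vos:
            continue
        # 'YYYY-MM' strings sort chronologically, so plain comparison works.
        if start is not None and r["month"] < start:
            continue
        if end is not None and r["month"] > end:
            continue
        totals[(r["site"], r["vo"], r["month"])] += r["cpu_seconds"]
    return dict(totals)
```

Passing the sites that belong to one ROC as `sites` gives the regional view; leaving every filter as `None` corresponds to the default of all ROCs and all VOs.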
Deployment
• Package was released to LCG in August 2004 and certified soon afterwards.
• There was no LCG release after that until LCG2_3_0 on 18th December 2004
• Today there are still very few 2_3_0 sites. There are 28 sites producing accounting records today.
• The 2_3_0 release has some bugs which are fixed in a new release that is available on the accounting home page
• Recommend that sites upgrade accounting to version APEL 3.4.40 available on the accounting homepage
http://goc.grid-support.ac.uk/gridsite/accounting/index.html
Future Plans
• Support for the LSF batch system.
• Understand normalisation issues; do we have faith in the numbers we present?
• Extend accounting schema to include information about the worker node, job efficiency and globalJobID.
• Integrate the LCG schema with de-facto grid accounting standards, namely GGF
  – Share data with other Grid communities
    • NorduGrid, Grid03
Summary
• GOCDB to take a more important role in the operations environment
• A shift in the monitoring paradigm which relies on sharing data through RGMA
• Accounting Information gathering infrastructure and reporting web site
• Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.
• Development of Visualisation tools to enhance our understanding of the grid.
• Adaptive Job brokering based on the monitoring system