Experience with Site Functional Tests
Piotr Nyczyk
CERN IT/GD
WLCG Service Workshop
Mumbai, 10-12 February 2006
Outline
• Introduction
• Brief history
• Current status
• Data schema
• Site availability metric (prototype)
• Experience
• On-going development
• Conclusions
Introduction
• Motivation: local, site-level monitoring is not enough
  – depends on site configuration, local decisions, etc. (no uniform view across all sites)
  – cannot be fully trusted (!)
  – usually focused on low-level details (daemons, disk space) and not on the usability of the service itself
• Solution: high-level, service-oriented monitoring (probing) is needed to see if the grid is really usable
• What is SFT?
  – an “agent” or test job sent to all sites (CEs) on a regular basis (every 3 hours)
  – a set of tests for different aspects of functionality
  – high level: service or functionality oriented
SFT history
• Initial situation (August 2004): about 60 sites
  – Grid split into two “zones”:
    • TestZone - new sites
    • Production - tested and stable sites
• TestZone Tests (tztests) - the predecessor of SFT
  – a bash-script test job based on the manual tests from the LCG release notes: broker info, environment variables, software version, CA RPMs version, edg-rm operations, ...
  – a set of scripts to submit the test job to all sites and retrieve the outputs
  – outputs stored in a directory structure on disk (CERN AFS): key-value pairs + text log
  – a simple CGI script to generate the web report (sites/tests matrix)
SFT history (cont.)
• TZTests -> SFT - just a change of name; reason: no TestZone any more
• Gradual improvements: new tests (R-GMA, ...), stability (detection of RB failures), scheduled maintenance, etc. - a pragmatic approach: detect the most “painful” errors
• But all the time in operations! - immediate deployment
• Integration (at the user-interface level) with GStat
• Next step: SFT -> SFT2 - used until now
  – the whole framework rewritten almost from scratch
• Most important changes:
  – MySQL and R-GMA instead of a directory structure for storage
  – tests executed in parallel
  – results published using an asynchronous web service
  – an additional dimension: VO - possibility of VO-specific tests (example: DIRAC installation test by LHCb)
Current status of SFT
• About 20 tests:
  – General: software version, CA certs version, CSH, BrokerInfo, R-GMA (+ secure conn.), APEL
  – Replica management (lcg-utils + RLS): infosys, copyAndRegister, copy, replicate, 3rd-party replication, delete + a similar set using LFC
  – Security tests (separate restricted zone): CRL validity
• “Official” test jobs submitted in the dteam VO (high-priority queue) every 3 hours
• On-demand submissions when necessary (SFT Admin page in Poznan)
• VO-specific submissions - example: LHCb DIRAC installation test
• Prototype CE availability metric, aggregated at site, region and grid level
SFT workflow
• Tests are performed on/from a single WN
• Results are related to a single CE
• Only “default” nodes of other types are tested (SE, MON, ...)
SFT in operations (CIC-on-duty)
• SFT is the main source of “alarms” for CIC-on-duty
• SFT failures are filtered by CIC Dashboard
• Currently only dteam results are taken into account
• Operators must refer to “original” SFT report for details
Data schema - concept
• Designed for MySQL/R-GMA
• Initial goal: a uniform schema for SFT, GStat and any future tests/sensors - flexibility
• Dimensions:
  – context (VO) - added later - who?
  – sensor (testName) - what?
  – place (nodeName) - where?
  – time (timestamp/MeasurementDate,Time) - when?
• 3 levels of detail for results:
  – status - severity: OK, INFO, NOTE, WARNING, ERROR, CRITICAL
  – summaryData - simple textual information or a value
  – detailedData - details of the test process (log), for troubleshooting
• Possibility of adding new sensors/tests dynamically (discovery)
Data schema - tables
• TestDef - test definitions (added/updated dynamically):
  – key: testName (naming convention; examples: sft-job, gstat-GIISQuery)
  – for users: testTitle, friendlyName, testHelp (URL)
  – for tools: unit, dataType (not used by SFT)
• TestData - results (all in one table):
  – VO
  – testName
  – nodeName
  – envID (additional information on the test session)
  – status
  – summaryData
  – detailedData - in SFT a URL pointing to the LOG file (static HTML)
  – timestamp
Data schema - tables (cont.)
• TestEnv - additional information about the test session:
  – key: session ID - unique, generated for a single session (every 3 hours)
  – name,value pairs
  – published by the framework after the “environment” tests and before the actual test job submissions
  – examples: sft-RB, sft-central-SE
• Additional tables (not really used):
  – TestDefRelation - relations between tests (supertests, subtests)
  – TestDataRelation - relations between results: whether the failure of one test affects other tests
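For illustration, a minimal sketch of the three main tables described above, written as stand-alone Python with sqlite3 so it can be run directly; the production schema lives in MySQL and is published through R-GMA, so the column types, keys and the example row are assumptions rather than the actual DDL.

# Illustrative only: the real schema is MySQL/R-GMA; types and keys are guesses.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TestDef (              -- test definitions, added/updated dynamically
    testName     TEXT PRIMARY KEY,  -- e.g. 'sft-job', 'gstat-GIISQuery'
    testTitle    TEXT,              -- for users
    friendlyName TEXT,
    testHelp     TEXT,              -- URL of the help page
    unit         TEXT,              -- for tools (not used by SFT)
    dataType     TEXT
);
CREATE TABLE TestData (             -- all results in one table
    VO           TEXT,              -- context: who
    testName     TEXT,              -- sensor: what
    nodeName     TEXT,              -- place: where
    envID        TEXT,              -- test session (see TestEnv)
    status       TEXT,              -- OK, INFO, NOTE, WARNING, ERROR, CRITICAL
    summaryData  TEXT,              -- short value or message
    detailedData TEXT,              -- in SFT: URL of the static HTML log
    timestamp    TEXT               -- time: when
);
CREATE TABLE TestEnv (              -- per-session name,value pairs
    envID TEXT,                     -- session ID, one per 3-hourly run
    name  TEXT,                     -- e.g. 'sft-RB', 'sft-central-SE'
    value TEXT
);
""")

# one result as a sensor would publish it (all values below are made up)
conn.execute("INSERT INTO TestData VALUES (?,?,?,?,?,?,?,?)",
             ("dteam", "sft-job", "ce1.badsite.org", "session-42",
              "ERROR", "job aborted", "http://example.org/sft/logs/ce1.html",
              "2006-02-10 09:00:00"))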
Sites Avail. Metric - overview
• Monitoring information from various tools is collected in the R-GMA archiver
• The summary generator calculates the overall status of each monitored object (site, CE, ...) - updated every 1 hour
• The metric generator calculates a numerical value for each monitored object + aggregation (CE → site → region → grid) - updated every 1 day
[Diagram: SFT, GStat, ??? → R-GMA archiver → Summary generator → Metric generator]
Metric - summary generator
• performs a “reduce” operation (using logical “and”) on the results for all VOs, tests and sites
• uses rules defined by the choice of critical tests (FCR) and by the relations between objects (a site owns one or more CEs, etc.)
• summaries are themselves results → a hierarchy of rules, example:
  – SFT critical tests → sft-summary
  – gstat tests → gstat-summary
  – sft-summary + gstat-summary → site-summary
• the operation is repeated every 1 hour: an hourly snapshot for all sites, 24 snapshots per day
Example (summary generator: results + rules → summary):

  VO    | testName     | nodeName        | status | time
  ------+--------------+-----------------+--------+-----
  dteam | sft-job      | ce1.badsite.org | ERROR  | T1
  -     | gstat-GIIS   | BADSITE-LCG2    | OK     | T2
  ...   | ...          | ...             | ...    | ...
  dteam | site-summary | BADSITE-LCG2    | ERROR  | T3
  ...   | ...          | ...             | ...    | ...
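A minimal Python sketch of the “reduce” step applied to the example above; the critical-test list and the site→CE relation are hard-coded stand-ins for what the real generator takes from FCR and the information system.

# Site summary = logical "and" over the critical-test results of the site
# and of the CEs it owns. Test names and topology below are assumptions.
CRITICAL_TESTS = {"sft-job", "gstat-GIIS"}
SITE_CES = {"BADSITE-LCG2": ["ce1.badsite.org"]}

def summarise_site(site, results):
    """results: iterable of (vo, testName, nodeName, status) tuples."""
    nodes = set(SITE_CES[site]) | {site}      # CE results + site-level results
    relevant = [r for r in results
                if r[2] in nodes and r[1] in CRITICAL_TESTS]
    ok = bool(relevant) and all(status == "OK" for _, _, _, status in relevant)
    return (site, "site-summary", "OK" if ok else "ERROR")

results = [
    ("dteam", "sft-job",    "ce1.badsite.org", "ERROR"),
    ("-",     "gstat-GIIS", "BADSITE-LCG2",    "OK"),
]
print(summarise_site("BADSITE-LCG2", results))
# ('BADSITE-LCG2', 'site-summary', 'ERROR') - matching the table above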
Metric - calculation
• Site availability:
  – for each site: 24 summaries per day, each with two possible values: 0 (ERROR) or 1 (OK)
  – summary values are integrated (averaged) over the whole day ⇒
    • 0 - the site was failing tests for the whole day
    • 0.75 - the site was passing tests for 18 hours and failing for 6 hours
    • 1 - the site was OK for the whole day
  – metric values of the sites in a single region are added together to calculate the value for the whole region
  – metric values of all regions are summed to get the overall metric for the grid
• CPU availability:
  – similar process, but the site value is multiplied by the number of CPUs at that site (published in the information system)
  – example: site A has 1000 CPUs but its daily site availability metric was 0.80 ⇒ its CPU availability metric is 800
  – interpretation: how many CPUs were effectively provided by a site/region/grid during the day?
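The same calculation as a small Python sketch; the snapshot pattern, CPU count and region grouping reuse the slide’s own example numbers, everything else is illustrative.

# Daily availability = average of the 24 hourly 0/1 summaries per site;
# sites add up to a region, regions add up to the grid; the CPU metric
# weights each site by its published CPU count.
def site_availability(snapshots):
    """snapshots: 24 hourly values, 1 = OK, 0 = ERROR."""
    return sum(snapshots) / len(snapshots)

def cpu_availability(availability, cpus):
    # e.g. 0.80 * 1000 CPUs -> 800 CPUs effectively provided that day;
    # the "potential" metric is simply cpu_availability(1.0, cpus)
    return availability * cpus

def aggregate(values):
    # site -> region and region -> grid aggregation is a plain sum
    return sum(values)

snapshots = [1] * 18 + [0] * 6          # passing for 18 h, failing for 6 h
print(site_availability(snapshots))     # 0.75, as in the slide
print(cpu_availability(0.80, 1000))     # 800.0, as in the slide
print(aggregate([0.75, 1.0, 0.0]))      # example region value: 1.75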
Metric - calculation (cont.)
• Long-term metrics:
  – metric results are archived in an R-GMA table (MetricData)
  – weekly, monthly and quarterly metrics are calculated by averaging the daily metric for each site/region/grid
• Potential CPU availability metric:
  – indicates how many CPUs could be provided by the site/region/grid if all tests were always passed
  – calculated by taking 1 as the value of the site availability metric for all sites
SFT Experience
• Reliability:
  – occasional failures - about once every 2 months: disk full, hardware failure
  – intervention time - around 1 day (time to fix)
  – BUT! Plans for a replicated service (CNAF), local monitoring support (CERN-FIO) and a move to more suitable hardware
• Scalability:
  – most problems caused by MySQL/R-GMA performance and the data schema - at most 1 month of full history is possible
  – history queries are the most expensive - slow response
• Security:
  – no publishing security at the moment: anyone can potentially publish any results
  – certificate-based authentication for access to the reports
  – BUT! No serious incidents observed - only a few cases of corrupted data due to testers’ mistakes
SFT Experience
• Maintenance:
  – main services running at CERN
  – SFT Admin Page for on-demand test submission running in Poznan (RB at CNAF, integration with the CIC Portal, Lyon)
  – about 3 people at CERN taking care of the main SFT Server and Client
  – additional clients maintained by VOs and ROCs
• Usefulness:
  – the main source of information for CIC-on-duty operations
  – CE monitoring only, but helpful in detecting problems with other services (R-GMA, SEs, file catalogs)
  – used by ROCs in the initial site certification process (on-demand submission to uncertified sites) - but not for software certification!!!
On-going development
• Goals:
  – extend to other services and metrics
  – move to production quality: Oracle DB, dedicated hardware, better scalability
• First steps:
  – refined data schema - only minor changes: naming convention, additional attributes (dataThreshold, etc.) - DONE
  – hardware/software decisions: Oracle DB, dedicated mid-range servers, programming languages - DONE
• Parallel development:
  – service sensors - distributed
  – extended framework (SAME/SFT) - CERN, Lyon
  – data storage and processing - CERN, GridView team
  – visualisation (metrics, operator’s display) - CERN, GridView team, Lyon
On-going development (cont.)
• Some guidelines:
  – small steps, immediately in production
  – concentrate on the concept, not on producing the world’s most complex software with thousands of lines of code
  – better to rewrite from scratch than to make the code even more obscure by trying to fix it
• SFT has been rewritten several times - this is just the next iteration!
Conclusions
• Currently only Computing Elements are really monitored - other services only indirectly
• The basic concept has remained unchanged for 1.5 years now
• The data schema is flexible enough to serve for monitoring other services
• No serious scalability problems observed
• More and more people are actively involved in SAME/SFT development