
WLCG Service Report

Page 1: WLCG Service Report

WLCG Service Report

[email protected] ~~~

WLCG Management Board, 13th July 2010


Page 2: WLCG Service Report

WLCG Operations Report – Summary


KPI                          Status                  Comment

GGUS tickets                 Numerous real alarms    Drill-down on real alarms;

Site Usability               Minor issues            Drill-down to be provided

SIRs & Change assessments    Several SIRs            …and quite a few pending…

VO        User   Team   Alarm   Total
ALICE        3      0       0       3
ATLAS       30     70       7     107
CMS          6      5       1      12
LHCb         4     25       1      30
Totals      43    100       9     152

The response to alarms – expert intervention & problem resolution – continues to be (well) within targets. Should we rather establish a metric related to the frequency and nature of such alarms? (Want to see progress – even if slow…)

Page 3: WLCG Service Report

[Site availability plots per experiment. Incident labels 0.1, 1.1, 1.2, 1.3, 1.4, 3.1, 4.1 and 4.2 mark the events analysed on the next page.]

Page 4: WLCG Service Report

Analysis of the availability plots

COMMON FOR ALL EXPERIMENTS
0.1 CERN-PROD: Castor-related problem exporting data from T0; all attempts to write to and read from T0 were timing out. The problem was traced to very high levels of logging. Logging daemon reset.

ATLAS
1.1 NDGF-T1: Scheduled downtime from 12:00 to 14:00. Upgrade of dCache on head nodes as well as OS patching. Aiming to keep actual outage much shorter, if all goes well.
1.2 Taiwan-LCG2: Temporary SRMv2 test timeout.
1.3 INFN-T1: SRM test terminated due to temporary communication error.
1.4 NIKHEF: SRM was overloaded with ls operations by a biomed user. Other users got timeouts. Fixed by asking the user to stop.

ALICE
Nothing to report.

CMS
3.1 KIT: Problem with CMS head nodes for dCache, down for about 3 hours. H/W failure.

LHCb
4.1 CERN-PROD: SAM tests have been failing against CERN for a week because a diskserver in the lhcbuser pool used for the tests has a filesystem problem.
4.2 CNAF: LHCb storage out due to network (switch) failures. Fixed around early morning.

Page 5: WLCG Service Report

[Site availability plots per experiment. Incident labels 0.1, 0.2, 1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2 and 4.1 mark the events analysed on the next page.]

Page 6: WLCG Service Report

Analysis of the availability plots

COMMON FOR ALL EXPERIMENTS
0.1 RAL-LCG2: Unscheduled outage. Site in downtime due to site-wide networking issue.
0.2 FZK-LCG2: GridKa had a complete power failure. Compute node down till Monday.

ATLAS
1.1 INFN-T1: Temporary test failures.
1.2 TAIWAN-LCG2: Temporary test failures.
1.3 RAL-LCG2: Some problems with ATLAS s/w server at the end of the week and into the weekend.

ALICE
2.1 FZK-LCG2: Momentary VOBOX-Proxy-Registration test failure.
2.2 NIKHEF: alice-box-proxyrenewal service test failed.

CMS
3.1 KIT: Temporary test failures.
3.2 CERN: Problems with srm-cern, which caused transfers to CERN to fail.

LHCb
4.1 NIKHEF: SRM outage. Extended until Monday; difficult to pinpoint and reproduce. Vendor suspects firmware issue.

Page 7: WLCG Service Report

GGUS summary (2 weeks)

VO        User   Team   Alarm   Total
ALICE        3      0       0       3
ATLAS       30     70       7     107
CMS          6      5       1      12
LHCb         4     25       1      30
Totals      43    100       9     152
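The totals in the table above can be cross-checked with a few lines of Python. This is only a minimal sketch: the counts are copied from the table, while the dictionary layout and the `summarize` helper are names chosen for this illustration, not part of any GGUS tooling.

```python
# Ticket counts per VO, copied from the table above: (user, team, alarm).
tickets = {
    "ALICE": (3, 0, 0),
    "ATLAS": (30, 70, 7),
    "CMS": (6, 5, 1),
    "LHCb": (4, 25, 1),
}


def summarize(tickets):
    """Return (per-VO totals, column totals) for the ticket table.

    The column totals are ordered (user, team, alarm, grand total).
    """
    per_vo = {vo: sum(counts) for vo, counts in tickets.items()}
    columns = tuple(sum(col) for col in zip(*tickets.values()))
    return per_vo, columns + (sum(columns),)


per_vo, totals = summarize(tickets)
print(per_vo)
print(totals)
```

Running it reproduces the "Total" column and the "Totals" row, so any transcription slip in the table would show up as a mismatch.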


Page 8: WLCG Service Report

04/22/23 WLCG MB Report WLCG Service Report 9

Support-related events since last MB

• The SIR by KIT for the 2010/05/12 .de DNS incident is still pending. Details in savannah:114518

• Prolonged infrastructure downtimes should IOHO be included as part of “WLCG prolonged downtime strategies” WLCG T1SCM

• The cases of failing GGUS email notifications to SARA and from CERN have been traced to parsing scripts at both locations and fixed. Successful ALARM test tickets GGUS:59769 and GGUS:59775 confirm this. Details in savannah:115137

• The GridKa cooling system failure incident of 2010/07/10 requires a SIR.

Page 9: WLCG Service Report

ATLAS ALARM->CERN CASTOR

•https://gus.fzk.de/ws/ticket_info.php?ticket=59441


What time UTC What happened

2010/06/28 05:00 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN

2010/06/28 05:05 Service mgr working on the problem.

2010/06/28 08:07 Pb traced down to excessive logging information. Service mgr puts ticket ‘solved’.

2010/06/28 09:26 Submitter puts ticket to status ‘verified’.
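As an illustration of how response and resolution times can be read off a timeline like the one above, here is a small Python sketch. The timestamps are copied from the table; the step labels (`opened`, `acknowledged`, `solved`, `verified`) and the helper name are assumptions made for this example and are not GGUS field names.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Timestamps (UTC) copied from the timeline above.
# The keys are labels chosen for this sketch only.
timeline = {
    "opened": "2010/06/28 05:00",
    "acknowledged": "2010/06/28 05:05",
    "solved": "2010/06/28 08:07",
    "verified": "2010/06/28 09:26",
}


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


response = minutes_between(timeline["opened"], timeline["acknowledged"])
resolution = minutes_between(timeline["opened"], timeline["solved"])
print(f"expert response after {response} min, solved after {resolution} min")
```

For this ticket the sketch gives 5 minutes to expert response and 187 minutes (3 h 07 min) to the 'solved' status.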

Page 10: WLCG Service Report

CMS ALARM->CERN AFS

•https://gus.fzk.de/ws/ticket_info.php?ticket=59547


What time UTC What happened

2010/06/29 18:45 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN

2010/06/29 18:49 Operator contacts Service mgr on the problem.

2010/06/29 21:43 Pb traced down to an afs disk array failure. Service mgr makes a reset and puts ticket ‘solved’.

2010/06/29 22:21 Submitter puts ticket to status ‘verified’.

Page 11: WLCG Service Report

LHCb ALARM->INFN-T1 STORM

•https://gus.fzk.de/ws/ticket_info.php?ticket=59643


What time UTC What happened

2010/07/02 08:09 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_Italy.

2010/07/02 09:40 Submitter says the problem went away. Reason was an unavailable 10Gbit link between 2 gridftp servers.

2010/07/02 14:11 Supporter puts ticket to status ‘solved’.

2010/07/08 10:36 Submitter puts ticket to status ‘verified’.

Page 12: WLCG Service Report

ATLAS ALARM->CERN CASTOR SRM

•https://gus.fzk.de/ws/ticket_info.php?ticket=59848


What time UTC What happened

2010/07/07 21:34 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN

2010/07/07 21:59 Submitter reports service degradation. Operator contacts service mgr on the problem.

2010/07/07 22:01 Service mgr confirms, in the ticket, on-going pb investigation.

2010/07/07 22:50 Developer finds a process stuck due to a rsyslog bug. Process restarted. Service mgr puts ticket to status ‘solved’.

Page 13: WLCG Service Report

ATLAS ALARM->CERN CASTOR SRM

•https://gus.fzk.de/ws/ticket_info.php?ticket=59850


What time UTC What happened

2010/07/08 12:32 GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN

2010/07/08 12:45 Submitter reports T0 data export problems. Same as GGUS:59848,59850. Operator contacts the service piquet.

2010/07/08 12:49 Service mgr investigating.

2010/07/08 12:59 … discovery of a rsyslog bug and its config. change as per above-mentioned tickets.

2010/07/08 15:07 Service mgr puts the ticket to status ‘solved’.

2010/07/08 16:34 [ Problems with GGUS<->Remedy ticket exchange to be followed up ]

Page 14: WLCG Service Report

Alarm Summary


Date     Site    Service
28/06    CERN    CASTOR
29/06    CERN    AFS
02/07    CNAF    StoRM
07/07    CERN    CASTOR SRM
08/07    CERN    CASTOR SRM

Site      Service Area      Frequency
CERN      Data / Storage    4
CNAF      Data / Storage    1
TOTALS    Data / Storage    100%
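The per-site frequency in the table above can be derived from the list of alarms, which is one simple way a frequency metric for alarms could be tracked from one report to the next. The sketch below is only an illustration: the alarm list is copied from the table, and the variable names are chosen for this example.

```python
from collections import Counter

# (date, site, service) for each real alarm, copied from the table above.
alarms = [
    ("28/06", "CERN", "CASTOR"),
    ("29/06", "CERN", "AFS"),
    ("02/07", "CNAF", "StoRM"),
    ("07/07", "CERN", "CASTOR SRM"),
    ("08/07", "CERN", "CASTOR SRM"),
]

# Number of alarms per site, and each site's share of all alarms in percent.
by_site = Counter(site for _, site, _ in alarms)
share = {site: 100 * count / len(alarms) for site, count in by_site.items()}

for site, count in by_site.most_common():
    print(f"{site}: {count} alarms ({share[site]:.0f}%)")
```

With the five alarms listed, this yields 4 for CERN and 1 for CNAF, matching the Frequency column.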

Page 15: WLCG Service Report

Summary

• Good response to GGUS alarms continues – frequency high (but bearable in the short term?) for support staff as well as for users…

• No significant reduction can be expected without an analysis of where the most impact could be achieved – and change – which comes with risk

• Good match between Site Usability plots and problems reported through daily meetings


Page 16: WLCG Service Report

Workshop Actions - Draft

• SIR template and MoU-based wording to categorize service degradation / downtimes;

• Monitoring;

• Prolonged site (service) downtimes;

• Squid “as a WLCG service”

• None of these are new items – most have been discussed explicitly at WLCG T1 SCM meetings earlier this year

• Need to review summary slides to ensure that list is exhaustive and prioritize – including matching against EGI InSPIRE SA3 manpower (now largely in place or agreed)

• Also proposed to hold daily WLCG operations meetings – chaired e.g. by a Tier1 – when CERN is closed (Jeûne Genevois etc.)


