
GGUS summary (4 weeks)

VO      User  Team  Alarm  Total
ALICE     10     0      1     11
ATLAS     33   169      7    209
CMS       11     7      2     20
LHCb       3    25      1     29
Totals    57   201     11    269
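As a quick sanity check, the row and column totals of the table above are internally consistent. A minimal sketch, using only the counts printed in the table (nothing here queries GGUS itself):

```python
# Counts copied from the GGUS summary table above (tickets per VO and type).
counts = {
    "ALICE": {"User": 10, "Team": 0,   "Alarm": 1},
    "ATLAS": {"User": 33, "Team": 169, "Alarm": 7},
    "CMS":   {"User": 11, "Team": 7,   "Alarm": 2},
    "LHCb":  {"User": 3,  "Team": 25,  "Alarm": 1},
}

# Per-VO totals (rightmost column of the table).
row_totals = {vo: sum(c.values()) for vo, c in counts.items()}
assert row_totals == {"ALICE": 11, "ATLAS": 209, "CMS": 20, "LHCb": 29}

# Per-type totals (bottom row of the table).
col_totals = {t: sum(c[t] for c in counts.values())
              for t in ("User", "Team", "Alarm")}
assert col_totals == {"User": 57, "Team": 201, "Alarm": 11}

# Grand total matches the table's 269.
assert sum(row_totals.values()) == 269
```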

04/19/23 WLCG MB Report WLCG Service Report 2

Support-related events since last MB

•A reminder of the TEAM tickets’ meaning and workflow for the Tier0 was presented at the 2011/03/17 T1SCM (slide available here). Their only advantage over ‘user’ tickets is the co-ownership of the ticket by all TEAMers; they do not imply a higher ‘importance’. GGUS triggers direct site notification by email for ‘user’ tickets as well, provided the ‘Notify site’ field is used.

•There were 6 real ALARM tickets since the 2011/03/08 MB (4 weeks): 5 submitted by ATLAS and 1 by CMS, with notified sites IN2P3 (1 ticket) and CERN-PROD (5 tickets). AFS performance became an issue for all experiments.

•The GGUS ALARM test suite was issued on 2011/03/30 (release date). A special GGUS-to-SNOW route entered production, allowing service managers to get direct ticket assignment in SNOW.

Details follow…
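The TEAM/user distinction and the ‘Notify site’ rule described above can be sketched as simple logic. This is a hypothetical illustration only: the function name, field names, and target labels below are made up for the sketch and are not real GGUS identifiers.

```python
# Hypothetical sketch of the notification/ownership rules described above.
# All names here (notify_targets, "notify_site", etc.) are illustrative,
# not actual GGUS fields or APIs.
def notify_targets(ticket):
    targets = set()
    # ALARM tickets page the site's alarm channel directly.
    if ticket["type"] == "ALARM":
        targets.add("site-alarm-email")
    # TEAM and 'user' tickets alike trigger direct site email,
    # but only if the submitter filled the 'Notify site' field.
    if ticket.get("notify_site"):
        targets.add("site-email")
    # TEAM tickets differ only in ownership: all TEAMers co-own them;
    # they do not imply a higher importance.
    owners = (ticket["team_members"] if ticket["type"] == "TEAM"
              else [ticket["submitter"]])
    return targets, owners

targets, owners = notify_targets(
    {"type": "TEAM", "notify_site": "IN2P3",
     "team_members": ["a", "b"], "submitter": "a"}
)
```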

ATLAS ALARM->IN2P3 DATA COPY FROM CERN FAILS GGUS:68794


What time (UTC)  What happened

2011/03/19 19:35 (Saturday)  GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to NGI_France.

2011/03/19 19:40  Automatic email acknowledgement of ALARM registration.

2011/03/19 19:54  Service manager identifies a problem with SRM.

2011/03/19 21:16  Service manager suggests putting the site at risk, as the SRM database problem persists and is not understood.

2011/03/19 22:15  ATLAS stops using the site for the rest of the weekend.

2011/03/20 12:38  Site reports things are better now.

2011/03/21 08:09  Ticket set to ‘solved’. As IN2P3 reported on Monday, a Friday intervention was the cause of this incident.

ATLAS ALARM->CERN LSF NO JOB ACCEPTED GGUS:68795


What time (UTC)  What happened

2011/03/19 21:22 (Saturday)  GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.

2011/03/19 21:40  Operator acknowledges and records in the GGUS ticket that [email protected] were contacted.

2011/03/19 23:05  CMS expert comments in the GGUS ticket that a user submitted 180K jobs by mistake.

2011/03/20 05:34  Service manager sets the ticket to ‘solved’ once the number of queued jobs was reduced.

2011/03/20 06:11  Submitter sets the ticket to status ‘verified’. In the days following the incident, a limit on the number of jobs was added in LSF to avoid such blockage in the future.

ATLAS ALARM->CERN CASTOR DOWN GGUS:68949


What time (UTC)  What happened

2011/03/25 11:59  GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.

2011/03/25 12:05  Operator acknowledges and records in the GGUS ticket that the CASTOR piquet was contacted.

2011/03/25 12:15  Expert on call records in the ticket that the problem is understood and fixed (it also affected CMS).

2011/03/25 14:09  Service manager sets the ticket to ‘solved’ with description: ‘Incident caused by an incorrect configuration that was loaded at the wrong time. A mistake made as part of the SL5 upgrade.’

CMS ALARM->CERN CASTOR DOWN GGUS:68952


What time (UTC)  What happened

2011/03/25 12:32  GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.

2011/03/25 12:37  Operator acknowledges and records in the GGUS ticket that the CASTOR piquet was contacted.

2011/03/25 12:41  Expert on call records in the ticket that the problem is understood and fixed (as per ATLAS GGUS:68949).

2011/03/25 14:42  Service manager sets the ticket to ‘solved’. The reason was human error; details in slide 5.

2011/03/25 14:51  Submitter sets the ticket to ‘verified’. He had already dropped the ticket priority at 12:37, as the problem went away quickly.

ATLAS ALARM->CERN AFS NOT RESPONDING GGUS:69121


What time (UTC)  What happened

2011/03/29 11:13  GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.

2011/03/29 11:26  Operator acknowledges and records in the GGUS ticket that email was sent to the AFS service.

2011/03/29 13:43  Service manager sets the ticket to ‘solved’. The reason was a hardware failure that rendered 3 partitions and 110 ATLAS volumes inaccessible.

2011/03/29 13:49  Submitter sets the ticket to status ‘verified’.

ATLAS ALARM->CERN AFS S/W REL. AREA UNAVAILABLE GGUS:69192


What time (UTC)  What happened

2011/03/31 07:25  GGUS ALARM ticket, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. No entry by the operator in the ticket!! The operator may have forgotten to record the call.

2011/03/31 08:39  Service manager records in the ticket that investigation has started.

2011/03/31 08:39  Experiment member complains in the ticket about the frequency of AFS problems.

2011/03/31 10:46  AFS expert records: ‘problem found on server afs151: device mapper s/w RAID layer was stuck in a loop after a h/w error, blocking all I/O’.

2011/03/31 12:46  Service manager sets the ticket to status ‘solved’.

2011/03/31 15:25  Submitter sets the ticket to status ‘verified’.

