Introduction
• This report covers the service since last week’s MB
• GGUS ticket rate back to normal. A few unscheduled interventions occurred during this period, but no serious events that triggered Service Incident Reports
GGUS Summaries
VO concerned  USER  TEAM  ALARM  TOTAL
ALICE            1     0      0      1
ATLAS           20    25      1     46
CMS              3     0      0      3
LHCb             8     3      1     12
Totals          32    28      2     62
[Figure: GGUS tickets per VO (ALICE, ATLAS, CMS, LHCb), 7-Jan to 7-May; vertical axis 0 to 60 tickets]
Once again sites (e.g. IN2P3) stress that experiments should submit GGUS tickets
ATLAS alarm to FZK - details
• Problem: (2009-04-16 18:07) Disk buffer in front of ATLASMCTAPE in FZK is full
• Detailed description: The buffer in front of ATLASMCTAPE in FZK is full. There are errors starting at approx. 20:00 like
[FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] at Thu Apr 16 19:40:08 CEST 2009 state Failed : space with id=331572 does not have enough space] Source Host [se-goegrid.gwdg.de]
The problem was reported already yesterday and was due to tape migration being interrupted. I should also mention that a 1TB buffer in front of tape for both MCDISK and DATADISK for a 10% ATLAS T1 seems a bit low to me.
• Solution: (2009-04-17 06:10) Hi,
yes, the space token got filled up, but there was no technical or software problem, as far as I can see. According to our own monitoring graphs ( http://grid.fzk.de/monitoring/dcache_spacetokens.php ) data came in faster than it could be flushed to tape. Whether this is a matter of too small space token capacity or too slow tape writing speed has to be discussed elsewhere.
Kind regards, Xavier.
ATLAS alarm to FZK - details
FZK follow-up comments of Andreas Heiss on 20/4
• “On Thursday late afternoon, around 16:20 local time, Atlas indeed sent an email requesting the increase of the spacetoken ATLASMCTAPE, but unfortunately all receivers of this email had already gone home or were on holidays. At that time, the tape backend was working with low efficiency and writing only about 120MB/s, which was not sufficient. At 18:07 UTC (20:07 local), Atlas sent an alarm ticket which triggered an SMS to my mobile phone at 21:02 local time (unclear where the delay of 1h came from), which I unfortunately did not see until 23:45 local time, since the phone was lying in a 'dead spot' where it didn't get the mobile network. I opened the ticket (GGUS id 47937) and saw the quoted error message concerning failing FTS transfers on a T2->T1 channel. That's why I wrote into the ticket that I do not agree to open an alarm ticket because of failing T2->T1 transfers. However, the main problem for Atlas was obviously not the failing T2->T1 transfers but failure of many local jobs which could not write their output to dCache.”
• “However, independent of this incident, I think we should discuss at some point, under what circumstances an alarm ticket is ok and if there are problems where alarm tickets are not ok (e.g. failing MC data transfers or MC production jobs). We heard at the WLCG workshop that most sites trigger mobile phones of more than one person when an alarm is coming in. GridKa will also do so soon. So, a problem which justifies e.g. waking up several people at night should be _really_ a severe one. I don't want to start an email discussion, but maybe we can put that point on the agenda of e.g. a GDB?”
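The figures quoted in this incident (a 1 TB buffer in front of tape, about 120MB/s tape write rate) give a feel for how quickly such a buffer can fill. A minimal sketch, assuming the buffer starts empty and the rates are constant; the inflow rates are hypothetical values chosen for illustration, not measurements from FZK, and the helper name `hours_to_fill` is ours:

```python
def hours_to_fill(capacity_tb, inflow_mb_s, drain_mb_s):
    """Hours until an initially empty buffer is full, or None if it never fills.

    capacity_tb : buffer size in TB (decimal units, 1 TB = 10**6 MB)
    inflow_mb_s : rate at which data arrives in the buffer (MB/s)
    drain_mb_s  : rate at which data is flushed to tape (MB/s)
    """
    net_mb_s = inflow_mb_s - drain_mb_s
    if net_mb_s <= 0:
        return None  # tape keeps up with the incoming data
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / net_mb_s / 3600


# 1 TB buffer and 120 MB/s drain are taken from the ticket and follow-up;
# the inflow rates below are assumptions for illustration only.
for inflow in (100, 150, 200):
    print(inflow, "MB/s in ->", hours_to_fill(1.0, inflow, 120.0), "hours")
```

With a hypothetical inflow of 200 MB/s the net fill rate is 80 MB/s, so an empty 1 TB buffer would be full in about 3.5 hours; any inflow at or below the tape rate never fills it.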
LHCb alarm to FZK - details
• Problem: (2009-04-11 18:36) All LHCb jobs to ce-2-fzk.gridka.de failing
• Detailed description:
The command "glite-brokerinfo getCE" is failing with the following error:
glite-brokerinfo: error while loading shared libraries: libclassad_ns.so.0: cannot open shared object file: No such file or directory
The command is used to determine where the pilot is running.
• Solution: (2009-04-14 08:25) We have added the missing library and installed a missing package. Please let us know if you still have problems with the new 64-bit SL5 WNs.
Regards, Angela
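A missing shared library of this kind can be spotted on a worker node by scanning the output of `ldd` for "not found" entries. A minimal sketch, assuming the usual glibc `ldd` output format; the helper name `missing_libs` and the sample text (including the libc path and address) are illustrative and not taken from the ticket:

```python
def missing_libs(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Resolved and unresolved entries both look like "name => target"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip().startswith("not found"):
            missing.append(name)
    return missing


# Made-up sample in ldd's format; only the library name comes from the ticket.
sample = """\
\tlibclassad_ns.so.0 => not found
\tlibc.so.6 => /lib64/libc.so.6 (0x00002b5d7f3c1000)
"""
print(missing_libs(sample))
```

On a real node the text would come from running `ldd` on the binary in question, for example via `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`.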
Poor availability for LHCb at several sites
LHCb problems reported:
• WMS submission failures were traced to problems with the short CRL for certificates created by the CERN CA (thanks to Michel Jouvin, GRIF). Fixed
• CNAF BDII publishing wrong information, making match-making impossible. Fixed
• CERN CVS system failing: Fixed?
• The job failures yesterday at Nikhef and IN2P3 are now explained by the "root:" string prepended to the returned tURL. ??
• The problem of jobs crashing when accessing the LFC@CERN is still under investigation, but it seems that the thread pool in LFC becomes exhausted due to the way CORAL is accessing it. Understood? (“seems to be due to the suboptimal access of LFC from CORAL”)
Various site issues/highlights
• Some confusion about the effect of using ‘At Risk’ for transparent interventions. Not counted as site downtime
• BNL: degraded efficiency due to a large number of tape staging requests from the ATLAS production tasks (pile-up and HITS merging). This causes a high load on the dCache/pnfs server, resulting in an unacceptably high failure rate for DDM transfers.
• CERN: Good news on Castor Oracle BIGID problem (next slide)
Good news on Castor Oracle BIGID problem
• From https://savannah.cern.ch/support/?106879
• After joint work with Sebastien and excellent feedback from some people from Oracle, including Oracle development, it now looks clear that the problem is linked with the usage of "DML returning" statements accessed from OCCI. Basically it works for a single row, but with different types and combinations of single row / multiple rows it can “not work” and lead to issues like the Big Id issue.
• Oracle has opened a documentation bug (public and accessible with a Metalink account) about the issue: OCCI does not support 'returning into' …
[ Important to stress collaborative work of many people / several groups/teams ]
Summary
• After a quiet Easter week, the number of GGUS tickets is back at its ‘normal’ level
• A number of site issues reported for ATLAS and LHCb
  • FZK (both ATLAS, LHCb)
  • CNAF (LHCb)
  • Some LHCb issues also at SARA, PIC, NIKHEF, IN2P3 and CERN (LFC)
• Some problems solved (or almost)
  • WMS submission failing for LHCb certificates at some sites
    • Related to the short CRL for certificates created by the CERN CA
  • CASTOR “Big-Id” problem understood