Post on 25-Dec-2015
Peter Kreuzer, RWTH Aachen/CERN
Oliver Gutsche, Fermilab
CMS Computing Shift Personnel
(CSP) Tutorial
10. January 2011
CMS Computing Shift Personnel (CSP) Tutorial, 01/10/11
Tutorial Structure
‣ Today :
‣ Brief Introduction to CMS Computing
‣ General Description of Computing Shift Procedure
‣ Subscription to the CMS Computing E-Log
‣ Organization of Vidyo access from local CMS center
‣ Questions
‣ After this tutorial and >= 2 months prior to 1st shift :
‣ New shifters go through the Shift Procedure and shadow experienced CSP by taking “passive” shifts (only E-log reports, NO alarms)
‣ After 2 “passive” shifts :
‣ Sign off by Peter/Oli
‣ Full participation as CSP
‣ Possibility to sign-up via the WEB
Overview of the CMS Distributed Computing System
‣ Multi-tiered distributed computing infrastructure based on GRID technologies for resource access and data movement
‣ Many new challenges compared to established HEP experiments:
‣ Data distribution, user localization, site monitoring, support responsibilities
Tier-0 / CAF
‣ Data archival (cold copy)
‣ Prompt reconstruction
‣ Time critical calibration & alignment
Tier-1
‣ Data archival (hot copy)
‣ Reprocessing, skimming, MC production
‣ Data serving
Tier-2
‣ Centralized Simulation
‣ Distributed Data Analysis
Resources
‣ Processing resources: Tier-1 level ~35k jobs/day, Tier-2 level ~100k jobs/day
‣ Transfer rates (figure labels): 300 MB/s, 600 MB/s; Down: 50-500 MB/s bursts, Up: 20 MB/s sustained
In total: 7 Tier-1 across 3 continents, ~50 Tier-2 across 4 continents
CSP Role and Required expertise
‣ The CSP mainly monitors systems and raises alarms
‣ monitor computing infrastructure and services at checkpoint hours by going through a set of checklists
‣ identify problems
‣ Create E-Log reports
‣ trigger actions
‣ open Savannah tickets, in particular to CMS Sites
‣ contact CRC, Core Computing Operators & Experts, Computing Experts On Call
➞ We are working on making the CSP role even more active in problem trouble-shooting
‣ Required expertise of the CSP
‣ Fair understanding of CMS distributed computing infrastructure + services required for data processing, transfers and analysis
‣ Physicist or technician from a collaborating CMS institute
‣ Tutorial + 2-3 assisted “passive” shifts
CMS policy for Computing Shifts
‣ The Computing Shifts are accounted within standard MoA service work defined by CMS (as Central CMS Shifts), see http://cms.cern.ch/iCMS/admin/moamanagement
‣ Standard requirement : 8 points per author per institute
‣ 1 CSP shift == 0.75 points on weekdays / 1.25 points on week-ends
‣ no extra credit for night shifts since covering all time zones
‣ (special arrangements not excluded)
‣ During Data taking computing shifts are carried out :
‣ From Main CMS Centres : CMS CC or FNAL/ROC
‣ From Remote CMS Centres : see http://lucas-nice.web.cern.ch/lucas-nice/cms-centre/www/CMS-Centres-Worldwide.pdf
‣ In 8-hour shifts (09-17/17-01/01-09), with 1 CSP per shift
‣ With the support of a Computing Run Coordinator who is on duty at CERN during 1 week periods
‣ With the support of CMS Core Computing Operators & Experts
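The credit rule on this slide is simple arithmetic, and a short sketch can make it concrete. The function names below are hypothetical, and the sketch assumes that the 8-point requirement would be covered by CSP shifts alone, which the slide does not state.

```python
from math import ceil

# Credit values quoted on the slide; treated here as fixed constants.
WEEKDAY_POINTS = 0.75
WEEKEND_POINTS = 1.25
REQUIRED_POINTS = 8.0  # standard requirement per author (from the slide)


def shift_points(weekday_shifts: int, weekend_shifts: int) -> float:
    """Total service-work points earned from a number of CSP shifts."""
    return weekday_shifts * WEEKDAY_POINTS + weekend_shifts * WEEKEND_POINTS


def weekday_shifts_needed(points: float = REQUIRED_POINTS) -> int:
    """Smallest number of weekday-only shifts that reaches `points`."""
    return ceil(points / WEEKDAY_POINTS)
```

For example, four weekday shifts plus four week-end shifts add up to exactly 8 points under these values.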
Other Roles & interactions with CSP
‣ Computing Run Coordinator (CRC)
‣ Subscribes to all CSP E-log sub-sections
‣ Assists CSP in raising alarms/tickets for complex cases
‣ Calls EOC during off-working hours (see below)
‣ Core Computing Operator or Expert (FacOps, DataOps, AnaOps)
‣ Subscribes to relevant CSP E-Log sub-sections
‣ Supports CSP during working hours
‣ Computing Expert On Call (EOC)
‣ Responsible for a particular service
‣ Alarmed by CSP via Email/IM/Tel during working hours
‣ Alarmed by CRC if really needed off-working hours
‣ CMS Site Contact Person
‣ Responds to alarms (e.g. Savannah, GGUS tickets)
‣ Other shifters (DQM, Online, Detector, …)
‣ In temporary absence of CRC, the CSP is the Core Computing contact for any shifter at P5/CMS Center/FNAL ROC
‣ CSP procedure responsible
‣ Assigns CSP shifts
Prerequisites
‣ The CSP should
‣ be a CMS member
‣ if you are not, please fill in the WEB registration form http://cms.cern.ch/iCMS/jsp/secr/reg/reg.jsp
‣ After the form has been submitted, an email is sent to your Institute Representative (Team Leader) for approval
‣ If you have never been to CERN, it is necessary to send a copy of your passport to Anastasia Dolya, CMS Secretariat, CERN - PH Department, CH -1211 Geneva 23, Switzerland
‣ have a CMS Computer account
‣ for the Computer account, please contact CMS.Computing@cern.ch
‣ have a Hypernews account
‣ have a GRID certificate + CMS VO registration
‣ Please follow the link https://twiki.cern.ch/twiki/bin/view/CMS/WorkBookRunningGrid#Get_a_Grid_certificate_and_the_r for a guideline on how to proceed
Most important CSP tools
‣ Main CSP Shift Instructions
‣ https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts
‣ Vidyo connection to the Tandberg system (other CMS Centres)
‣ https://vidyoportal.cern.ch/
‣ Shift Sign-Up tool
‣ http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Shiftlist/daily
‣ Instant Messenger under “FacOpsShifter” account
‣ https://twiki.cern.ch/twiki/bin/view/CMS/InstructionsForAIMForComputingShifters
‣ Computing Plan of the Day
‣ http://cmsdoc.cern.ch/cmscc/shift/today.jsp
‣ Account in the CSP E-log
‣ https://prod-grid-logger.cern.ch/elog/
‣ Savannah account ( “cmscompinfrasup” member) for opening tickets
‣ https://savannah.cern.ch/projects/cmscompinfrasup/
‣ Membership in e-group cms-csp-shifters@cern.ch
‣ subscribe via https://e-groups.cern.ch/e-groups/EgroupsSearch.do
Shift Subscription tool
‣ http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Shiftlist/ShiftSelection?shift_type=25
‣ Shift selection : Blue == available on any slot that day / Green == available on a particular slot that day
‣ Preferably, please always check the Green box corresponding to your time zone slot to avoid being approved for other time zones
‣ Warning : when selecting Green, Blue gets automatically selected as well, so please deselect Blue to avoid confusion
Shift Subscription policies
By end 2010, we actually have more demand for shifts than available slots (95 potential shifters !), so approvals need to follow stricter policies :
➡ shift requests can be made anytime for any open shift period
➡ shift approvals will follow a monthly schedule, where shifts are approved two months in advance to allow for a reasonable planning horizon for all shifters
- example : all shift requests for January are reviewed at the beginning of November, the shift requests are balanced between the different groups/regions and shifts are approved
➡ In the monthly approval process, we would like to follow this procedure:
- shift requests from shifters in their own time zone have priority
- within a time zone, balance shift requests first on group/institute level, then on the level of individual shifters
➡ We are also regularly publishing the CSP shift planning and accounting tables, per time zone, per group and per shifter, see next slide.
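The approval policy above can be sketched in a few lines. This is only an illustration under assumptions: the `ShiftRequest` type and both function names are hypothetical, and "balance" is read here as "prefer the group, then the shifter, with the fewest shifts approved so far", which is one possible interpretation and not the actual tool used for approvals.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftRequest:
    shifter: str
    group: str           # group/institute of the shifter
    own_time_zone: bool  # True if the slot lies in the shifter's own time zone


def review_month(shift_month: int) -> int:
    """Month (1-12) in which requests for `shift_month` are reviewed: two months earlier."""
    return (shift_month - 3) % 12 + 1


def rank_requests(requests, approved_so_far):
    """Order requests: own time zone first, then least-served group, then least-served shifter."""
    per_group = Counter(r.group for r in approved_so_far)
    per_shifter = Counter(r.shifter for r in approved_so_far)
    return sorted(
        requests,
        key=lambda r: (not r.own_time_zone, per_group[r.group], per_shifter[r.shifter]),
    )
```

With this reading, `review_month(1)` gives 11, matching the January/November example on the slide.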
CSP Planning and Accounting
‣ https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShiftContacts#CSP_Planning_and_Accounting
Example for European time zone :
The CMS Computing Logbook
‣ https://prod-grid-logger.cern.ch/elog/
‣ 2 (unpleasant) features :
‣ need to enter your elog password the first time you access a given section
‣ need to regularly re-load your browser to see updates
The Savannah ticketing tool
‣ main tool to communicate with sites and DataOps/FacOps/AnaOps to solve infrastructure problems
‣ Savannah Instructions for CSP :
‣ https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts
Submit a ticket
Savannah
‣ Category: mostly SAM tests, Job Robot, Data transfers, ...
‣ Severity: You judge !
‣ Privacy: “Public”
‣ Assigned to: either DataOps, FacOps, AnaOps or T1/T2 site squad
‣ Use GGUS: YES for T1s, NO for T2s
‣ Site: T1/T2 site squad
‣ Subject: if connected to a specific site, begin with [SITE]
‣ Example: [T1_US_FNAL]
‣ For Tier-1, please systematically bridge to GGUS (WLCG ticketing) via Use GGUS: Yes
‣ More information about that here : https://twiki.cern.ch/twiki/bin/view/CMS/FacOpsSavannahGGUS
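The ticket conventions above (subject prefix, privacy, GGUS bridging for Tier-1 only) can be summarized in a small helper. This is a sketch, not a Savannah API: the function name is hypothetical, the dictionary keys simply mirror the form labels on the slide, and the tier is assumed to be the part of the site name before the first underscore, as in T1_US_FNAL.

```python
def savannah_ticket_fields(site: str, problem: str) -> dict:
    """Suggested values for the Savannah form fields of a site ticket."""
    tier = site.split("_", 1)[0]  # "T1" or "T2", assuming names like T1_US_FNAL
    if tier not in ("T1", "T2"):
        raise ValueError(f"unexpected site name: {site!r}")
    return {
        "Subject": f"[{site}] {problem}",            # begin with [SITE]
        "Privacy": "Public",
        "Use GGUS": "Yes" if tier == "T1" else "No",  # bridge to GGUS for Tier-1 only
        "Site": site,
    }
```

Severity is deliberately left out, since the slide leaves it to the shifter's judgement.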
The Vidyo interface
‣ We have set up a permanent Vidyo ➞ MCU video bridge
‣ Connects to the permanent video feed between the main CMS Centers and P5
‣ Remote shifters can be in direct contact with shifters at CMS CC, P5 and FNAL ROC
‣ To avoid having too many connections, only one CSP shifter is allowed to be connected at any time
‣ CSP has to log on at the beginning of shift and log off at end
‣ Every remote CMS Center needs a Remote Video Admin (to connect to MCU) :
‣ Responsible to check that system is used properly and holding the connection details
‣ Vidyo-capable PC (Windows and Mac clients OK, Linux client still in Beta)
‣ Sites with existing “Tandberg” or “Polycom” devices will be connected to MCU directly
Checklist 1 : Core
‣ CERN/Core infrastructure monitoring :
‣ Main checks: CERN/IT SSB, CMS Service Gridmaps, CMS Services scheduled upgrade, CASTORCMS instances
Checklist 2 : Tier-0
‣ Tier-0 workflows monitoring :
‣ Main checks: Storage Manager, T0Mon, tier0export pool, networking, batch/LSF farm, jobs
Checklist 3 : CAF
‣ CAF workflows monitoring :
‣ Main checks: free space/usage per CAF stakeholder on cmscaf pool, networking, batch/LSF farm, jobs
Checklist 4 : Data Transfers
‣ Distributed Data Transfer monitoring :
‣ Main checks: Queue-based monitoring for Tier-1s (not for T2s), Status of PhEDEx agents at sites
Soon obsolete, see next slide
New Checklist 4 : Data Transfers
‣ Distributed Data Transfer monitoring. Main checks :
‣ Status of PhEDEx agents at sites
‣ Queue-based monitoring for Tier-1s (not for T2s)
‣ This new tool will be tested with shifters during November and deployed by end of 2010, replacing the existing tool.
Checklist 5 : Grid Sites
‣ Distributed Grid sites monitoring :
‣ Main checks: SAM, JobRobot, Downtimes, Commissioning links, Savannah
Checklist 5 : Grid Sites
‣ Important
‣ CSP is asked to investigate the problem in as much detail as possible
‣ This helps the admin who receives the Savannah ticket to solve the problem quickly and easily
‣ DON’T REPORT THAT SITE X HAS A MEDIUM SIZE RED BALL!!!
‣ Report that site x shows failures in the <to be filled> SAM test
‣ In the body, investigate further what the problem is by clicking through the information provided till you reach the detailed error report
Checklists 6&7 : T1/T2 workflows
‣ Tier-1 workflows monitoring :
‣ Main checks: not covered so far, currently relying on T1 admins, T1 coordinators, DataOps
‣ Plan to add ProdMon/Dashboard monitoring + GlideIn Fabric monitoring
‣ Tier-2 workflows monitoring :
‣ Main checks: not covered so far, currently relying on T2 admins, T2 coordinators and CRAB support team
‣ Plan to collaborate with AnalysisOps monitoring
‣ Plan to add ProdMon/Dashboard monitoring
CAF monitoring
• Free space on the CMS CAF disk starts to shrink for an unexpected reason
• CSP instructions (CAF) : If the fraction of free space on cmscaf as shown in URL1 goes below 10%, this was not already mentioned in the Computing Plan of the Day, and no Savannah ticket is already open, open an ELOG in the "CAF" category
• If there is no detection/alarm by the CSP, the free space might shrink to 0, with the consequence that the critical Tier-0 to CAF data flow breaks
• This really happened ! …and some uncontrolled emergency data flushing on the CAF had to be done ➞ WORST CASE SCENARIO !
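The CAF instruction above is a three-part condition, which a short sketch can spell out. The function name and its inputs are hypothetical: in practice the free-space numbers are read by eye from the monitoring page, and the two flags come from the Computing Plan of the Day and from Savannah.

```python
CAF_FREE_THRESHOLD = 0.10  # 10% free space on cmscaf, as stated in the CSP instructions


def caf_needs_elog(free_bytes: float, total_bytes: float,
                   in_plan_of_day: bool, ticket_open: bool) -> bool:
    """True if the shifter should open an ELOG entry in the "CAF" category."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    # All three conditions of the instruction must hold.
    return (free_fraction < CAF_FREE_THRESHOLD
            and not in_plan_of_day
            and not ticket_open)
```

Note that the check is strict: exactly 10% free does not trigger an entry under this reading of "goes below 10%".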
Computing Plan of the Day
• Note : 3 Russian sites in downtime !
Grid Site Monitoring
• Example CMS Site Status Board :
• JINR in Scheduled downtime
• Ignore Waiting Room
• T2_CN_Beijing shows a red ball ! Known by Comp. Plan of Day? No ! So what to do ?
Grid Site Monitoring
‣ Investigate further:
‣ Click on link next to “red ball”
‣ Check the different problem categories and even drill further down to check for the real problem
‣ Report in E-log
‣ Advanced CSP can open Savannah ticket to site
‣ Subject should include: [SITE] and a short description of the problem that is as specific as possible
‣ Do not only mention that the site has a “red ball” !!!
‣ Ticket should contain as many details as found out during investigation
Other news on GRID site monitoring
• “lens symbol” == already known issue. NO Elog/ticket needed (but check that it is still the same problem)
• “At work symbol” == Site scheduled downtime. NO Elog/ticket needed
Note : Unscheduled downtimes are not yet marked with the “At work symbol”, so double-check with the Computing Plan of the Day and with CMS Google Downtime Calendar (see next slide) before opening Elog/ticket.
• If a T1 shows a red, small ball, the CSP should open an Elog/Savannah almost immediately (within 1-2h)
• If it is a T2, follow the instructions on when/how to open an Elog/Savannah
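The symbol rules above amount to a small decision table. The sketch below is one way to write it down; the `Symbol` enum, the function name and the returned strings are all hypothetical, and `known_downtime` stands for the manual double-check against the Computing Plan of the Day and the CMS Google Downtime Calendar.

```python
from enum import Enum


class Symbol(Enum):
    LENS = "lens"          # already known issue
    AT_WORK = "at work"    # scheduled downtime
    RED_BALL = "red ball"  # unexplained problem


def site_action(site: str, symbol: Symbol, known_downtime: bool = False) -> str:
    """Suggested CSP action for a site on the Site Status Board."""
    if symbol is Symbol.LENS:
        return "no Elog/ticket needed, but check it is still the same problem"
    if symbol is Symbol.AT_WORK or known_downtime:
        return "no Elog/ticket needed"
    if site.startswith("T1_"):
        return "open Elog/Savannah within 1-2h"
    return "follow the T2 instructions"
```

For instance, `site_action("T1_US_FNAL", Symbol.RED_BALL)` suggests opening an Elog/Savannah within 1-2h, while the same symbol on a site with a confirmed unscheduled downtime needs no new entry.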
Other news on GRID site monitoring
CMS Google Downtime Calendar
PhEDEx Components Status Page
• All Russian T2s have had their PhEDEx components down for ~3h. What to do ?
• Check Computing Plan of the Day!
Where we stand and where we go
‣ Summer 08: CMS Computing shift procedures created
‣ Fall 08: introduced the concept of Computing Shift Person (CSP) and Computing Run Coordinator (CRC)
‣ Winter 08: ~100 shifts done by pool of ~30 computing experts at CMS Centre@CERN & FNAL/ROC
‣ 2009: CSP shifts covered by CMS collaborators at remote CMS Centres
‣ Pool of 45 CSPs from 3 time-zones (Asia, America, Europe)
‣ CMS Centres : Beijing, Rio, Sao Paulo, Texas Tech, Univ. of Florida, Aachen, DESY, FNAL, CERN
‣ 2010: extend above philosophy
‣ Pool of 70 CSPs (new remote Centres: GridKa, INFN Bologna, ... )
‣ Encourage strong remote teams who can provide local CSP support
‣ Strengthen role of CSP in trouble-shooting issues
‣ Enforce 24/7 coverage of critical services in shift procedures
‣ Move away from “Twiki” to DQM-like monitoring (in progress)
Critical Services and Sites
• We are currently revising the Criticality Level of all CMS services
• CSP instructions will be adapted accordingly
– Frequency of checks
– List of experts to contact
– Type of alarm : Elog, Savannah, telephone to CRC (who might raise GGUS alarm or call Expert on Call)
• As a general rule : the closer you are to the detector data stream, the more critical :
– Tier-0 : processing and storage
– CAF : processing and storage
– Central Services at CERN (Core) : DBS, PhEDEx, …
– Tier-0 – Tier-1 transfers
– Tier-1 Site Availability
➞ Please pay special attention to these workflows
• And always read the Computing Plan of the Day carefully
24/7 Critical Services & Sites Coverage (II)
[Flow chart; box labels as they appear on the slide: Service/Facilities Monitoring; CSP checks every 2 hours; Status Green ? (Yes/No); E-LogBook & Ticketing tool; Expert answer within 1 hour ? (Yes/No); Service/Site Alarm Procedure; Expert Computing Operations; Problem solved ? (Yes/No); Core System Alarming. Actors: CSP, CRC, CERN/IT]
• Computing Run Coordinator (CRC) reachable 24/7 for :
– Critical Service recovery procedure
– Priority (GGUS-Team) ticket to site
• CMS Core Computing experts / CMS Site admins (*) :
– Apply routine service / infrastructure operations and monitoring
– Respond as On-Call Experts to Alarms
(*) CMS has dedicated site-contacts and site-admins
(**) highly critical alarms to Tier-0/1s are sent via GGUS-Alarm tickets and can trigger phone calls
(***) CRC, Service Expert or Site Admin actions are systematically reported back to the E-LogBook or Savannah or GGUS, for transparency purposes.
What CSP should always do ?
• Subscribe to CSP shifts well in advance (> 1 week). If you need to cancel, consult P.Kreuzer/O.Gutsche AND remove the shift subscription
• Carefully read the Computing Plan of the Day and keep an eye on it during the whole shift. If Plan missing, read report by previous shifter and complain via AIM or email to CRC
• Always connect to the instant messenger CSP account “FacOpsShifter”. When leaving the shift desk, inform outside world by changing status of messenger (e.g. to “away for lunch”)
• When reporting an issue in the proper Elog section, provide details of the observed problem (not just the link)
• Regularly read Elog responses or announcements by CRC or Computing Experts, in all Elog sections (reload browser !)
• Write detailed final shift reports in Elog; even if nothing new has occurred during shift, report on main open issues
• Once trained (2-3 passive shifts), open Savannah tickets in case of well identified site issue, by carefully following the instructions http://twiki.ihep.ac.cn/twiki/bin/view/CMS/Savannah
What CSP should never do ?
• Ignore a suspicious problem because it is too complex to understand. Solution : inform CRC or Computing experts via Elog
• Open a Savannah ticket without following the CSP instructions to identify a site problem (PhEDEx Component, SAM), or while confused about an observed problem. Solution : consult CRC, Computing Experts via Elog
• Cancel shifts or be replaced without reporting it. Solution : inform the shift responsible in advance and cancel the subscription in the shiftlist
Passive shifts
‣ Passive shifts
‣ Go through already signed up shifts and determine CSP time slot for doing passive shifts
‣ Contact CSP shifter and check if she/he is willing to act as passive shift host
‣ Confirm with O.Gutsche/P.Kreuzer
‣ Shift Subscription
‣ Once passive shifts done, subscribe to shifts (ideally 2 months in advance) via http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Shiftlist/ShiftSelection?shift_type=25
Subscriptions
‣ Assumption:
‣ Shifter already has CERN account and HyperNews account
‣ Sign up for elog access:
‣ https://prod-grid-logger.cern.ch/elog/
‣ Sign up for e-group cms-csp-shifters@cern.ch
‣ https://e-groups.cern.ch/e-groups/EgroupsSearch.do
‣ Sign up for correct Savannah access to write tickets:
‣ Log in to Savannah (CERN afs login)
‣ https://savannah.cern.ch/my/groups.php
‣ under "Request for inclusion" type "CMS" and "search", this will display all groups, then click on "CMS Computing Infrastructure Support"
‣ Peter & Oli will approve the request
‣ Get a valid Grid Certificate and CMS VO registration
‣ https://twiki.cern.ch/twiki/bin/view/CMS/WorkBookRunningGrid#Get_a_Grid_certificate_and_the_r