David Britton, 28/May/09.
2
The Large Hadron Collider at CERN
•14 TeV collisions
•27 km circumference
•1,200 14 m superconducting dipoles at 8.36 Tesla
•8,000 cryomagnets
•40,000 tons of metal at −271 °C
•700,000 L of liquid helium
•12,000,000 L of liquid nitrogen
•800,000,000 proton-proton collisions per second
3
Data from the LHC Experiments
•ATLAS (7,000 tonnes): 100 million channels, raw data = 320 MB/s
•CMS (12,500 tonnes): 55 million channels, raw data = 220 MB/s
•ALICE (10,000 tonnes): 18 million channels, raw data = 100 MB/s
•LHCb (5,600 tonnes): 1.2 million channels, raw data = 50 MB/s
Total raw data flow ~700 MB/s, or ~15 PB of data per year. One year’s data from the LHC would fill a stack of CDs 20 km high, taller than Mont Blanc (4.8 km) and higher than Concorde’s cruising altitude (15 km).
4
Data Driven Grid Computing
Grid architecture chosen because:
• Costs of maintaining and updating resources are more easily shared in a distributed environment.
• Funding bodies can provide local resources while contributing to the global goal.
• It is easier to build in redundancy and fault tolerance, and to minimise the risk from single points of failure.
• The LHC will operate around the clock for 8 months each year; a Grid spanning time zones means monitoring and support are more readily provided.
[Diagram: data from the four LHC experiments (ALICE, ATLAS, CMS, LHCb) flowing onto the Grid]
5
Worldwide LHC Computing Grid
The Grid is organised in tiers:
• Tier 0 (the CERN computer centre): takes data from the online system and offline farm; the primary data store.
• Tier 1 (11 national centres, e.g. RAL in the UK, plus France, Italy, Germany, Spain, ...): reconstruction, storage, analysis.
• Tier 2 (regional groups; in the UK: ScotGrid, NorthGrid, SouthGrid, London): simulation, analysis. Each comprises institutes (e.g. ScotGrid: Glasgow, Edinburgh, Durham) and their workstations.
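For readers who like things concrete, here is a minimal sketch of the tier model as data. The structure and helper function are our own illustration; only the tier roles and site names come from the slide.

```python
# Purely illustrative model of the WLCG tier hierarchy described above.
WLCG_TIERS = {
    "Tier 0": {
        "sites": ["CERN computer centre"],
        "roles": ["primary data store"],
    },
    "Tier 1": {
        "sites": ["RAL (UK)"],  # one of 11 national centres worldwide
        "roles": ["reconstruction", "storage", "analysis"],
    },
    "Tier 2": {
        "sites": ["ScotGrid", "NorthGrid", "SouthGrid", "London"],
        "roles": ["simulation", "analysis"],
    },
}

def tiers_with_role(role: str) -> list[str]:
    """Return the tiers whose responsibilities include the given role."""
    return [tier for tier, info in WLCG_TIERS.items() if role in info["roles"]]

print(tiers_with_role("analysis"))  # ['Tier 1', 'Tier 2']
```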
6
Worldwide Resources
• Worldwide: 55 countries, 283 sites, 180,000 CPUs
• UK: 23 sites, 20,000 CPUs
7
How does it work? Components
The stack has three layers: experiment frameworks on top, Grid middleware in the middle, and the WLCG fabric (the Tier 0, Tier 1 and Tier 2 sites) underneath. The main middleware components:
• Data movement – File Transfer Service (FTS)
• Storage interface – Storage Resource Manager (SRM)
• Authorisation/roles – Virtual Organisation Membership Service (VOMS)
• Metadata/replication – LCG File Catalogue (LFC)
• Batch submission – Workload Management System (WMS); a JDL sketch follows below
• Distributed conditions databases – Oracle Streams (3D)
• Grid interfaces (e.g. Ganga) and the experiments’ production/analysis systems sit in the experiment-framework layer.
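To make the WMS entry concrete, the sketch below writes a minimal JDL (Job Description Language) file of the kind submitted to the WMS. The JDL attribute names are standard; the executable, argument and sandbox file names are hypothetical examples.

```python
# Write a minimal JDL job description for WMS submission. Attribute
# names are standard JDL; the file names are made-up examples.
JOB_JDL = """
Executable    = "analysis.sh";
Arguments     = "run42.cfg";
StdOutput     = "job.out";
StdError      = "job.err";
InputSandbox  = {"analysis.sh", "run42.cfg"};
OutputSandbox = {"job.out", "job.err"};
"""

with open("job.jdl", "w") as f:
    f.write(JOB_JDL)
```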
8
How does it work? Workflow
[Workflow diagram] The submitter works at a grid user interface (UI). Step 0: run voms-proxy-init to obtain a VOMS proxy. Step 1: submit a JDL job description to the Workload Management System (WMS). The WMS/Resource Broker (RB) consults the BDII information system and the LFC file catalogue, matches the job to suitable grid-enabled resources (CPU nodes and storage at the participating sites), and passes it to the Job Submission (JS) service for execution, with progress recorded in the Logging & Bookkeeping service (steps 2-10 in the diagram). The submitter can poll the job status at any time; step 11 is job retrieval, fetching the output back to the UI.
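A sketch of the same workflow driven from a grid UI with the gLite command-line tools. The VO name "atlas" and the file job.jdl (from the previous sketch) are example values, and exact flags can vary between middleware releases.

```python
# Drive the submit/status/retrieve cycle via the gLite CLI tools.
import subprocess

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 0: obtain a short-lived VOMS proxy carrying the user's VO role.
run(["voms-proxy-init", "--voms", "atlas"])

# Step 1: hand the JDL description to the WMS, saving the job ID.
run(["glite-wms-job-submit", "-a", "-o", "jobids.txt", "job.jdl"])

# Poll the Logging & Bookkeeping service for the job's state.
run(["glite-wms-job-status", "-i", "jobids.txt"])

# Step 11: once the job is Done, fetch the output sandbox.
run(["glite-wms-job-output", "-i", "jobids.txt"])
```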
9
Availability: The UK Tier-1
Availability is the fraction of time the site is up, so even scheduled maintenance counts against this metric.
The target is 97% (achieved). Availability is measured by SAM (Service Availability Monitor) tests.
There are also experiment-specific SAM tests, which are more demanding; the example shown here is from ATLAS. The target is again 97%. Performance is improving, but was degraded by problems with the CASTOR mass storage system.
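As a rough sense check (our arithmetic, not from the slide), a 97% availability target leaves a very small downtime budget:

```python
# Downtime budget implied by an availability target (illustrative).
HOURS_PER_MONTH = 30 * 24

def downtime_budget_hours(target: float) -> float:
    """Hours per month a site may be down, including scheduled
    maintenance, since that also counts against availability."""
    return (1.0 - target) * HOURS_PER_MONTH

print(f"{downtime_budget_hours(0.97):.1f} h/month at 97%")  # ~21.6
print(f"{downtime_budget_hours(0.90):.1f} h/month at 90%")  # ~72.0
```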
10
Availability: Full UK Picture
11
Resilience and Disaster Planning
• The Grid must be made resilient to failures and disasters over a wide scale, from simple disk failures up to major incidents like the prolonged loss of a whole site.
• One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault-tolerance a reality.
12
Strategy
Fortifying the Service:
• Duplicate services or machines.
• Increase the hardware’s capacity to handle faults.
• Use (good) fault detection.
• Implement automatic restarts.
• Provide fast intervention.
• Fully investigate failures.
• Report bugs and press for better middleware.
Disaster Planning:
• Take control early enough.
• (Pre-)establish the possible options.
• Understand user priorities.
• Act in a timely fashion.
• Communicate effectively.
Resilience must be considered at the level of hardware, software and location.
13
Duplicating Services or Machines
Example: multiple WMS instances, so that job submission does not depend on any single machine.
14
Hardware Capacity and Fault Tolerance
Examples:
Storage – Use RAID arrays: RAID5/RAID6 for storage arrays; RAID1 for system disks. Hot spares allow automatic rebuilds.
Memory – Increase memory capacity; use ECC (error-correcting code) memory and monitor for a rise in the error-correction rate.
Power – Use redundant power supplies connected to different circuits where possible; UPS for critical systems.
Interconnects – Use two or more bonded network connections, with cables routed separately.
CPU – Use more powerful machines.
Databases – Use Oracle RACs (Real Application Clusters), which let multiple servers access a database simultaneously.
Resilient hardware helps a service survive common failure modes and keeps it operating until the faulty component can be replaced and the service made resilient again.
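As a toy illustration of the RAID trade-off above (drive counts and sizes are assumed examples, not from the slide):

```python
# Capacity vs. redundancy for the RAID levels mentioned above.
def usable_tb(drives: int, size_tb: float, parity_drives: int) -> float:
    """Usable capacity when parity_drives' worth of space goes to parity."""
    return (drives - parity_drives) * size_tb

n, size = 12, 1.0  # a hypothetical 12 x 1 TB array
print("RAID5:", usable_tb(n, size, 1), "TB usable, survives 1 drive failure")
print("RAID6:", usable_tb(n, size, 2), "TB usable, survives 2 drive failures")
# RAID6's second parity stripe is what protects against a second drive
# failing during a rebuild.
```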
15
Fault Detection
• If it can be monitored, monitor it! Catch problems early, e.g. with Nagios alarms: load alarms; file systems close to full; certificates close to expiry; failed drives (a minimal example follows this list).
• Look for signatures of impending problems to predict component failure.
• Idle disks hide their faults: run regular low-level verification to push sick drives over the edge and replace them early in the failure cycle, so they don’t fail during a rebuild.
• Watch for increased error rates on network links, caused by failing line cards, transceivers or cable/fibre degradation; with redundant links, the faulty one can be replaced while the service keeps going.
• Run a call-out system for problems that impact services.
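A minimal sketch of the kind of early-warning check such an alarm runs. The threshold is an assumed example, and a production site would use standard Nagios plugins rather than a hand-rolled script.

```python
# Toy disk-space check following the Nagios plugin convention
# (exit 0 = OK, 1 = WARNING), with a status line on stdout.
import shutil
import sys

MOUNT, WARN_FRACTION = "/", 0.90  # assumed mount point and threshold

usage = shutil.disk_usage(MOUNT)
fraction_used = usage.used / usage.total

if fraction_used >= WARN_FRACTION:
    print(f"WARNING - {MOUNT} {fraction_used:.0%} full")
    sys.exit(1)
print(f"OK - {MOUNT} {fraction_used:.0%} full")
sys.exit(0)
```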
16
Intervention and Investigation
• Run a 24x7 call-out system, connected to a pager triggered by automatic alarms.
• 2-hour response time for critical failures.
• All incidents are examined to learn lessons: the call-out rate has dropped from 10/day to as low as 1/week.
• Reports are written up on serious incidents and reported to the WLCG so other sites around the world can see them.
17
“Despite everything, disasters will happen.”
WLCG weekly operations report, Feb-09 (Taiwan)
18
Disaster Planning
• A Disaster Response plan is needed and must be well understood: use it regularly for anything that could turn into a disaster!
• Stage 1: Disaster Potential Identified. Informally assess; monitor; set deadlines; do not interfere.
• Stage 2: Possible Disaster. Add internal management oversight; formally assess; divert resources.
• Stage 3: Disaster Likely. Add external experts and stakeholder representation to the oversight; hold regular meetings with the experiments; prepare contingencies; communicate widely.
• Stage 4: Actual Disaster. Manage the disaster according to the high-level disaster plan and the contingencies identified at Stage 3; communicate widely.
19
Summary
• In the UK we have spent the last 6 years preparing for the LHC data challenge and have deployed 20,000 CPUs as part of a world-wide Grid of 180,000 CPUs: The largest scientific computing Grid in the world.
• The last year has focused on making the service reliable and resilient: Our Tier-1 centre currently delivers 97% availability and our Tier-2 centres average over 90%.
• We have initiated planning to understand the possible responses to a major disaster and to set up a disaster management process to handle such incidents.
• We look forward to the arrival of LHC data!