Page 1: RAL Site Report

RAL Site Report

Martin Bly
HEPiX Fall 2009, 26-30 October 2009, LBL, Berkeley CA

Page 2: RAL Site Report


Overview

• New Building
• Tier1 move
• Hardware
• Networking
• Developments


Page 5: RAL Site Report


New Building + Tier1 move

• New building handed over in April
  – Half the department moved in to R89 at the start of May
  – Tier1 staff and the rest of the department moved in 6 June

• Tier1 2008 procurements delivered direct to new building
  – Including new SL8500 tape silo (commissioned then mothballed)
  – New hardware entered testing as soon as practicable

• Non-Tier1 kit including HPC clusters moved starting early June

• Tier1 moved 22 June – 6 July
  – Complete success, to schedule
  – 4 contractor firms, all T1 staff
  – 43 racks, a C300 switch and 1 tape silo
  – Shortest practical service downtimes

Page 6: RAL Site Report


Building issues and developments

• Building generally working well, but it is usual to have teething troubles in new buildings…
  – Two air-con failures
    • Machine room air temperature reached >40 ºC in 30 minutes
  – Moisture where it shouldn’t be

• The original building plan included a Combined Heat and Power unit (CHP), so only enough chilled water capacity was installed to cover the period until the CHP was installed and working
  – Plan changed to remove the CHP => shortfall in chilled water capacity
  – Two extra 750kW chillers ordered for installation early in 2010
  – Provide planned cooling until 2012/13
  – Timely: planning now underway for the first water-cooled racks (for non-Tier1 HPC facilities)


Page 8: RAL Site Report


Recent New Hardware

• CPU
  – ~3000 kSI2K (~1850 cores) in Supermicro ‘twin’ systems
    • E5420/San Clemente & L5420/Seaburg: 2GB/core, 500GB HDD
    • Now running SL5/x86_64 in production

• Disk (see the capacity sketch after this list)
  – ~2PB in 4U 24-bay chassis: 22 data disks in RAID6, 2 system disks in RAID1
  – 2 vendors:
    • 50 with single Areca controller and 1TB WD data drives – deployed
    • 60 with dual LSI/3ware/AMCC controllers and 1TB Seagate data drives

• Second SL8500 silo, 10K slots, 10PB (1TB tapes)
  – Delivered to new machine room – pass-through to existing robot
  – Tier1 use – GridPP tape drives have been transferred
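
As a rough sanity check, the ~2PB figure above follows from the drive count and the RAID6 layout in the bullets. A minimal Python sketch, assuming the 50 + 60 servers and 1TB drives listed and ignoring filesystem overhead:

    # Rough capacity check for the disk purchase described above.
    # Assumes 50 + 60 = 110 servers, each with 22 x 1TB data drives in RAID6
    # (two drives' worth of parity) plus 2 system disks in RAID1.

    DATA_DRIVES = 22        # data disks per 24-bay chassis
    DRIVE_TB = 1.0          # 1TB SATA data drives (WD or Seagate, depending on lot)
    RAID6_PARITY = 2        # RAID6 gives up two drives' capacity per array
    SERVERS = 50 + 60       # Areca lot plus LSI/3ware/AMCC lot

    usable_per_server_tb = (DATA_DRIVES - RAID6_PARITY) * DRIVE_TB   # 20 TB
    total_pb = SERVERS * usable_per_server_tb / 1000                 # ~2.2 PB

    print(f"{usable_per_server_tb:.0f} TB/server, ~{total_pb:.1f} PB in total")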

Page 9: RAL Site Report


Recent / Next Hardware

• ‘Services’ nodes
  – 10 ‘twins’ (20 systems), twin disks
  – 3 Dell PE 2950 III servers and 4 EMC AX4-5 array units for Oracle RACs
  – Extra SAN hardware for resilience

• Procurements running (per-node arithmetic sketched after this list)
  – ~15000 HEP-SPEC06 for batch, 3GB RAM and 100GB disk per core
    • => 24GB RAM and 1TB drive for an 8-core system
  – ~3PB disk storage in two lots of two tranches, January and April
  – Additional tape drives: 9 x T10KB
    • Initially for CMS
    • Total 18 x T10KA and 9 x T10KB for PP use

• To come
  – More services nodes
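
The per-node figures in the procurement bullet are just the per-core requirements scaled up; a minimal sketch of that arithmetic in Python, using only the numbers from the slide:

    # Per-node sizing from the per-core batch procurement requirements above.
    CORES_PER_NODE = 8
    RAM_PER_CORE_GB = 3
    DISK_PER_CORE_GB = 100

    ram_gb = CORES_PER_NODE * RAM_PER_CORE_GB     # 24 GB RAM per 8-core system
    disk_gb = CORES_PER_NODE * DISK_PER_CORE_GB   # 800 GB, provisioned as a 1TB drive

    print(f"{ram_gb} GB RAM, {disk_gb} GB scratch -> fits on a 1TB drive")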

Page 10: RAL Site Report


Disk Storage

• ~350 servers
  – RAID6 on PCI-e SATA controllers, 1Gb/s NIC
  – SL4 32bit with ext3
  – Capacity ~4.2PB in 6TB, 8TB, 10TB, 20TB servers
  – Mostly deployed for the Castor service
    • Three partitions per server
  – Some NFS (legacy data), xrootd (BaBar)
    • Single/multiple partitions as required

• Array verification using controller tools (selection logic sketched below)
  – 20% of capacity in any Castor service class done in a week
  – Tuesday to Thursday, servers that have gone longest since last verify
  – Fewer double throws, decrease in overall throw rates
  – Also using CERN fsprobe to look for silent data corruption
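
The selection logic described above (oldest-first within each Castor service class, capped at roughly 20% of that class's capacity per week) could be sketched as below; the record layout and field names are illustrative assumptions, not the actual controller tooling:

    # Illustrative sketch of the weekly verification rota described above.
    # 'servers' is assumed to be a list of dicts with 'service_class',
    # 'capacity_tb' and 'last_verify' fields; the real selection is done
    # with the vendors' controller tools, not this script.

    def pick_servers_to_verify(servers, fraction=0.20):
        by_class = {}
        for s in servers:
            by_class.setdefault(s["service_class"], []).append(s)

        selected = []
        for members in by_class.values():
            budget_tb = fraction * sum(s["capacity_tb"] for s in members)
            # Take the servers that have gone longest since their last verify first
            for s in sorted(members, key=lambda s: s["last_verify"]):
                if budget_tb <= 0:
                    break
                selected.append(s)
                budget_tb -= s["capacity_tb"]
        return selected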

Page 11: RAL Site Report


Hardware Issues I

• Problem during acceptance testing of part of the 2008 storage procurement
  – 22 x 1TB SATA drives on PCI-e RAID controller
  – Drive timeouts, arrays inaccessible

• Working with supplier to resolve the issue
  – Supplier is working hard on our behalf
  – Regular phone conferences
  – Engaged with HDD and controller OEMs

• Appears to be two separate issues
  – HDD
  – Controller

• Possible that resolution of both issues is in sight

Page 12: RAL Site Report


Hardware Issues II – Oracle databases

• New resilient hardware configuration for the Oracle database SAN using EMC AX4 array sets
  – Used in ‘mirror’ pairs at Oracle ASM level

• Operated well for Castor pre-move and for non-Castor post-move, but increasing instances of controller dropout on Castor kit
  – Eventual crash of one Castor array, followed some time later by the second array
  – Non-Castor array pair also unstable; eventually both crashed together
  – Data loss from Castor databases as a side effect of the arrays crashing at different times and therefore being out of sync. No unique files ‘lost’.

• Investigations continuing to find the cause – possibly electrical

Page 13: RAL Site Report


Networking

• Force10 C300 in use as core switch since Autumn 08
  – Up to 64 x 10GbE at wire speed (32 ports fitted)

• Not implementing routing on the C300
  – Turns out the C300 doesn’t support policy-based routing…
  – …but policy-based routing is on the roadmap for C300 software
    • Next year sometime

• Investigating possibilities for added resilience with an additional C300

• Doubled up link to OPN gateway to alleviate bottleneck caused by routing UK T2 traffic round the site firewall
  – Working on doubling links to edge stacks

• Procuring fallback link for OPN to CERN using 4 x 1GbE
  – Added resilience

Page 14: RAL Site Report


Developments I - Batch Services

• Production service:
  – SL5.2/64bit with residual SL4.7/32bit (2%)
  – ~4000 cores, ~32000 HEP-SPEC06
    • Opteron 270
    • Woodcrest E5130
    • Harpertown E5410, E5420, L5420 and E5440
  – All with 2GB RAM/core
  – Torque/Maui on SL5/64bit host with 64bit Torque server
  – Deployed with Quattor in September
  – Running 50% over-commit on RAM to improve occupancy (see the sketch after this list)

• Previous service:
  – 32bit Torque/Maui server (SL3) and 32bit CPU workers all retired
  – Hosts used for testing etc.
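
One reading of the 50% RAM over-commit above is that the scheduler treats each worker as having half again its physical memory, so jobs that request more RAM than they actually use still pack onto the 2GB/core nodes. A minimal arithmetic sketch of that interpretation in Python; it is not the actual Torque/Maui configuration:

    # Illustrative arithmetic for the 50% RAM over-commit on the batch workers.
    # This is not the real Torque/Maui configuration, just the sizing logic.

    PHYSICAL_GB_PER_CORE = 2.0   # worker nodes have 2GB RAM per core
    OVERCOMMIT_FRACTION = 0.50   # advertise 50% more than is physically present

    advertised_gb_per_core = PHYSICAL_GB_PER_CORE * (1 + OVERCOMMIT_FRACTION)
    print(f"Scheduler sees {advertised_gb_per_core:.0f} GB/core on "
          f"{PHYSICAL_GB_PER_CORE:.0f} GB/core workers")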

Page 15: RAL Site Report


Developments II - Dashboard

• A new dashboard to provide an operational overview of services and the Tier1 ‘state’ for operations staff, VOs…

• Constantly evolving
  – Components can be added/updated/removed
  – Pulls data from lots of sources

• Present components
  – SAM Tests
    • Latest test results for critical services
    • Locally cached for 10 minutes to reduce load (see the caching sketch after this list)
  – Downtimes
  – Notices
    • Latest information on Tier1 operations
    • Only Tier1 staff can post
  – Ganglia plots of key components from the Tier1 farm

• Available at http://www.gridpp.rl.ac.uk/status
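
The 10-minute local cache for SAM results is a simple time-based memo: refetch only when the stored copy is older than the TTL. A minimal Python sketch under assumed names; fetch_latest_sam_results is a placeholder, not the dashboard's real API:

    import time

    # Minimal sketch of a 10-minute local cache for SAM test results, as
    # described above.  fetch_latest_sam_results is a placeholder for
    # whatever actually queries the SAM framework.

    CACHE_TTL_SECONDS = 10 * 60
    _cache = {"fetched_at": 0.0, "results": None}

    def get_sam_results(fetch_latest_sam_results):
        now = time.time()
        if _cache["results"] is None or now - _cache["fetched_at"] > CACHE_TTL_SECONDS:
            _cache["results"] = fetch_latest_sam_results()
            _cache["fetched_at"] = now
        return _cache["results"]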


Page 17: RAL Site Report


Developments III - Quattor

• Fabric management using Quattor
  – Will replace existing hand-crafted PXE/kickstart and payload scripting
  – Successful trial of Quattor using virtual systems
  – Production deployment of SL5/x86_64 WNs and Torque/Maui for 64bit batch service in mid September
  – Now have additional node types under Quattor management
  – Working on disk servers for Castor

• See Ian Collier’s talk on our Quattor experiences:
  http://indico.cern.ch/contributionDisplay.py?contribId=52&sessionId=21&confId=61917

Page 18: RAL Site Report


Towards data taking

• Lots of work in the last 12 months to make services more resilient
  – Taking advantage of LHC delays

• Freeze on service updates
  – No ‘fiddling’ with services
  – Increased stability
  – Reduced downtimes
  – Non-intrusive changes

• But some things, such as security updates, still need to be done
  – Need to manage these to avoid service downtime

