Page 1: RAL Site Report

RAL Site Report

Martin Bly
HEPiX Fall 2009, 26-30 October 2009, LBL, Berkeley CA

Page 2: RAL Site Report


Overview

• New Building
• Tier1 move
• Hardware
• Networking
• Developments


Page 5: RAL Site Report


New Building + Tier1 move

• New building handed over in April
  – Half the department moved in to R89 at the start of May
  – Tier1 staff and the rest of the department moved in 6 June

• Tier1 2008 procurements delivered direct to new building
  – Including new SL8500 tape silo (commissioned then mothballed)
  – New hardware entered testing as soon as practicable

• Non-Tier1 kit including HPC clusters moved starting early June

• Tier1 moved 22 June – 6 July
  – Complete success, to schedule
  – 4 contractor firms, all T1 staff
  – 43 racks, a C300 switch and 1 tape silo
  – Shortest practical service downtimes

Page 6: RAL Site Report


Building issues and developments

• Building generally working well, but it is usual to have teething troubles in new buildings…
  – Two air-con failures
    • Machine room air temperature reached >40 ºC in 30 minutes
  – Moisture where it shouldn’t be

• The original building plan included a Combined Heat and Power unit (CHP), so only enough chilled water capacity was installed to cover the period until the CHP was installed and working
  – Plan changed to remove the CHP => shortfall in chilled water capacity
  – Two extra 750kW chillers ordered for installation early in 2010
  – Provide planned cooling until 2012/13
  – Timely: planning now underway for the first water-cooled racks (for non-Tier1 HPC facilities)


Page 8: RAL Site Report


Recent New Hardware

• CPU
  – ~3000 kSI2K (~1850 cores) in Supermicro ‘twin’ systems
    • E5420/San Clemente & L5420/Seaburg: 2GB/core, 500GB HDD
    • Now running SL5/x86_64 in production

• Disk (see the capacity sketch after this list)
  – ~2PB in 4U 24-bay chassis: 22 data disks in RAID6, 2 system disks in RAID1
  – 2 vendors:
    • 50 with single Areca controller and 1TB WD data drives – deployed
    • 60 with dual LSI/3ware/AMCC controllers and 1TB Seagate data drives

• Second SL8500 silo, 10K slots, 10PB (1TB tapes)
  – Delivered to new machine room – pass-through to existing robot
  – Tier1 use – GridPP tape drives have been transferred
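
As a rough sanity check, the ~2PB figure above follows from the drive count and the RAID6 layout in the bullets. A minimal Python sketch, assuming the 50 + 60 servers and 1TB drives listed and ignoring filesystem overhead:

    # Rough capacity check for the disk purchase described above.
    # Assumes 50 + 60 = 110 servers, each with 22 x 1TB data drives in RAID6
    # (two drives' worth of parity) plus 2 system disks in RAID1.

    DATA_DRIVES = 22        # data disks per 24-bay chassis
    DRIVE_TB = 1.0          # 1TB SATA data drives (WD or Seagate, depending on lot)
    RAID6_PARITY = 2        # RAID6 gives up two drives' capacity per array
    SERVERS = 50 + 60       # Areca lot plus LSI/3ware/AMCC lot

    usable_per_server_tb = (DATA_DRIVES - RAID6_PARITY) * DRIVE_TB   # 20 TB
    total_pb = SERVERS * usable_per_server_tb / 1000                 # ~2.2 PB

    print(f"{usable_per_server_tb:.0f} TB/server, ~{total_pb:.1f} PB in total")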

Page 9: RAL Site Report


Recent / Next Hardware

• ‘Services’ nodes
  – 10 ‘twins’ (20 systems), twin disks
  – 3 Dell PE 2950 III servers and 4 EMC AX4-5 array units for Oracle RACs
  – Extra SAN hardware for resilience

• Procurements running (per-node arithmetic sketched after this list)
  – ~15000 HEP-SPEC06 for batch, 3GB RAM and 100GB disk per core
    • => 24GB RAM and 1TB drive for an 8-core system
  – ~3PB disk storage in two lots of two tranches, January and April
  – Additional tape drives: 9 x T10KB
    • Initially for CMS
    • Total 18 x T10KA and 9 x T10KB for PP use

• To come
  – More services nodes
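
The per-node figures in the procurement bullet are just the per-core requirements scaled up; a minimal sketch of that arithmetic in Python, using only the numbers from the slide:

    # Per-node sizing from the per-core batch procurement requirements above.
    CORES_PER_NODE = 8
    RAM_PER_CORE_GB = 3
    DISK_PER_CORE_GB = 100

    ram_gb = CORES_PER_NODE * RAM_PER_CORE_GB     # 24 GB RAM per 8-core system
    disk_gb = CORES_PER_NODE * DISK_PER_CORE_GB   # 800 GB, provisioned as a 1TB drive

    print(f"{ram_gb} GB RAM, {disk_gb} GB scratch -> fits on a 1TB drive")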

Page 10: RAL Site Report


Disk Storage

• ~350 servers
  – RAID6 on PCI-e SATA controllers, 1Gb/s NIC
  – SL4 32bit with ext3
  – Capacity ~4.2PB in 6TB, 8TB, 10TB, 20TB servers
  – Mostly deployed for the Castor service
    • Three partitions per server
  – Some NFS (legacy data), xrootd (BaBar)
    • Single/multiple partitions as required

• Array verification using controller tools (selection logic sketched below)
  – 20% of capacity in any Castor service class done in a week
  – Tuesday to Thursday, servers that have gone longest since last verify
  – Fewer double throws, decrease in overall throw rates
  – Also using CERN fsprobe to look for silent data corruption
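
The selection logic described above (oldest-first within each Castor service class, capped at roughly 20% of that class's capacity per week) could be sketched as below; the record layout and field names are illustrative assumptions, not the actual controller tooling:

    # Illustrative sketch of the weekly verification rota described above.
    # 'servers' is assumed to be a list of dicts with 'service_class',
    # 'capacity_tb' and 'last_verify' fields; the real selection is done
    # with the vendors' controller tools, not this script.

    def pick_servers_to_verify(servers, fraction=0.20):
        by_class = {}
        for s in servers:
            by_class.setdefault(s["service_class"], []).append(s)

        selected = []
        for members in by_class.values():
            budget_tb = fraction * sum(s["capacity_tb"] for s in members)
            # Take the servers that have gone longest since their last verify first
            for s in sorted(members, key=lambda s: s["last_verify"]):
                if budget_tb <= 0:
                    break
                selected.append(s)
                budget_tb -= s["capacity_tb"]
        return selected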

Page 11: RAL Site Report


Hardware Issues I

• Problem during acceptance testing of part of the 2008 storage procurement
  – 22 x 1TB SATA drives on PCI-e RAID controller
  – Drive timeouts, arrays inaccessible

• Working with supplier to resolve the issue
  – Supplier is working hard on our behalf
  – Regular phone conferences
  – Engaged with HDD and controller OEMs

• Appears to be two separate issues
  – HDD
  – Controller

• Possible that resolution of both issues is in sight

Page 12: RAL Site Report


Hardware Issues II – Oracle databases

• New resilient hardware configuration for the Oracle database SAN using EMC AX4 array sets
  – Used in ‘mirror’ pairs at Oracle ASM level

• Operated well for Castor pre-move and for non-Castor post-move, but increasing instances of controller dropout on Castor kit
  – Eventual crash of one Castor array, followed some time later by the second array
  – Non-Castor array pair also unstable; eventually both crashed together
  – Data loss from Castor databases as a side effect of the arrays crashing at different times and therefore being out of sync. No unique files ‘lost’.

• Investigations continuing to find the cause – possibly electrical

Page 13: RAL Site Report


Networking

• Force10 C300 in use as core switch since Autumn 08
  – Up to 64 x 10GbE at wire speed (32 ports fitted)

• Not implementing routing on the C300
  – Turns out the C300 doesn’t support policy-based routing…
  – …but policy-based routing is on the roadmap for C300 software
    • Next year sometime

• Investigating possibilities for added resilience with an additional C300

• Doubled up link to OPN gateway to alleviate bottleneck caused by routing UK T2 traffic round the site firewall
  – Working on doubling links to edge stacks

• Procuring fallback link for OPN to CERN using 4 x 1GbE
  – Added resilience

Page 14: RAL Site Report


Developments I - Batch Services

• Production service:
  – SL5.2/64bit with residual SL4.7/32bit (2%)
  – ~4000 cores, ~32000 HEP-SPEC06
    • Opteron 270
    • Woodcrest E5130
    • Harpertown E5410, E5420, L5420 and E5440
  – All with 2GB RAM/core
  – Torque/Maui on SL5/64bit host with 64bit Torque server
  – Deployed with Quattor in September
  – Running 50% over-commit on RAM to improve occupancy (see the sketch after this list)

• Previous service:
  – 32bit Torque/Maui server (SL3) and 32bit CPU workers all retired
  – Hosts used for testing etc.
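
One reading of the 50% RAM over-commit above is that the scheduler treats each worker as having half again its physical memory, so jobs that request more RAM than they actually use still pack onto the 2GB/core nodes. A minimal arithmetic sketch of that interpretation in Python; it is not the actual Torque/Maui configuration:

    # Illustrative arithmetic for the 50% RAM over-commit on the batch workers.
    # This is not the real Torque/Maui configuration, just the sizing logic.

    PHYSICAL_GB_PER_CORE = 2.0   # worker nodes have 2GB RAM per core
    OVERCOMMIT_FRACTION = 0.50   # advertise 50% more than is physically present

    advertised_gb_per_core = PHYSICAL_GB_PER_CORE * (1 + OVERCOMMIT_FRACTION)
    print(f"Scheduler sees {advertised_gb_per_core:.0f} GB/core on "
          f"{PHYSICAL_GB_PER_CORE:.0f} GB/core workers")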

Page 15: RAL Site Report


Developments II - Dashboard

• A new dashboard to provide an operational overview of services and the Tier1 ‘state’ for operations staff, VOs…

• Constantly evolving
  – Components can be added/updated/removed
  – Pulls data from lots of sources

• Present components
  – SAM Tests
    • Latest test results for critical services
    • Locally cached for 10 minutes to reduce load (see the caching sketch after this list)
  – Downtimes
  – Notices
    • Latest information on Tier1 operations
    • Only Tier1 staff can post
  – Ganglia plots of key components from the Tier1 farm

• Available at http://www.gridpp.rl.ac.uk/status
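
The 10-minute local cache for SAM results is a simple time-based memo: refetch only when the stored copy is older than the TTL. A minimal Python sketch under assumed names; fetch_latest_sam_results is a placeholder, not the dashboard's real API:

    import time

    # Minimal sketch of a 10-minute local cache for SAM test results, as
    # described above.  fetch_latest_sam_results is a placeholder for
    # whatever actually queries the SAM framework.

    CACHE_TTL_SECONDS = 10 * 60
    _cache = {"fetched_at": 0.0, "results": None}

    def get_sam_results(fetch_latest_sam_results):
        now = time.time()
        if _cache["results"] is None or now - _cache["fetched_at"] > CACHE_TTL_SECONDS:
            _cache["results"] = fetch_latest_sam_results()
            _cache["fetched_at"] = now
        return _cache["results"]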


Page 17: RAL Site Report


Developments III - Quattor

• Fabric management using Quattor
  – Will replace existing hand-crafted PXE/kickstart and payload scripting
  – Successful trial of Quattor using virtual systems
  – Production deployment of SL5/x86_64 WNs and Torque/Maui for 64bit batch service in mid September
  – Now have additional node types under Quattor management
  – Working on disk servers for Castor

• See Ian Collier’s talk on our Quattor experiences:
  http://indico.cern.ch/contributionDisplay.py?contribId=52&sessionId=21&confId=61917

Page 18: RAL Site Report


Towards data taking

• Lots of work in the last 12 months to make services more resilient
  – Taking advantage of LHC delays

• Freeze on service updates
  – No ‘fiddling’ with services
  – Increased stability
  – Reduced downtimes
  – Non-intrusive changes

• But some things, such as security updates, still need to be done
  – Need to manage these to avoid service downtime

