Oxford & SouthGrid Update HEPiX Pete Gronbech GridPP Project Manager October 2015.

Post on 19-Jan-2016

214 views 0 download

Tags:

transcript

Oxford & SouthGrid UpdateHEPiX

Pete Gronbech, GridPP Project Manager

October 2015

Oxford Particle Physics - Overview

• Two computational clusters for the PP physicists:
  – Grid Cluster: part of the SouthGrid Tier-2
  – Local Cluster (a.k.a. Tier-3)

• A common Cobbler and Puppet system is used to install and maintain all the SL6 systems.


Oxford Grid Cluster

• No major changes since last time.

• All of the Grid Cluster is now running HTCondor behind an ARC CE. The last CREAM CEs using Torque and Maui were decommissioned by 1st August.


Current capacity: 16,768 HS06, 1300 TB

Oxford Local Cluster

• Almost exclusively SL6 now.
• New procurement this summer.
• 4 Supermicro twin-squared CPU boxes provide 256 physical CPU cores.
• Chose Intel E5-2630 v3s; should provide an upgrade of approx. 4400 HS06 (see the sanity-check sketch below).
• Storage will be from Lenovo:
  – 1U server with two disk shelves, containing 12 * 4TB SAS disks
  – Should provide an increased capacity of ~350TB
  – ~88TB for NFS and the rest for Lustre
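As a quick sanity check of the procurement estimate above (illustrative arithmetic only, using just the figures quoted on this slide and assuming the ~4400 HS06 covers all 256 new cores):

    # Rough sanity check using only the numbers quoted on this slide.
    new_cores = 256        # 4 Supermicro twin-squared chassis
    new_hs06 = 4400        # estimated HS06 from the E5-2630 v3 purchase
    current_hs06 = 6000    # current local cluster capacity in HS06

    hs06_per_core = new_hs06 / new_cores
    total_after_upgrade = current_hs06 + new_hs06

    print(f"~{hs06_per_core:.1f} HS06 per physical core")    # ~17.2
    print(f"~{total_after_upgrade} HS06 after the upgrade")  # ~10400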


Current capacity: 6000 HS06, 680 TB

Intel E5-2650 v2 SL6 HEPSPEC06

Average result 361.4


Intel E5-2630 v3 SL6 HEPSPEC06


Peak average result 347 with 31 threads.

(~2% drop in compute power wrt 2650v2)

Power Usage – Twin squared chassis

2650v2

Max 1165W

Idle 310W


Power Usage – Twin squared chassis 2630v3


Max 945W

Idle 250W

(~19% drop in electrical power wrt 2650v2)
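The ~19% figure follows directly from the measured chassis power numbers on these two slides; a minimal check:

    # Power comparison of the two twin-squared chassis, using the measured values above.
    max_2650v2, idle_2650v2 = 1165, 310   # watts
    max_2630v3, idle_2630v3 = 945, 250    # watts

    drop_max = (max_2650v2 - max_2630v3) / max_2650v2
    drop_idle = (idle_2650v2 - idle_2630v3) / idle_2650v2

    print(f"Max power drop:  {drop_max:.1%}")   # ~18.9%
    print(f"Idle power drop: {drop_idle:.1%}")  # ~19.4%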

SouthGrid Sites


RAL PPD

General
• Migrating all services to Puppet-managed SL6, actively killing off cfengine-managed SL5 services
• In the process of migrating our virtualisation cluster from a Windows 2008R2 Failover Cluster to Windows 2012R2

Tier 2
• Not much change; continue to be ARC CE/Condor/dCache - these services are now very reliable and low maintenance
• Currently deploying ELK
• New Nagios and Ganglia monitoring integrated with Puppet

Department
• Prototype analysis cluster - 5-node BeeGFS cluster
• ownCloud storage - see Chris's talk


Current capacity: 41,719 HS06, 2610 TB

JET

• Very small cluster remains at JET.
• Closure of the central fusion VO gLite services by BIFI and CIEMAT will mean a reduction in work done at this site.


Current capacity: 1772 HS06, 1.5 TB

Birmingham Tier 2 Grid Site

Grid site running ATLAS, ALICE, LHCb and 10% other VOs

The Birmingham Tier 2 site hardware consists of:
● 1200 cores across 56 machines at two campus locations
● ~470 TB of disk, mostly reserved for ATLAS and ALICE
● Using some Tier 3 hardware to run Vac during slow periods

Status of the site:
● Some very old workers are starting to fail, but still have plenty left
● Disks have been lasting surprisingly well, replacing fewer than expected

Future plans and upcoming changes:
● Replace CREAM and Torque/Maui with ARC and HTCondor
● Get some IPv6 addresses from central IT and start some tests

Current capacity: 12,489 HS06, 620 TB

Birmingham Tier 3 Systems

The Birmingham Tier 3 site hardware consists of:
● Batch cluster farm: 8 nodes, 32 (logical) cores and 48GB per node
● 160TB 'new' storage + 160TB 'old' storage
● ~60 desktop machines
● 5 1U VM hosts

The following services are used, running on VMs where appropriate:
● Cluster management through Puppet
● New storage accessed through Lustre, old storage over NFS
● Condor batch system set up with dynamic queues
● Authentication through LDAP
● DHCP and mail servers
● Desktops run Fedora 21 but can chroot into SL5 and SL6 images


Cambridge (1)

GRID
• Currently have 288 CPU cores and 350TB storage.
• Major recent news is the move of all GRID kit to the new University Data Centre.
• The move went reasonably smoothly.
• Had to make a small reorganisation of servers in the racks to avoid exceeding power and cooling limits:
  – "Standard" Hall is limited to 8kW per rack
  – "HPC" Hall has higher limits, but there is limited space. Also we would be charged for rack space in the HPC Hall (we are currently not charged for rack space in the Standard Hall).
• Kit is still on the HEP network as before (networking provision in the Data Centre is "work in progress"). Hence we've had to buy a circuit from the University between the department and the Data Centre.
• At present we are not being charged for electricity and cooling in the Data Centre.


Current capacity: 3729 HS06, 350 TB

Cambridge (2)

HEP Cluster
• Currently have ~40 desktops, 112 CPU cores in a batch farm and ~200TB storage.
• Desktops and farm are in a common HTCondor setup.
• Running SLC6 on both desktop and farm machines.
• Storage is on a (small) number of DAS servers. Considering a trial of Lustre (though it may be a "sledgehammer to crack a walnut").
• Backup is currently to an LTO5 autoloader. This needs replacement in the near-ish future. The University is making noises about providing Storage as a Service, which might be a way forward, but the initial suggested pricing looked uncompetitive. Also, as usual, things move at a glacial pace and the replacement may be necessary before a production service is available.


Bristol

• Bristol is switching from a StoRM SE to DMLite, which has been a big change.
• The Storage Group was keen for a DMLite-with-HDFS test site.
• Looking for an SE that will work with HDFS and the European authentication infrastructure (so not BeStMan).
• Suggested to try DMLite.
• First iterations did not work, but the developers followed up.
• Now running the DMLite SE with 2x GridFTP servers (no SRM).
• The GridFTP servers use tmpfs as a fast buffer.
• Missing bits (but in the queue): BDII fixes (proper reporting) and High Availability (HA) configuration (currently only one name node can be specified).
• Performance improvements might come for the GridFTP servers (skip buffering in the case of single streams).


Current capacity: 10,410 HS06, 218 TB

University of Sussex - Site Report

- Mixed-use cluster with both local and WLCG Tier2 Grid usage

- ~3000 cores across 112 nodes, mixture of Intel and AMD processors

- Housed inside University’s central Datacentre across 5 racks

- InfiniBand network, using QLogic/Intel TrueScale 40Gb QDR

- ~380TB Lustre 2.5.3 filesystem used as scratch area, with NFS for other storage

- Grid uses a few dedicated nodes, predominantly for ATLAS and SNO+

- Batch System

- Univa Grid Engine (UGE) is used as Batch system. Currently on version 8.3.0.

- Started using UGE’s support for cgroup resource limiting to give proper main-memory limits to user jobs. Very happy with how it’s working, along with a default resource allocation policy implemented via JSV script, making usage fairer.
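UGE applies these limits through the kernel's cgroup interface. As a rough illustration of the underlying mechanism only (not UGE's actual implementation, and assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory plus root privileges), a hard per-job memory limit can be imposed roughly like this:

    import os

    # Illustrative sketch: create a memory cgroup, set a hard limit, add a process to it.
    # Assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory.
    CGROUP_ROOT = "/sys/fs/cgroup/memory"

    def limit_job_memory(job_name, pid, limit_bytes):
        job_cg = os.path.join(CGROUP_ROOT, job_name)
        os.makedirs(job_cg, exist_ok=True)

        # Hard main-memory limit for every process placed in this cgroup.
        with open(os.path.join(job_cg, "memory.limit_in_bytes"), "w") as f:
            f.write(str(limit_bytes))

        # Move the job's process into the cgroup; the kernel then enforces the limit.
        with open(os.path.join(job_cg, "cgroup.procs"), "w") as f:
            f.write(str(pid))

    # Hypothetical usage: limit the current process to 4 GiB.
    # limit_job_memory("example_job", os.getpid(), 4 * 1024**3)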

Current capacity: 1977 HS06, 70 TB

University of Sussex - Site Report

- Lustre

- Core filesystem for users to perform all I/O on. Nodes have very limited local scratch, all expected to use Lustre

- Have had a very good experience with Lustre 1.8.9 for past few years, self-maintained using community releases

- Move to Lustre 2.5.3

- Upgrade held back by LU-1482, which prevents the Grid StoRM middleware from working on it. A recent unofficial patch to Lustre that fixes this issue enabled us to perform the upgrade

- Bought new MDS servers, enabling us to set up a completely separate system alongside the preexisting one. We can then mount both filesystems and copy data between the two.

- Currently have a 380TB Lustre 2.5.3 system and a 280TB Lustre 1.8.9 filesystem. After decommissioning the old system we will have a ~600TB unified Lustre 2.5.3 filesystem.

- Experimenting with the robinhood policy engine for Lustre filesystem usage analysis. Good experience so far.

Begbroke Computer Room houses the Oxford Tier 2 Cluster


Spot the difference


Air Con Failure


Saturday 16:19
Sunday 16:13
Monday 17:04

• People came into the site or remotely switched off clusters very quickly.

• Building Services reset the A/C in 1-2 hours on both weekend days.

• The bulk of the load comes from University and Physics HPC clusters, but it turns out some critical University Financial services were also being run on this site.

• The incident was taken seriously: a backup system was ordered on Tuesday morning and installed from 10pm to 2am that night.

• It provides ~100kW of backup cooling capacity in case of further trips. The normal load is ~230kW, so the main clusters are currently restricted.


Additional temporary A/C


• Pressurization unit had two faults, both repaired. A new, improved unit is to be installed on Monday.
• A 200kW computing load will heat the room very quickly when the A/C fails (see the rough estimate after this list).
• It always seems to happen out of hours, so you need to react very quickly.
• The response really needs to be automated; even a fast response from staff is not fast enough.
• Even risk mitigation has its own risks.
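To put rough numbers on that heat-up rate, here is a back-of-the-envelope sketch: only the 200kW load comes from the slide, the room air volume and air properties are assumptions for illustration, and heat absorbed by equipment and walls (which slows the rise in practice) is ignored:

    # Back-of-the-envelope: how fast does the room air warm up once cooling stops?
    # Only the 200 kW load comes from the slide; everything else is an assumption.
    heat_load_w = 200_000        # computing load dumped into the room (W)
    room_volume_m3 = 500.0       # ASSUMED machine-room air volume (m^3)
    air_density = 1.2            # kg/m^3, typical at room temperature
    air_heat_capacity = 1005.0   # J/(kg*K) for air

    air_mass_kg = room_volume_m3 * air_density
    deg_c_per_second = heat_load_w / (air_mass_kg * air_heat_capacity)

    print(f"~{deg_c_per_second * 60:.0f} degC per minute")  # roughly 20 degC/min under these assumptions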


Water and Electrics!!
