RHIC/US ATLAS Tier 1 Computing Facility
Site Report
Christopher Hollowell
Physics Department
Brookhaven National Laboratory
[email protected]
HEPiX, Upton, NY, USA
October 18, 2004
Facility Overview
● Created in the mid 1990's to provide centralized computing services for the RHIC experiments
● Expanded our role in the late 1990's to act as the tier 1 computing center for ATLAS in the United States
● Currently employ 28 staff members; planning to add 5 additional employees in the next fiscal year
Facility Overview (Cont.)
● Ramping up resources provided to ATLAS: Data Challenge 2 (DC2) underway
● RHIC Run 5 scheduled to begin in late December 2004
Centralized Disk Storage
● 37 NFS servers running Solaris 9: recently upgraded from Solaris 8
● Underlying filesystems upgraded to VxFS 4.0
– Issue with quotas on filesystems larger than 1 TB in size
● ~220 TB of fibre channel SAN-based RAID5 storage available: added ~100 TB in the past year
Centralized Disk Storage (Cont.)
● Scalability issues with NFS (network-limited to ~70 MB/s max per server [75-90 MB/s max local I/O] in our configuration): testing of new network storage models, including Panasas and IBRIX, in progress
– Panasas tests look promising: 4.5 TB of storage on 10 blades available for evaluation by our user community; DirectFlow client in use on over 400 machines
– Both systems allow for NFS export of data
Centralized Disk Storage: AFS
● Moving servers from Transarc AFS running on AIX to OpenAFS 1.2.11 on Solaris 9
● The move from Transarc to OpenAFS motivated by Kerberos4/Kerberos5 issues and Transarc AFS end of life
● Total of 7 fileservers and 6 DB servers: 2 DB servers and 2 fileservers running OpenAFS
● 2 Cells
Mass Tape Storage
● Four STK Powderhorn silos, each capable of holding ~6000 tapes
● 1.7 PB of data currently stored
● HPSS version 4.5.1: likely upgrade to version 6.1 or 6.2 after RHIC Run 5
● 45 tape drives available for use
● Latest STK tape technology: 200 GB/tape
● ~12 TB disk cache in front of the system
Mass Tape Storage (Cont.)
● PFTP, HSI and HTAR available as interfaces
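A rough sketch of how these interfaces are typically used; the host name and HPSS paths below are hypothetical, not our actual configuration:

```shell
# Illustrative only: all paths and hosts are made up for this example.

# HSI: interactive/scripted file transfer into HPSS
hsi "put run5/raw/event001.dat : /hpss/phenix/run5/event001.dat"
hsi "get /hpss/phenix/run5/event001.dat : restored/event001.dat"

# HTAR: bundles many small files into one tape-friendly archive in HPSS,
# avoiding the overhead of storing each small file individually on tape
htar -c -f /hpss/phenix/run5/events.tar run5/raw/
htar -x -f /hpss/phenix/run5/events.tar
```

HTAR is generally preferred for large numbers of small files, since tape drives perform best with large sequential transfers.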
CAS/CRS Farm
● Farm of 1423 dual-CPU (Intel) systems
– Added 335 machines this year
● ~245 TB local disk storage (SCSI and IDE)
● Upgrade of RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (+updates) underway: should be complete before the next RHIC run
CAS/CRS Farm (Cont.)
● LSF (5.1) and Condor (6.6.6/6.6.5) batch systems in use; upgrade to LSF 6.0 planned
● Kickstart used to automate node installation
● GANGLIA + custom software used for system monitoring
● Phasing out the original RHIC CRS batch system: replacing it with a system based on Condor
● Retiring 142 VA Linux 2U PIII 450 MHz systems after the next purchase
Security
● Eliminating NIS: complete transition to Kerberos5/LDAP in progress
● Expect K5 TGT to X.509 certificate transition in the future: KCA?
● Hardening/monitoring of all internal systems
● Growing web service issues: unknown services accessed through port 80
Grid Activities
● Brookhaven planning on upgrading external network connectivity to OC48 (2.488 Gbps) from OC12 (622 Mbps) to support ATLAS activity
● ATLAS Data Challenge 2: jobs submitted via Grid3
● GUMS (Grid User Management System)
– Generates grid-mapfiles for gatekeeper hosts
– In production since May 2004
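For context, a grid-mapfile maps a grid certificate subject DN to a local Unix account; GUMS generates these automatically for the gatekeepers. A sketch of the standard format, with hypothetical DNs and account names:

```
"/DC=org/DC=doegrids/OU=People/CN=Jane Doe 12345" atlas001
"/DC=org/DC=doegrids/OU=People/CN=John Smith 67890" phenix01
```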
Storage Resource Manager (SRM)
● SRM: middleware providing dynamic storage allocation and data management services
– Automatically handles network/space allocation failures
● HRM (Hierarchical Resource Manager)-type SRM server in production
– Accessible from within and outside the facility
– 350 GB cache
– Berkeley HRM 1.2.1
dCache
● Provides a global name space over disparate storage elements
– Hot spot detection
– Client software accesses data through the libdcap library or the libpdcap preload library
● ATLAS & PHENIX dCache pools
– PHENIX pool expanding performance tests to production machines
– ATLAS pool interacting with HPSS using HSI: no way of throttling data transfer requests as of yet
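The two client access paths above can be sketched as follows; the door host, port, and pnfs path are hypothetical:

```shell
# Illustrative only: host, port, and paths are made up for this example.

# libdcap path: the dccp copy client speaks the dcap protocol natively
dccp dcap://dcache-door.example.gov:22125/pnfs/example.gov/atlas/file.root /tmp/file.root

# libpdcap path: preload the library so an unmodified application's
# POSIX open/read calls on dcap:// URLs are intercepted and redirected to dCache
LD_PRELOAD=libpdcap.so ./analysis dcap://dcache-door.example.gov:22125/pnfs/example.gov/atlas/file.root
```

The preload approach lets existing applications use dCache without relinking against libdcap.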