RHIC/US ATLAS Tier 1 Computing Facility
Site Report
Christopher Hollowell
Physics Department
Brookhaven National Laboratory
[email protected]
HEPiX, Upton, NY, USA
October 18, 2004
Facility Overview
● Created in the mid 1990's to provide centralized computing services for the RHIC experiments
● Expanded our role in the late 1990's to act as the tier 1 computing center for ATLAS in the United States
● Currently employ 28 staff members; planning to add 5 additional employees in the next fiscal year
Facility Overview (Cont.)
● Ramping up resources provided to ATLAS: Data Challenge 2 (DC2) underway
● RHIC Run 5 scheduled to begin in late December 2004
Centralized Disk Storage
● 37 NFS servers running Solaris 9: recently upgraded from Solaris 8
● Underlying filesystems upgraded to VxFS 4.0
– Issue with quotas on filesystems larger than 1 TB in size
● ~220 TB of fibre channel SAN-based RAID5 storage available: added ~100 TB in the past year
Centralized Disk Storage (Cont.)
● Scalability issues with NFS (network-limited to ~70 MB/s max per server [75-90 MB/s max local I/O] in our configuration): testing of new network storage models, including Panasas and IBRIX, in progress
– Panasas tests look promising: 4.5 TB of storage on 10 blades available for evaluation by our user community; DirectFlow client in use on over 400 machines
– Both systems allow for NFS export of data
Centralized Disk Storage: AFS
● Moving servers from Transarc AFS running on AIX to OpenAFS 1.2.11 on Solaris 9
● The move from Transarc to OpenAFS motivated by Kerberos4/Kerberos5 issues and Transarc AFS end of life
● Total of 7 fileservers and 6 DB servers: 2 DB servers and 2 fileservers running OpenAFS
● 2 Cells
Mass Tape Storage
● Four STK Powderhorn silos, each capable of holding ~6000 tapes
● 1.7 PB of data currently stored
● HPSS version 4.5.1: likely upgrade to version 6.1 or 6.2 after RHIC Run 5
● 45 tape drives available for use
● Latest STK tape technology: 200 GB/tape
● ~12 TB disk cache in front of the system
Mass Tape Storage (Cont.)
● PFTP, HSI and HTAR available as interfaces
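A rough sketch of how these interfaces are typically used; the host name and HPSS paths below are hypothetical, not our actual configuration:

```shell
# Illustrative only: all paths and hosts are made up for this example.

# HSI: interactive/scripted file transfer into HPSS
hsi "put run5/raw/event001.dat : /hpss/phenix/run5/event001.dat"
hsi "get /hpss/phenix/run5/event001.dat : restored/event001.dat"

# HTAR: bundles many small files into one tape-friendly archive in HPSS,
# avoiding the overhead of storing each small file individually on tape
htar -c -f /hpss/phenix/run5/events.tar run5/raw/
htar -x -f /hpss/phenix/run5/events.tar
```

HTAR is generally preferred for large numbers of small files, since tape drives perform best with large sequential transfers.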
CAS/CRS Farm
● Farm of 1423 dual-CPU (Intel) systems
– Added 335 machines this year
● ~245 TB local disk storage (SCSI and IDE)
● Upgrade of RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (+updates) underway: should be complete before the next RHIC run
CAS/CRS Farm (Cont.)
● LSF (5.1) and Condor (6.6.6/6.6.5) batch systems in use; upgrade to LSF 6.0 planned
● Kickstart used to automate node installation
● GANGLIA + custom software used for system monitoring
● Phasing out the original RHIC CRS batch system: replacing it with a system based on Condor
● Retiring 142 VA Linux 2U PIII 450 MHz systems after the next purchase
Security
● Eliminating NIS: complete transition to Kerberos5/LDAP in progress
● Expect K5 TGT to X.509 certificate transition in the future: KCA?
● Hardening/monitoring of all internal systems
● Growing web service issues: unknown services accessed through port 80
Grid Activities
● Brookhaven planning on upgrading external network connectivity to OC48 (2.488 Gbps) from OC12 (622 Mbps) to support ATLAS activity
● ATLAS Data Challenge 2: jobs submitted via Grid3
● GUMS (Grid User Management System)
– Generates grid-mapfiles for gatekeeper hosts
– In production since May 2004
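For context, a grid-mapfile maps a grid certificate subject DN to a local Unix account; GUMS generates these automatically for the gatekeepers. A sketch of the standard format, with hypothetical DNs and account names:

```
"/DC=org/DC=doegrids/OU=People/CN=Jane Doe 12345" atlas001
"/DC=org/DC=doegrids/OU=People/CN=John Smith 67890" phenix01
```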
Storage Resource Manager (SRM)
● SRM: middleware providing dynamic storage allocation and data management services
– Automatically handles network/space allocation failures
● HRM (Hierarchical Resource Manager)-type SRM server in production
– Accessible from within and outside the facility
– 350 GB cache
– Berkeley HRM 1.2.1
dCache
● Provides a global name space over disparate storage elements
– Hot spot detection
– Client software accesses data through the libdcap library or the libpdcap preload library
● ATLAS & PHENIX dCache pools
– PHENIX pool expanding performance tests to production machines
– ATLAS pool interacting with HPSS using HSI: no way of throttling data transfer requests as of yet
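The two client access paths above can be sketched as follows; the door host, port, and pnfs path are hypothetical:

```shell
# Illustrative only: host, port, and paths are made up for this example.

# libdcap path: the dccp copy client speaks the dcap protocol natively
dccp dcap://dcache-door.example.gov:22125/pnfs/example.gov/atlas/file.root /tmp/file.root

# libpdcap path: preload the library so an unmodified application's
# POSIX open/read calls on dcap:// URLs are intercepted and redirected to dCache
LD_PRELOAD=libpdcap.so ./analysis dcap://dcache-door.example.gov:22125/pnfs/example.gov/atlas/file.root
```

The preload approach lets existing applications use dCache without relinking against libdcap.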