Page 1

BIG SCIENCE IN AN ERA OF LARGE DATASETS : BNL

EXPERIENCE & PERSPECTIVEOfer Rind

RHIC/ATLAS Computing FacilityBNL

NSLS-II WorkshopApril 20, 2010

Page 2

THE PROBLEM...

Modern experiments are generating increasingly large amounts of data - on the Petabyte scale - requiring ever greater computing power and creating ever greater storage management issues.

The basic goal hasn't changed: get the data to the user, or the user to the data, in the fastest, most efficient and transparent way possible.

Range of methodologies: Network, Storage Solutions, Job Control, Middleware, Monitoring, Accounting....

Page 3

OUTLINE

• A functional overview of the RACF

• Computing models at the RACF (ATLAS, PHENIX)

• The future and some general thoughts about running a large-scale scientific computing facility

Page 4

THE RACF: AN OVERVIEW

• Formed in the mid-1990s to provide centralized computing resources for the four RHIC experiments

• Located in ITD's Brookhaven Computing Facility

• Role was expanded in the late 1990s to act as the US Tier-1 computing center for the ATLAS experiment at the LHC

[Site map with the RACF location marked: "You are here"]

Page 5

RACF OVERVIEW (CONT.)

• Currently 38 FTEs; full range of services for ~3000 users

• Data production, analysis and archiving

• WAN data distribution, file catalog support

• General services (email, web, user directories, NFS, AFS, ...)

The Latest:

• RHIC Run 10: 200 GeV Au-Au ended 3/18, followed by low-energy scanning (62 GeV and below)

• LHC: 7 TeV run underway (next 18-24 months)

• Small but growing presence from Daya Bay, LBNE, LSST

Page 6

SETTING THE SCALE...

• Processor Farm: ~2000 servers, ~10,000 cores, 64-bit SL5.3

• Access by CONDOR batch locally and via GRID job submission; some interactive (a minimal submission sketch follows at the end of this slide)

• RHIC nodes disk heavy (up to 6 TB/server) for local storage

• Network: Force10, 60 Gbps inter-switch links for ATLAS, 20 Gbps for RHIC, up to 40 Gbps total internet bandwidth

• RHIC Distributed Storage: 1.1 PB local storage on PHENIX farm (dCache); ~1 PB local storage on STAR farm (XRootd)

• ATLAS Distributed Storage: 4.5 PB ATLAS dCache on ~100 Sun/Nexsan NAS servers + 2 PB Data Direct Networks Storage
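
The batch and grid access mentioned above is standard Condor usage rather than anything RACF-specific. As a minimal sketch of what a local submission looks like (the executable name, input file, and job count are hypothetical placeholders, and Python is used only to keep the example self-contained):

```python
# Minimal sketch of a local Condor batch submission. The submit-description
# keywords (universe, executable, queue, ...) are standard Condor syntax;
# the executable, arguments, and job count are hypothetical placeholders.
import subprocess
import tempfile

SUBMIT_DESCRIPTION = """\
universe   = vanilla
executable = run_analysis.sh
arguments  = --input run10_segment_001.root
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.$(Cluster).log
queue 10
"""

def submit_jobs(description: str) -> None:
    """Write a submit description file and hand it to condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(description)
        submit_file = f.name
    # condor_submit reads the description and reports the assigned cluster ID.
    subprocess.run(["condor_submit", submit_file], check=True)

if __name__ == "__main__":
    submit_jobs(SUBMIT_DESCRIPTION)
```

Grid submission works roughly one layer up: jobs arrive through the site's grid gatekeeper and are injected into the same local Condor pool rather than being submitted by the user directly.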

Page 7

• Centralized Storage: ~580 TB BlueArc NFS (home directories and work space); ~7.5 TB AFS (global software repositories)

• Tape Archive: 6 Sun/StorageTEK Tape Libraries, ~50K tape slots, ~15 PB currently on tape

• General: ~250 machines for web servers, DB servers, gateways, centralized monitoring, mail servers, LDAP servers, testbeds, etc.

• Grid Services: Open Science Grid support for ATLAS and STAR
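
To make data access against the storage systems listed above concrete: analysis jobs either read dCache/XRootD-managed files in place or stage them to local scratch first. A hedged sketch of the staging step using the standard xrdcp client (the redirector host and file path are placeholders, not actual RACF endpoints):

```python
# Hypothetical sketch of staging one file from XRootD-managed storage to local
# scratch. xrdcp is the standard XRootD copy client; the redirector hostname
# and remote path below are placeholders, not real RACF endpoints.
import subprocess

def stage_file(redirector: str, remote_path: str, local_path: str) -> None:
    """Copy a single file from an XRootD redirector to local disk."""
    source = f"root://{redirector}/{remote_path}"   # root://host//absolute/path
    subprocess.run(["xrdcp", source, local_path], check=True)

if __name__ == "__main__":
    stage_file(
        redirector="xrootd.example.gov",                  # placeholder host
        remote_path="/star/data/run10/some_events.root",  # placeholder path
        local_path="/tmp/some_events.root",
    )
```

dCache offers the analogous dccp client; which one a job uses depends on which system backs its data.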

Page 8

FACILITY COMPONENTS

[Diagram: the Storage Element (Disk & Tape), the Computing Element, the "Grid", and the Network]

Architecture should fit the computing model...

Page 9

A GENERALIZED COMPUTING MODEL...

Page 10

...BUT DETAILS WILL VARY

• Different experiments, different requirements, different approaches....

• RHIC - more centralized, all “tiers” represented locally but still significant offsite computing

• ATLAS - highly distributed, “transparent, global batch environment”

• LSST/Daya Bay/LBNE - model still in development...

• Good communication between facilities and users is essential → shared effort

A brief look at a couple of illustrative examples....

Page 11

DATA REDUCTION (ATLAS EXAMPLE)

Page 12

FROM DATA MODEL TO COMPUTING INFRASTRUCTURE...

Page 13

US CLOUD TRAFFIC RELATIONSHIPS ACCORDING TO THE ATLAS TIERED COMPUTING MODEL

Page 14

ATLAS COMPUTING AT WORK...

[Figures: 1.5 GB/sec through BNL; 50K jobs; 450 TB; 5 GB/sec]

Page 15

PHENIX COMPUTING

• Largely Centralized: RACF is the primary data store at all levels

• Data intensive: Rely on cost-effective aggregation of local storage distributed across the computing farm (dCache)

• "Analysis Train": Formally ordered process - jobs split and clustered by data subset, then results aggregated (an illustrative sketch of the pattern follows below)
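
The "Analysis Train" bullet above describes a generic split/aggregate pattern: many rider analyses run over the same data segments, and per-segment outputs are merged afterward. The actual PHENIX train code is not shown in these slides; the following is only an illustrative sketch of the pattern, with made-up segment names and a trivial merge step:

```python
# Illustrative sketch of the split/aggregate "analysis train" pattern. Segment
# names, the per-segment analysis, and the merge step are all placeholders; the
# real train runs user modules over locally copied data and merges their outputs.
from concurrent.futures import ProcessPoolExecutor

SEGMENTS = [f"run10_segment_{i:03d}" for i in range(8)]  # hypothetical data subsets

def analyze_segment(segment: str) -> dict:
    """Stand-in for running every rider module over one data segment."""
    # A real pass would open the segment from local storage and fill histograms;
    # here a per-segment event count stands in for the output.
    return {"segment": segment, "events": 1_000_000}

def aggregate(results: list) -> dict:
    """Merge per-segment outputs into a single summary (the aggregation step)."""
    return {
        "segments": len(results),
        "total_events": sum(r["events"] for r in results),
    }

if __name__ == "__main__":
    # Jobs are split by data subset, run in parallel, then the results are merged.
    with ProcessPoolExecutor() as pool:
        per_segment = list(pool.map(analyze_segment, SEGMENTS))
    print(aggregate(per_segment))
```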

Page 16

PHENIX COMPUTING

Run 10: Expect 8.2B events, 250 TB on disk for user analysis input

[Plots: 3 GB/sec; 2 PB]

Train leaves the station... runs on locally copied data

Page 17

The PHENIX Computing Model (c/o Carla Vale)

[Diagram: data from PHENIX flows through HPSS, dCache scratch and regular pools, reconstruction and aggregation, the thumpers and the analysis train (run once a week; 1-2x per run), and working group areas (BlueArc), with copying and reading/writing paths marked.]

Page 18

RACF GROWTH

[Charts: installed computing capacity (KSpecInt2000, scale 0-70,000) and storage capacity (TB, scale 0-40,000) by year, 1999 through 2012 (est.)]

Rapid ramp-up of both computing and storage capacity in recent years

Bumping up against the “Big Three”:

Space, Power, Cooling

Page 19

THE CDCE PROJECT

- 6000 sq. ft. expansion, completed in Fall '09
- 2.3 MW flywheel-based UPS; 2 MW diesel generator
- 3-ft raised floors; AC real estate
- Facility footprint has nearly tripled since 2007

Page 20

A LOOK AHEAD... AND SOME OBSERVATIONS...

Toward the future:

• Balancing computing power vs. storage vs. network

• Growth of multi-core...what are the new bottlenecks?

• Role of flash storage?

• Increasing role of virtualization... "grid" vs. "cloud"?

• What are the “future” computing models?

Observations:

• Flexibility in planning is key

• Importance of communication cannot be overstated

→ Amongst facilities; between stakeholders and service providers

• Explore synergies between fields/disciplines (astro, bio, sls....)

Page 21
