The OSG Data Federation
Frank Würthwein
OSG Executive Director, UCSD/SDSC
OSG Data Federation
• This talk provides a high-level introduction to the OSG Data Federation:
- Functionality
- Deployed scale
- Science use so far
- First attempts at quantifying use
Functionality
• Open to any university or national lab to add their data to the federation, hosted on their storage.
- Decentralized control of the federated namespace and the storage hosting it.
- Supports public and proprietary parts of the namespace.
• Scientists see a single “read-only filesystem” across all data origins.
- Data publication into the namespace.
- Data in the global federation namespace is visible across the entire OSG compute infrastructure, including XSEDE resources at SDSC, TACC, and PSC.
- Direct random access into the data from the compute nodes on OSG (see the sketch after this list).
• Caches in the network backbone and at endpoints reduce access latencies.
- GeoIP is used to automatically select the “closest” cache.
- Caches can be configured to cache only part of the namespace => domain-science-specific overlays are possible.
• Multiple origins can serve the same namespace.
- Allows for redundancy through replication that is transparent to the users.
- Allows data to be moved around within the network of origins for operational convenience.
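To make the "direct random access" bullet concrete, here is a minimal Python sketch of a byte-range read against a cache. The cache endpoint, port, and namespace path are hypothetical placeholders, and the federation's actual access protocols (e.g., XRootD alongside HTTP) may differ in detail.

```python
# Minimal sketch of "direct random access" as a byte-range read over HTTP.
# ASSUMPTIONS: the cache endpoint and the namespace path below are
# hypothetical placeholders; real deployments may use other protocols.
import requests

CACHE = "https://cache.example.org:8000"    # hypothetical cache endpoint
PATH = "/my-experiment/run42/data.hdf5"     # hypothetical namespace path

def read_range(path: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes at offset `start` without downloading the file."""
    headers = {"Range": f"bytes={start}-{start + length - 1}"}
    resp = requests.get(CACHE + path, headers=headers, timeout=30)
    resp.raise_for_status()  # expect 206 Partial Content on success
    return resp.content

# Read 1 MiB from the middle of a large file, as an analysis job might.
chunk = read_range(PATH, start=512 * 1024 * 1024, length=1024 * 1024)
print(f"read {len(chunk)} bytes")
```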
OSG Compute Resources
In aggregate, ~200,000 Intel x86 cores are used by ~400 projects across 36 fields of science.
OSG Data Origins
- FNAL: Fermilab-based HEP experiments
- U.Chicago: general OSG community
- Caltech: public LIGO data releases
- UNL: private LIGO data
- SDSC: Simons Foundation
- NCSA: DES and NASA Earth Science (planned)
Caches in network & at endpoints
[Figure: map of OSG data origins (Caltech, SDSC, UNL, FNAL, U.Chicago) and OSG data caches, both in service and planned, on Internet2 and CENIC. Highlighted are a cache in Amsterdam and a cache at the Internet2 peering point with cloud providers in Chicago, where Internet2/commercial-cloud cross connects reach Amazon Direct Connect, Google Dedicated Interconnect, and Microsoft Azure ExpressRoute.]
Top 5 user communities
(Shown for two snapshots: May 2018 and September 2018.)
- “Genomics”: jean, jackwpoehlm
- Public LIGO: gwdata/01
- Private LIGO: ligo
- Astronomy: des
- HEP: minerva
We monitor accesses by namespace, cache location, and compute site.
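As an illustration of this kind of accounting, here is a small Python sketch that aggregates transfer records by top-level namespace. The record fields and values are hypothetical, not the actual OSG monitoring schema.

```python
# Sketch: sum transferred bytes per top-level namespace from access records.
# ASSUMPTION: each record is a dict with hypothetical fields "path", "cache",
# "site", and "bytes"; the real OSG monitoring pipeline uses its own schema.
from collections import defaultdict

def top_namespaces(records, n=5):
    """Return the n namespaces with the largest total bytes served."""
    totals = defaultdict(int)
    for rec in records:
        prefix = rec["path"].lstrip("/").split("/")[0]  # top-level directory
        totals[prefix] += rec["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

records = [  # hypothetical example records
    {"path": "/ligo/frames/a.gwf", "cache": "Amsterdam", "site": "Nikhef", "bytes": 2 * 10**9},
    {"path": "/des/y3/catalog.fits", "cache": "SDSC", "site": "UCSD", "bytes": 5 * 10**8},
    {"path": "/ligo/frames/b.gwf", "cache": "Amsterdam", "site": "SURFsara", "bytes": 3 * 10**9},
]
print(top_namespaces(records))  # -> [('ligo', 5000000000), ('des', 500000000)]
```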
Last 30 days
[Plot: daily transfer volume over the last 30 days, broken down into LIGO public data and LIGO private data; 788 TB total in the last 30 days.]
Last 6 months
[Plot: daily transfer volume over the last 6 months, with y-axis marks at 90 TB/day and 180 TB/day; different colors are different parts of the namespace.]
A case for caching
OSG enabled LIGO to seamlessly use Virgo resources.
55% of OSG-enabled LIGO CPU hours are in Europe; Nikhef + SURFsara represent half of that.
The LIGO workflow reuses each file O(100) times. The total data is only a few TB, but we moved many petabytes’ worth out of UNL before LIGO started using the caches in OSG.
The cache in Amsterdam is an effective way to reduce transatlantic network traffic.
Cache performance for LIGO
• Synthetic workload that behaves like PyCBC.
• Each job pulls 4 randomly selected files into worker-node tmp space.
• Measure the time the transfers take per job vs. the number of parallel jobs in the cluster (a minimal benchmark sketch follows the figure note below).
[Plot: transfer time per job with a local cache vs. without a local cache (running at UCSD, getting data from UNL), at job concurrencies of 100, 500, 1000, and 1500.]
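For flavor, here is a minimal Python sketch of such a synthetic benchmark. In the actual measurement the jobs ran across a cluster, whereas this sketch only threads within one process; the cache URL and file list are hypothetical.

```python
# Sketch: each "job" pulls 4 randomly selected files and reports how long
# the transfers took; sweep the number of concurrent jobs.
# ASSUMPTIONS: files are served over HTTP by a cache; the URL and file list
# are hypothetical, and real jobs would run on separate worker nodes.
import concurrent.futures as cf
import random
import time

import requests

CACHE = "https://cache.example.org:8000"                         # hypothetical
FILES = [f"/ligo/frames/file_{i:04d}.gwf" for i in range(1000)]  # hypothetical

def job(tmpdir: str = "/tmp") -> float:
    """Simulate one worker job: download 4 random files into tmp space."""
    start = time.monotonic()
    for path in random.sample(FILES, 4):
        resp = requests.get(CACHE + path, timeout=300)
        resp.raise_for_status()
        with open(f"{tmpdir}/{path.rsplit('/', 1)[-1]}", "wb") as f:
            f.write(resp.content)
    return time.monotonic() - start

# Sweep the concurrency levels used on this slide.
for workers in (100, 500, 1000, 1500):
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        times = list(pool.map(lambda _: job(), range(workers)))
    print(f"{workers} jobs: mean transfer time {sum(times) / len(times):.1f} s")
```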
More work is ongoing to quantify the performance of the data federation for different applications.
Summary & Conclusion
• OSG operates a Data Federation open to all of science.
• Supporting private and public data.
• Supporting data publication.
• Supporting random IO into files anywhere on OSG.
• Supporting caching in the network and at compute clusters to improve performance:
- Reduce IO on the WAN
- Hide access latencies for random IO
• Looking for researchers to try us out.