The OSG Data Federation
Frank Würthwein
OSG Executive Director, UCSD/SDSC
OSG Data Federation
• This talk provides a high-level introduction to the OSG Data Federation:
- Functionality
- Deployed scale
- Science use so far
- First attempts at quantifying use
Functionality
• Open to any university or national lab to add their data to the federation, hosted on their storage.
- Decentralized control of the federated namespace and the storage hosting it.
- Supports public and proprietary parts of the namespace.
• Scientists see a single “read-only filesystem” across all data origins.
- Data publication into the namespace.
- Data in the global federation namespace is visible across the entire OSG compute infrastructure, including XSEDE resources at SDSC, TACC, and PSC.
- Direct random access into the data from the compute nodes on OSG (see the sketch after this list).
• Caches in the network backbone and at endpoints reduce access latencies.
- GeoIP is used to automatically select the “closest” cache.
- Caches can be configured to cache only part of the namespace => domain-science-specific overlays are possible.
• Multiple origins can serve the same namespace.
- Allows for redundancy through replication that is transparent to the users.
- Allows data to be moved around within the network of origins for operational convenience.
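To make the "direct random access" bullet concrete, here is a minimal Python sketch of a byte-range read against a cache. The cache endpoint, port, and namespace path are hypothetical placeholders, and the federation's actual access protocols (e.g., XRootD alongside HTTP) may differ in detail.

```python
# Minimal sketch of "direct random access" as a byte-range read over HTTP.
# ASSUMPTIONS: the cache endpoint and the namespace path below are
# hypothetical placeholders; real deployments may use other protocols.
import requests

CACHE = "https://cache.example.org:8000"    # hypothetical cache endpoint
PATH = "/my-experiment/run42/data.hdf5"     # hypothetical namespace path

def read_range(path: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes at offset `start` without downloading the file."""
    headers = {"Range": f"bytes={start}-{start + length - 1}"}
    resp = requests.get(CACHE + path, headers=headers, timeout=30)
    resp.raise_for_status()  # expect 206 Partial Content on success
    return resp.content

# Read 1 MiB from the middle of a large file, as an analysis job might.
chunk = read_range(PATH, start=512 * 1024 * 1024, length=1024 * 1024)
print(f"read {len(chunk)} bytes")
```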
OSG Compute Resources
In aggregate, ~200,000 Intel x86 cores are used by ~400 projects across 36 fields of science.
OSG Data Origins
- FNAL: Fermilab-based HEP experiments
- U.Chicago: general OSG community
- Caltech: public LIGO data releases
- UNL: private LIGO data
- SDSC: Simons Foundation
- NCSA: DES and NASA Earth Science (planned)
Caches in network & at endpoints
[Figure: map of OSG data origins (Caltech, SDSC, UNL, FNAL, U.Chicago) and OSG data caches, both in service and planned, on Internet2 and CENIC. Highlighted are a cache in Amsterdam and a cache at the Internet2 peering point with cloud providers in Chicago, where Internet2/commercial-cloud cross connects reach Amazon Direct Connect, Google Dedicated Interconnect, and Microsoft Azure ExpressRoute.]
Top 5 user communities
(Shown for two snapshots: May 2018 and September 2018.)
- “Genomics”: jean, jackwpoehlm
- Public LIGO: gwdata/01
- Private LIGO: ligo
- Astronomy: des
- HEP: minerva
We monitor accesses by namespace, cache location, and compute site.
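As an illustration of this kind of accounting, here is a small Python sketch that aggregates transfer records by top-level namespace. The record fields and values are hypothetical, not the actual OSG monitoring schema.

```python
# Sketch: sum transferred bytes per top-level namespace from access records.
# ASSUMPTION: each record is a dict with hypothetical fields "path", "cache",
# "site", and "bytes"; the real OSG monitoring pipeline uses its own schema.
from collections import defaultdict

def top_namespaces(records, n=5):
    """Return the n namespaces with the largest total bytes served."""
    totals = defaultdict(int)
    for rec in records:
        prefix = rec["path"].lstrip("/").split("/")[0]  # top-level directory
        totals[prefix] += rec["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

records = [  # hypothetical example records
    {"path": "/ligo/frames/a.gwf", "cache": "Amsterdam", "site": "Nikhef", "bytes": 2 * 10**9},
    {"path": "/des/y3/catalog.fits", "cache": "SDSC", "site": "UCSD", "bytes": 5 * 10**8},
    {"path": "/ligo/frames/b.gwf", "cache": "Amsterdam", "site": "SURFsara", "bytes": 3 * 10**9},
]
print(top_namespaces(records))  # -> [('ligo', 5000000000), ('des', 500000000)]
```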
Last 30 days
[Plot: daily transfer volume over the last 30 days, broken down into LIGO public data and LIGO private data; 788 TB total in the last 30 days.]
Last 6 months
[Plot: daily transfer volume over the last 6 months, with y-axis marks at 90 TB/day and 180 TB/day; different colors are different parts of the namespace.]
A case for caching
OSG enabled LIGO to seamlessly use Virgo resources.
55% of OSG-enabled LIGO CPU hours are in Europe; Nikhef + SURFsara represent half of that.
The LIGO workflow reuses each file O(100) times. The total data is only a few TB, but we moved many petabytes’ worth out of UNL before LIGO started using the caches in OSG.
The cache in Amsterdam is an effective way to reduce transatlantic network traffic.
Cache performance for LIGO
• Synthetic workload that behaves like PyCBC.
• Each job pulls 4 randomly selected files into worker-node tmp space.
• Measure the time the transfers take per job vs. the number of parallel jobs in the cluster (a minimal benchmark sketch follows the figure note below).
[Plot: transfer time per job with a local cache vs. without a local cache (running at UCSD, getting data from UNL), at job concurrencies of 100, 500, 1000, and 1500.]
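For flavor, here is a minimal Python sketch of such a synthetic benchmark. In the actual measurement the jobs ran across a cluster, whereas this sketch only threads within one process; the cache URL and file list are hypothetical.

```python
# Sketch: each "job" pulls 4 randomly selected files and reports how long
# the transfers took; sweep the number of concurrent jobs.
# ASSUMPTIONS: files are served over HTTP by a cache; the URL and file list
# are hypothetical, and real jobs would run on separate worker nodes.
import concurrent.futures as cf
import random
import time

import requests

CACHE = "https://cache.example.org:8000"                         # hypothetical
FILES = [f"/ligo/frames/file_{i:04d}.gwf" for i in range(1000)]  # hypothetical

def job(tmpdir: str = "/tmp") -> float:
    """Simulate one worker job: download 4 random files into tmp space."""
    start = time.monotonic()
    for path in random.sample(FILES, 4):
        resp = requests.get(CACHE + path, timeout=300)
        resp.raise_for_status()
        with open(f"{tmpdir}/{path.rsplit('/', 1)[-1]}", "wb") as f:
            f.write(resp.content)
    return time.monotonic() - start

# Sweep the concurrency levels used on this slide.
for workers in (100, 500, 1000, 1500):
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        times = list(pool.map(lambda _: job(), range(workers)))
    print(f"{workers} jobs: mean transfer time {sum(times) / len(times):.1f} s")
```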
More work is ongoing to quantify the performance of the data federation for different applications.
Summary & Conclusion
• OSG operates a Data Federation open to all of science.
• Supporting private and public data.
• Supporting data publication.
• Supporting random IO into files anywhere on OSG.
• Supporting caching in the network and at compute clusters to improve performance:
- Reduce IO on the WAN
- Hide access latencies for random IO
• Looking for researchers to try us out.