nci.org.au · @NCInews
© National Computational Infrastructure 2015 · Ben Evans, OzEWEX 2015
Working with High Performance Datasets & Collections
Ben Evans
with L. Wyborn, J. Wang, C. Trenham, K. Druken, R. Yang, P. Larraondo, …, and many in the NCI team
• Our partners ANU, BoM, CSIRO and GA, and many community stakeholders
What is NCI:
• Australia’s most highly integrated e-infrastructure environment
• Petascale supercomputer + highest performance research cloud + highest performance storage in the southern hemisphere
• Comprehensive and integrated expert service
• Experienced Advanced Computing Scientific & Technical Teams
NCI is important to Australia because it:
• Is a national and strategic capability
• Enables research that otherwise would be impossible
• Enables delivery of world-class science
• Enables interrogation of big data, otherwise impossible
• Enables high-impact research that matters; informs public policy
• Attracts and retains world-class researchers for Australia
• Catalyses development of young researchers’ skills
[Diagram: NCI service stack, driven by Research Objectives]
• Research Outcomes
• Communities and Institutions / Access and Services
• Expertise, Support and Development
• HPC Services / Virtual Laboratories / Data-intensive Services
• Integration
• Compute (HPC/Cloud) / Storage / Network
• Infrastructure
NCI: World-class, high-end computing services for research & innovation
Particular focus on Earth System, Environment and Water Management
Australian Research Infrastructure Funding 2006-2015
[Chart: funding in millions of dollars, 2002-2014, broken down by Compute, Data, Tools and Networks]
• Two main tranches of funding:
• National Collaborative Research Infrastructure Strategy (NCRIS)
– $542M for 2006-2011 ($75 M for cyberinfrastructure)
• Super Science Initiative
– $901 million for 2009-2013 ($347M for cyberinfrastructure)
• Annual operational funding of around $180M pa since 2014-2015
– All infrastructure programmes were designed to ensure that Australian research continues to be competitive and to rank highly on an international scale.
RDSI Phase 1 (2011): Infrastructure
• The NCI Proposal in September 2011 was for a High Performance Data Node
• The goal was to:
• Enable dramatic increases in the scale and reach of Australian research by providing nationwide access to enabling data collections;
• Specialise in nationally significant research collections requiring high-performance computational and data-intensive capabilities for their use in effective research methods; and
• Realise synergies with related national research infrastructure programs.
Example of Letter of Support for NCI HPD Node
• “will work with the partners to develop a shared data environment” … where … “there will be agreed standards to enable interoperability between the partners”
• “ it now make sense to explore these new opportunities within the NCI partnership, rather than as a separate agenda that GA runs independently”
…Chris Pigram, CEO, Geoscience Australia 26 July 2011
• The organisational steward of the data collection and the NCI Data Collections Manager mutually agree a plan (DMP) on how the collection will be managed and published
• The DMP enables federated governance of the collection
The Research Data Storage Infrastructure
Source: https://www.rds.edu.au/
Progress on Data Ingest as of 16 October, 2015: ~43 Petabytes in 8 distributed nodes
(NCI node: ~10 PBytes)
• Researchers are able to share, use and reuse significant collections of data that were previously either unavailable to them or difficult to access
• Researchers will be able to access the data in a consistent manner which will support a general interface as well as discipline specific access
• Researchers will be able to use the consistent interface established/funded by this project for access to data collections at participating institutions and other locations as well as data held at the Nodes
Source: https://www.rds.edu.au/project-overview
RDS (Phase 2) targeted outcomes from this infrastructure
Integrated World-class Scientific Computing Environment
• 10 PB+ research data
• Server-side analysis and visualization
• Data services (THREDDS)
• VDI: cloud-scale user desktops on the data
• Web-time analytics software
Data Collections (approx. capacity):
• CMIP5, CORDEX: ~3 PB
• ACCESS products: 2.4 PB
• LANDSAT, MODIS, VIIRS, AVHRR, INSAR, MERIS: 1.5 PB
• Digital Elevation, Bathymetry, Onshore Geophysics: 700 TB
• Seasonal Climate: 700 TB
• Bureau of Meteorology Observations: 350 TB
• Bureau of Meteorology Ocean-Marine: 350 TB
• Terrestrial Ecosystem: 290 TB
• Reanalysis products: 100 TB
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management Collections
http://geonetwork.nci.org.au
National Environment Research Data Collections (NERDC)
10+ PB of Data for Interdisciplinary Science
[Treemap: collection volumes by partner (GA, CSIRO, ANU, International, Other National)]
• CMIP5: 3 PB
• Atmosphere: 2.4 PB
• Earth Observation: 2 PB
• Water/Ocean: 1.5 PB
• Weather: 340 TB
• Geophysics: 300 TB
• Astronomy (Optical): 200 TB
• Bathymetry/DEM: 100 TB
• Marine Videos: 10 TB
Managing 10+ PB of Data for Scalable In-situ Access
• Combined and integrated, the NCI collections are too large to move:
• bandwidth limits the capacity to move them easily
• data transfers are too slow, too complicated and too expensive
• even if the data could be moved, few can afford to store 10 PB on spinning disk
• We need to change our focus to:
• moving users to the data (for sophisticated analysis)
• moving processing to the data
• providing online applications to process the data in situ
• improving the sophistication of users, with our help
• We called for a new form of system design where:
• storage and various types of computation are co-located
• systems are programmed and operated to allow users to interactively invoke different forms of analysis in situ over integrated large-scale data collections
Connecting HPC Infrastructure for Data-intensive Science
• Our experience highlighted the need for balanced systems to enable data-intensive science, including:
• Interconnecting processes and high throughput to reduce inefficiencies
• The need to really care about placement of data resources
• Better communications between the nodes
• I/O capability to match the computational power
• Close coupling of cluster, cloud and storage
NCI’s Integrated High Performance Environment
Earth System Grid Federation:
Exemplar of an International Collaboratory for large scientific data and analysis
Ben Evans, Geoscience Australia, August 2015
‘Big Data’ vs ‘High Performance Data’
http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/big-data-meets-big-data-analytics-105777.pdf
• Big Data is a relative term: the volume, velocity and variety of the data exceed an organisation's storage or compute capacity for accurate and timely decision making
• We define High Performance Data (HPD) as data that is carefully prepared, standardised and structured so that it can be used in Data-Intensive Science on HPC (Evans et al., 2015)
• Need to convert ‘Big data’ collections into HPD by
• Aggregating data into seamless high-quality data products
• Creating intelligent access to self describing data arrays
1964: 1 KB = 2 m of tape, or ~20 punch cards
2014: a 4 GB thumb drive = ~8,000 km of tape, or ~83 million cards
2014: 20 PB of modern storage = ~32 trillion metres of tape, or ~320 trillion cards
NCI, BoM and Fujitsu Collaborative Project 2014-16
Project A: ACCESS Optimisation
• Evaluate and Improve Computational methods and performance for ACCESS-Opt
• NWP: UM, APS3 (Global, Regional, City)
• Seasonal Climate: ACCESS-GC2 (GloSea)
• Data Assimilation: 4D-VAR (Atmosphere), EnKF (Ocean), DART (NCAR)
• Ocean Forecasting and Research: MOM5, CICE/SIS, WW3, ROMS
• Fully Coupled Earth System Model: ACCESS-CM2, ACCESS-ESM, CMIP5/6
Next Gen and Performance Analysis of Earth Systems codes
Project B: Scalability of algorithms, hardware, and other earth systems and geophysics codes
• Tsunami – NOAA MOST and ComMIT
• Data Assimilation – NCAR DART
• Ocean - MOM6, MITGCM, MOM5(WOMBAT), SHOC
• Water Quality and BioGeochemical models – particularly for eReefs
• Hydrodynamic and ecological models - Relocatable Coastal modelling project (RECOM)
• Weather and Convection Research – Non-access (e.g., WRF)
• Groundwater
• Hydrology - ?
• Natural Hazards - Tropical Cyclone (TCRM), Volcanic ash code, ANUGA, EQRM
• Shelf Reanalysis
• Onshore/Offshore seismic data processing
• Earthquake and Seismic waves
• 3D Geophysics: Gravimetric, Magnetotelluric, AEM, Inversion (Forward and Back)
• Earth Observation Satellite data processing
• Hydrodynamics, Oil and Petroleum
• Elevation, Bathymetry, Geodesy – data conversions, grids and processing
Data Platforms of today need to scale down to small users
High-Res, Multi-Decadal, Continental-Scale Analysis
• 27 years of data from Landsat 5 & 7 (1987-2014)
• 25 m nominal pixel resolution
• Approx. 300,000 individual source scenes in approx. 20,000 passes
• Entire archive of 1,312,087 ARG25 tiles ⇒ 93×10¹² pixels
• can be processed in ~3 hours
Water Detection from Space
c/- Geoscience Australia
Scaling down to the smaller users – e.g. AGDC
Do we enable individual scenes to be downloaded for locally hosted small scale analysis? Or do we facilitate small scale analysis, in-situ on data sets that are dynamically updated?
Introducing the National Environmental Research Data Interoperability Platform (NERDIP)
[Layer diagram, bottom to top:]
• Storage: Lustre and other storage options
• Data Library Layer 1: HDF5 (MPI-enabled), HDF5 (serial)
• HP Data Library Layer 2: netCDF-4 (climate/weather/ocean), libgdal (EO), SEG-Y (airborne geophysics line data), FITS, BAG, LAS (LiDAR)
• Metadata Layer: netCDF-CF, HDF-EOS, ISO 19115, RIF-CS, DCAT, etc.; fast "whole-of-library" catalogue
• Services Layer (exposes data models & semantics): OGC WFS, W*S, WPS, WCS, WMS, SOS; OPeNDAP; RDF/LD; direct access
• Workflow engines, Virtual Laboratories (VLs) and science gateways: VGL, AGDC VL, Climate & Weather Systems Lab, Biodiversity & Climate Change VL, VHIRL, Globe Claritas, eReefs
• Data portals: AuScope, TERN, AODN/IMOS, eMAST Speddexes, All Sky Virtual Observatory, ANDS/RDA, Digital Bathymetry & Elevation, data.gov.au, Open Nav Surface
• Tools: models (Fortran, C, C++, MPI, OpenMP; Python, R, MatLab, IDL) and visualisation (Drishti, Ferret, NCO, GDL, GDAL, GRASS, QGIS)
NERDIP: Enabling Multiple Ways to Interact with the Data
[NERDIP layer diagram repeated, highlighting the multiple entry points: data discovery, the data platform, data portals, and direct access for ace users]
Platforms Free Data from the “Prison of the Portals”
• Portals are for visiting, platforms are for building on
• Portals present aggregated content in a way that invites exploration, but the experience is predetermined by the builder's decisions about what is necessary, relevant and useful
• Platforms put design decisions into the hands of users: there are innumerable ways of interacting with the data
• Platforms offer many more opportunities for innovation: new interfaces can be built, new visualisations framed, ultimately new science rapidly emerges
Tim Sherratt http://www.nla.gov.au/our-publications/staff-papers/from-portal-to-platform
NERDIP: Enabling Ace Users to Interact with the Data
[NERDIP layer diagram repeated, highlighting direct access to the data platform for ace users]
NERDIP: Applications Replicating Ways of Interacting with the Data
[NERDIP layer diagram repeated, highlighting the applications, portals and tools built on top of the platform]
NERDIP: Loosely coupling Applications and Data via a Services Layer
[NERDIP layer diagram repeated: application-focussed developers work above the services layer, while data-management-focussed developers work below it]
NERDIP: Infrastructure to Lower Barriers to Entry
[NERDIP layer diagram repeated, highlighting data discovery, data portals and the data platform as lower-barrier entry points alongside direct access for ace users]
http://www.roughlydrafted.com/RD/RDM.Tech.Q2.07/BA1B46C4-4014-44DE-ACBB-61D49A926D00.html
Bring the collections together….
Data size/format: a potential 9.3 PB in netCDF
• netCDF: 7,717 TB (6,799 TB available). Domains: Climate Models, Weather Models and Obs, Water Obs, Satellite Imagery, Other Imagery
• Non-netCDF: 2,579 TB (2,221 TB available), of which 1,600 TB will be converted (bathymetry, Landsat, geophysics). Domains: Satellite Imagery, Astronomy, Medical, Biosciences, Social Sciences, Geodesy, Video, Geological Models, Hazards, Phenology
Straightening out the data format standards
NCI Compliance Checking: Data & Metadata
• Metadata: Use the Attribute Convention for Dataset Discovery (ACDD) (previously Unidata Dataset Discovery Conventions)
• Data: Based on the CF-Checker developed at the Hadley Centre for Climate Prediction and Research (UK) by Rosalyn Hatcher
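The flavour of the metadata side of such a check can be sketched without any netCDF machinery at all. This is a hypothetical, simplified illustration, not NCI's checker: it tests only three of ACDD's "highly recommended" global attributes (`title`, `summary`, `keywords`) against a plain attribute mapping, whereas a real checker reads them from the file and covers the full convention.

```python
# Hypothetical sketch of an ACDD-style metadata check. The attribute names
# below are ACDD's "highly recommended" set; a real checker would pull
# global_attrs from the netCDF file itself.
ACDD_HIGHLY_RECOMMENDED = ("title", "summary", "keywords")

def check_acdd(global_attrs):
    """Return the highly recommended ACDD attributes that are missing
    or empty in the given global-attribute mapping."""
    return [name for name in ACDD_HIGHLY_RECOMMENDED
            if not str(global_attrs.get(name, "")).strip()]

attrs = {"title": "ACCESS reanalysis subset", "summary": ""}
print(check_acdd(attrs))  # → ['summary', 'keywords']
```

Running this over every file in a collection is what turns "we follow the conventions" into an auditable compliance report, which is the next slide's point.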
Auditing the system for data standards compliance
Grid Diversity in CMIP5
Downstream communities may not wish to deal with different grids, but the modelling communities generate data appropriate to them.
[Figure: ocean model grid with a Mercator grid in the south and a tripolar grid in the north]
CMIP6 WGCM Infrastructure Panel recommendations
• Use netCDF4 with lossless compression as the data format for CMIP6.
• Lossless compression with zlib (deflate level 2 plus the shuffle filter) is expected to give roughly a 2× reduction in data volumes (varying with the entropy, or noisiness, of the data). This requires upgrading the entire toolchain (data production and consumption) to netCDF4.
• Recommends the use of standard grids for datasets where native-grid data is not strictly required. For example, the CLIVAR Ocean Model Development Panel (OMDP) may request World Ocean Atlas (WOA) standard grids (1°×1°, 0.25°×0.25°) as the target grid of choice.
• No progress on adoption of standard calendars.
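The deflate-plus-shuffle recommendation is easy to demonstrate. Since netCDF-4 stores its variables as HDF5 datasets, the sketch below uses h5py directly; the file names, shapes and the smooth synthetic field are invented for illustration, and real model output will compress by a different (typically smaller) factor depending on its entropy.

```python
import os
import h5py
import numpy as np

# Hypothetical demo of the CMIP6 recommendation: zlib deflate level 2 + shuffle.
# A smooth, repetitive field compresses very well; noisy fields compress less.
x = np.arange(100, dtype="f4")
field = np.broadcast_to(np.sin(x / 10.0), (100, 100, 100)).copy()

with h5py.File("raw.h5", "w") as f:
    f.create_dataset("tas", data=field)  # no compression

with h5py.File("packed.h5", "w") as f:
    f.create_dataset("tas", data=field, chunks=(10, 100, 100),
                     compression="gzip", compression_opts=2, shuffle=True)

raw, packed = os.path.getsize("raw.h5"), os.path.getsize("packed.h5")
print(f"compression ratio: {raw / packed:.1f}x")  # ratio depends on data entropy
```

The same settings are reachable from the netCDF side (e.g. a `zlib=True, complevel=2, shuffle=True` variable in netCDF4-python), which is why the recommendation forces the whole toolchain onto netCDF4.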
IO Performance Tuning Metrics

File systems (local/direct methods for accessing the data):
• VDI:/short (NFS-CEPH): 2 cores, 32 GB
• VDI:/g/data1 (NFS-Lustre): 2 cores, 32 GB
• VDI:/local (local SSD): 2 cores, 32 GB
• Raijin:/short (Lustre): 16 cores/node, 32 GB/node
• Raijin:/g/data2 (Lustre): 16 cores/node, 32 GB/node

Tuning parameters:
• Lustre: stripe count, stripe size, alignment
• MPI-IO: data sieving, collective buffer, transaction size
• HDF5: chunk pattern, chunk cache size, metadata cache, compression
• General: file type, file size, access pattern, concurrency
Serial Write Throughput
[Bar chart: write throughput (MB/s, 0-900) by interface (GTIFF, HDF5, NC_CLASSIC, NC4, NC4_CLASSIC) for GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL]
Serial Read Throughput
[Bar chart: read throughput (MB/s, 0-1000) by interface (GTIFF, HDF5, NC_CLASSIC, NC4, NC4_CLASSIC) for GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL]
Parallel Write Throughput
[Chart: independent write throughput (MB/s, up to ~1800) vs Lustre stripe count (1-128) for HDF5, MPI-IO and POSIX; MPI size = 16, stripe size = 1 MB, block size = 8 GB, transfer size = 32 MB]
Low performance when using default parameters
Parallel Read Throughput
[Chart: independent read throughput (MB/s, up to ~7000) vs Lustre stripe count (1-128) for HDF5, MPI-IO and POSIX; MPI size = 16, stripe size = 1 MB, block size = 8 GB, transfer size = 32 MB]
Low performance when using default parameters
A Simple Comparison of File formats and compression
Contiguous Access: compressed vs non-compressed
Read throughput (MB/s) for the source file (19 MB) vs the normal file (121 MB), by transfer size:

Transfer count:  1×5500   5×5500   10×5500  20×5500  40×5500  55×5500  100×5500  550×5500  1000×5500  5500×5500
Transfer size:   22 kB    110 kB   220 kB   440 kB   880 kB   1.21 MB  2.2 MB    12.1 MB   22 MB      121 MB
Raijin (src):    182.39   229.19   216.08   218.45   220.58   220.74   222       203.86    192.56     189.49
VDI (src):       199.8    248.23   235.42   238.15   239.28   241.25   228.93    219.08    217.09     220.75
Raijin (nrm):    479.77   790.84   804.79   848.09   888.82   887.62   889.31    800.65    710.48     544.84
VDI (nrm):       1473.45  3965.39  4521.17  5182.22  5785.31  4972.67  5898.41   3818.48   3162.54    2066.02
Chunk Cache
HDF5 Chunked Storage
• Data is stored in chunks of predefined size
• A two-dimensional instance may be referred to as data tiling
• The HDF5 library writes/reads whole chunks
[Diagram: contiguous vs chunked layout]
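The whole-chunk read behaviour can be seen directly from h5py. This is an illustrative sketch with invented file and dataset names; the 550×550 array with 110×110 chunks is a small stand-in for the slide's 5500×5500 array tiled into 1100×1100 chunks.

```python
import h5py
import numpy as np

# Chunked ("tiled") storage: the library always does I/O in whole chunks,
# so the subset shape relative to the chunk shape drives performance.
data = np.arange(550 * 550, dtype="f4").reshape(550, 550)

with h5py.File("tiles.h5", "w") as f:
    f.create_dataset("grid", data=data, chunks=(110, 110))

with h5py.File("tiles.h5", "r") as f:
    dset = f["grid"]
    # Reading one full row crosses 5 chunks (550 / 110), so 5 tiles come
    # off disk; a 110×110 block aligned to the chunk grid touches exactly one.
    row = dset[0, :]
    tile = dset[0:110, 0:110]
    print(dset.chunks, row.shape, tile.shape)
```

This is why the subset-access results later in the deck vary so strongly with subset shape: a 1×5500 read of a 1100×1100-chunked file drags in five large chunks to deliver one row.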
Subset Access – compression, subsetting and caches
[Chart: read throughput (MB/s, 0-400) vs deflate level (0-9), for chunk shape 1100×1100 and subset shapes 1×5500, 275×275 and 1100×1100, each with 4 MB and 32 MB chunk caches]
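The chunk-cache knob behind the 4 MB vs 32 MB comparison is exposed in h5py at file-open time. The sketch below is illustrative (file name, sizes and access pattern are invented): it opens a file with a 32 MB raw-data chunk cache so that a decompressed 1100×1100 float32 chunk (~4.8 MB, larger than HDF5's 1 MB default cache) stays resident across repeated partial reads instead of being re-read and re-inflated each time.

```python
import h5py
import numpy as np

data = np.random.rand(1100, 1100).astype("f4")

with h5py.File("cache_demo.h5", "w") as f:
    f.create_dataset("grid", data=data, chunks=(1100, 1100),
                     compression="gzip", compression_opts=2)

# rdcc_nbytes sets the per-dataset raw data chunk cache for this file handle.
# With 32 MB, the single compressed chunk is decompressed once and then
# serves all the row reads below from memory.
with h5py.File("cache_demo.h5", "r", rdcc_nbytes=32 * 1024**2) as f:
    dset = f["grid"]
    rows = [dset[i, :] for i in range(0, 1100, 100)]
    print(len(rows), "row reads served")
```

If the cache is smaller than one compressed chunk, every one of those row reads pays the full decompress cost, which is the collapse visible in the 4 MB curves on the slide.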
OPeNDAP: DAP2 → netCDF on the wire
[Diagram: a netCDF file served over an HTTP connection in OPeNDAP format appears to the client as a DataArray with shape, dtype, GeoTransform, projection and metadata]
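What "netCDF on the wire" means in practice is that the client asks for a hyperslab and the server ships only that subset. The sketch below builds a DAP2 constraint expression by hand to make the protocol visible; the endpoint URL is hypothetical, and real clients (netCDF-C, xarray, Ferret) construct these requests for you.

```python
# Hypothetical sketch of a DAP2 subset request: a hyperslab is expressed as
# a constraint expression of the form var[start:stride:stop][...] appended
# to the dataset URL.
def dap2_subset_url(base_url, var, *slices):
    """Build a DAP2 data-request URL for a hyperslab of one variable."""
    ce = var + "".join(f"[{start}:{step}:{stop}]"
                       for start, step, stop in slices)
    return f"{base_url}.dods?{ce}"

url = dap2_subset_url(
    "http://example.org/thredds/dodsC/demo.nc",  # illustrative endpoint
    "air_temperature",
    (0, 1, 9),      # first 10 time steps
    (100, 1, 199),  # a latitude band
)
print(url)
```

Because only the requested bytes cross the network, this is one of the mechanisms that lets users analyse a multi-petabyte collection without ever downloading it.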
Tile Map Servers: Serving Maps
[Diagram: a browser client requests map tiles from a WMTS server, which is backed by a THREDDS server (steps 1-4)]
Examples of on-the-fly data delivery
Key Messages on Accessing High Performance Data
• Data at scales of today have to be built as shared global facilities based around national institutions.
• Domain-neutral international standards for data collections and interoperability are critical for allowing complex interactions in HP environments both within and between HPD collections
• No one can do it alone. No one organisation, no one group, no one country has the required resources or the expertise.
• Shared collaborative efforts such as Research Data Alliance, the Earth Systems Grid Federation (ESGF), the Belmont Forum, EarthServer, the Oceans Data Interoperability Platform (ODIP), EarthCube, GEO and OneGeology are needed to realise the full potential of the new data intensive science infrastructures
• It now takes a 'village of partnerships' to raise an 'HPD data centre' in a Big Data world
http://www.onegeology.org/