AR5 Data and Product Access Architecture
Concepts for Discussion
Steve Hankin (NOAA/PMEL)
(Not including metadata architecture or security)
June '07 GO-ESSP
You’ve just heard Bryan’s thoughts on requirements (which probably resemble the following)
– User needs, by IT sophistication level (WG*)
WG1 – physical processes
– Raw files (on native grids)
– CF subsets (potentially large, e.g. global)
  Native grid and regridded
– Broad range of analyses (scope tbd by science community)
– Intercomparison on hi-res global fields
– Visualizations, tables, animations, …
WG2,3 – regional impacts on life and societies; mitigation
– CF subsets (regional)
– Basic analysis (e.g. area averages, extrema)
– Intercomparison on regional scale
– Visualizations, tables
– tab-delimited (“Excel”)
– viz on globe (e.g. Google Earth), animations, …
Requirements, cont’d
– Provider needs by IT capabilities level
  20-30 (est. 28?) contributing orgs
  Some providers not able to serve own data
  Deployable AR5 components (if any) must install easily at various infrastructures
  User authentication/access control
– Data volumes
  200+ TB (ESG proposal) – 20,000 TB (Bryan)
How AR4 did it
– Central DB
– Data sent on hard drives by postal service
– All data regridded to same grid
– QC via CMOR, run at sites (scalable)
– Some central analysis (summaries)
– Massive data distribution from a central point
AR4 Data Base:
• 30 Tbyte data collection
• 61,000 files
AR4 stumbling blocks
Show stoppers:
– Some ocean models could not be regridded to the AR4 grid without information loss (solved?)
Difficulties
– Unreliable disk drives
– Headache to match CMOR requirements
– No doubt many other war stories …
Could we adapt the AR4 approach to AR5?
ESG proposal asserts, “No”.
“With an increasing number of users and an increasing quantity of data, it will no longer be feasible to carry out the requirements of AR5 with the centralized data management strategy utilized for AR4.”
Well, that’s the party line, anyway.
Assertion: if necessary a centralized solution is again possible
Centralized approach
Ship disks again
– Disk drives today: $250 = 500 Gbytes
– By AR5 time (24 months?), say, 2-5 Tbytes of disk could reasonably be mailed from each modeling site
– With insistence on a standard drive model, might retain data on original disks
– Up to 150 Tbyte by this means
– Who would step forward to take this burden?
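As a rough check on the "up to 150 Tbyte" figure, the arithmetic can be sketched as below. It assumes the 20-30 contributing organizations quoted in the requirements slide and the 2-5 Tbytes per modeling site guessed above; the function name `shipped_volume_tb` is purely illustrative.

```python
def shipped_volume_tb(n_sites, tb_per_site):
    """Total volume (Tbytes) if every modeling site mails tb_per_site of disk."""
    return n_sites * tb_per_site

# Ranges taken from the slides: 20-30 contributing orgs, 2-5 Tbytes each.
low = shipped_volume_tb(20, 2)
high = shipped_volume_tb(30, 5)
print(f"Shipped volume: {low}-{high} Tbytes")

# Price point quoted on the slide: $250 buys 500 Gbytes.
dollars_per_gbyte = 250 / 500
print(f"Disk cost today: ${dollars_per_gbyte:.2f} per Gbyte")
```

The upper bound reproduces the 150 Tbyte on the slide, which sits below the 200+ TB lower estimate quoted from the ESG proposal.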
Centralized approach
All data regridded to standard grid
– Accept a sub-optimal resolution, but add
  GODAE-style hi-res fields (surface-only, selected sections and time series, etc.)
  Hi-res analysis results, e.g. vertical integrals
Could we adapt the AR4 approach to AR5?
Major burdens on [whatever] host organization
– Financial
– Sysadmin headaches
– Network loads
– IO loads from subsetting
Compromises in the flexibility of analyses (due to pre-computed fields)
But it could work …
Why make this point?
The IT challenges that we are debating are an opportunity to demonstrate a new way of doing things
– The risk is that we disappoint ourselves (as much as to AR5 science)
What we want to demonstrate:
– A “data grid”: a scalable, distributed approach
– The potential of IT to improve how science is done
– Enhanced collaboration
Time Tables
Distributed technology has to be demonstrated in time for AR5 planners to make decisions.
18 months from now (“early 2009” in the SciDAC proposal) for functioning testbed
– Conclusions:
  Few (if any) new “standards” can be considered. Must work with the ones we have.
  Consider areas in need of further standardization as testing opportunities
  Code components should be running at at least a BETA level by (?when? 12 months?) [group sense?]
Proposal: ESG Data and Product Access Stack
Layer: services (protocols)
– netCDF-CF files: FTP
– atomic datasets (aggregations): OPeNDAP & WCS
– analyses (incl. regridding): OPeNDAP & WCS (*)
– products (viz, etc.): multiple (**)
* - analysis embedded in URL. No syntax standard. (F-TDS?)
** - LAS request protocol; TDS/netCDF “fileout”; WMS?
ESG Data and Product Access Stack
Layer: products
– netCDF-CF files: raw files
– atomic datasets (aggregations): desktop access & subsets
– analyses (incl. regridding): desktop access & subsets
– products (viz, etc.): visualizations, tables & scripts
Data suppliers
[Diagram: the internet, two gateway nodes, and three data nodes]
How to distribute the layers on the nodes? Which operations are feasible over the internet?
Layer: product, size of single data requests
– netCDF-CF files: raw files, O(1TB)
– atomic datasets (aggregations): desktop access & subsets, O(10GB)
– analyses (incl. regridding): desktop access & subsets, O(0.1-10GB)
– products (viz, etc.): visualizations, tables & scripts, O(1-10MB)
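One way to approach the feasibility question is to estimate idealized transfer times for the request sizes listed on this slide. The sketch below is illustrative only: the 100 Mbit/s sustained bandwidth is an assumed figure that does not come from the slides, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Idealized transfer time: no protocol overhead, no contention."""
    return size_bytes * 8 / bandwidth_bits_per_s

ASSUMED_BANDWIDTH = 100e6  # 100 Mbit/s; an assumption, not from the slides

# Upper ends of the request sizes on the slide, in bytes (decimal units).
request_sizes = {
    "O(1TB)": 1e12,
    "O(10GB)": 1e10,
    "O(1-10MB)": 1e7,
}

for label, size in request_sizes.items():
    secs = transfer_seconds(size, ASSUMED_BANDWIDTH)
    print(f"{label:>10}: {secs:,.1f} s ({secs / 3600:.2f} h)")
```

Under that assumed bandwidth, the raw-file requests take many hours while product-sized requests take about a second, which is the kind of gap the question on the slide points at.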
Proposed deployment of stack layers based on output sizes
Gateway node:
– netCDF-CF files
– atomic datasets (aggregations)
– analyses (incl. regridding)
– products (viz, etc.)
Data node:
– netCDF-CF files
– atomic datasets (aggregations)
– analyses (incl. regridding)
Server-side analysis
Differencing: a standard analysis operation (and a perennial issue for model intercomparisons)
[Diagram: two “any node” stacks, each with netCDF-CF files, atomic datasets (aggregations), and analyses (incl. regridding); operations shown: Difference, Regrid]
Differencing: also doable in the product layer
[Diagram: a gateway node stack (netCDF-CF files, atomic datasets (aggregations), analyses (incl. regridding), products (viz, etc.)) and two further stacks (netCDF-CF files, atomic datasets (aggregations), analyses (incl. regridding)), one labelled “any node”; operations shown: Difference, Regrid, Regrid]
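The differencing slides pair each Difference with a Regrid, since two models on different grids cannot be subtracted point by point. A minimal one-dimensional sketch of that sequence follows. It assumes linear interpolation along a single ascending coordinate, which is a stand-in chosen for brevity and not a regridding method named in the slides; `regrid_linear` and `difference` are hypothetical helpers.

```python
from bisect import bisect_right

def regrid_linear(src_x, src_v, dst_x):
    """Linearly interpolate src_v (defined on ascending src_x) onto dst_x."""
    out = []
    for x in dst_x:
        if not src_x[0] <= x <= src_x[-1]:
            raise ValueError(f"{x} lies outside the source grid")
        # Index of the right-hand neighbour, clamped at the last point.
        i = min(bisect_right(src_x, x), len(src_x) - 1)
        x0, x1 = src_x[i - 1], src_x[i]
        w = (x - x0) / (x1 - x0)
        out.append(src_v[i - 1] + w * (src_v[i] - src_v[i - 1]))
    return out

def difference(grid_a, field_a, grid_b, field_b):
    """Regrid field_b onto grid_a, then return field_a minus field_b."""
    b_on_a = regrid_linear(grid_b, field_b, grid_a)
    return [a - b for a, b in zip(field_a, b_on_a)]

# Two made-up "models" on different latitude grids.
lat_a, temp_a = [0, 1, 2, 3, 4], [10, 11, 12, 13, 14]
lat_b, temp_b = [0, 2, 4], [10, 14, 18]
diff = difference(lat_a, temp_a, lat_b, temp_b)
print(diff)
```

Which grid serves as the common target (here, model A's) is itself a choice that affects the result, which is part of why the slide calls differencing a perennial issue.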
An Existing Implementation
Stack layers: netCDF-CF files; atomic datasets (aggregations); analyses (incl. regridding); products (viz, etc.)
Implementations:
– TDS (w/ HYRAX?)
– F-TDS (a TDS plug-in; “F” for ferret, but applicable to other legacy apps, too)
– LAS (using ferret, CDAT and other legacy apps)
F-TDS
[Diagram: TDS, IOServiceProvider, Ferret (or other legacy app.)]
http://server/_expr_{levitus}{Tave=TEMP[Z=@AVE]}
http://server/_expr_{model(s)}{<expression>}
Data provider supplies own regridding and analysis tools.
[Diagram labels: Java (three times), CDAT, Ferret, Matlab]
(We need to standardize an analysis expression language.)
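The `_expr_` URLs on this slide follow a visible pattern: a dataset name in one pair of braces, then an analysis expression in a second pair. A small sketch of assembling such a URL is given below. The helper `ftds_expr_url` and its `percent_encode` option are hypothetical; the slides do not say whether a server expects the braces and brackets percent-encoded, so that behaviour is an assumption left off by default.

```python
from urllib.parse import quote

def ftds_expr_url(server, dataset, expression, percent_encode=False):
    """Build http://<server>/_expr_{<dataset>}{<expression>}."""
    path = "_expr_{" + dataset + "}{" + expression + "}"
    if percent_encode:
        # Escape braces, brackets, '=' and '@' for clients that require it.
        path = quote(path, safe="")
    return "http://" + server + "/" + path

# Reproduces the first example URL from the slide.
url = ftds_expr_url("server", "levitus", "Tave=TEMP[Z=@AVE]")
print(url)
```

Because the expression text is passed through untouched, the sketch says nothing about the expression syntax itself, which is the part the slide notes still needs standardizing.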
LAS Architecture (v7)
[Diagram labels: UI; LAS API; Product Server; workflow orchestration; metadata (XML); back end request (SOAP); Service API; Backend Services: TDS/OPeNDAP, legacy CDAT, JDBC, legacy Ferret, service proxy; data sources: netCDF files, SQL database, GIS services]
Desktop: Matlab, IDL, IDV, Ferret, GrADS, …
Information Products
netCDF, ASCII, GIS layers
What products should AR5 offer?
A matter of policy tbd:
– Each gateway node offers distinct products (CDAT, NCL, BADC, Ferret, Matlab, …)
or
– Standard set of products
or
– Some combination of these
One style of user experience: access to native coordinates and regridded fields
Plot on Google Earth
• Fine structure materializes as we zoom in
Display to Google Earth ?
An AR5-wide UI through HTML smoke and mirrors
(“sister servers”)
[Diagram: an LAS user interface in the browser (Netscape) in front of a VIRTUAL server; four sites (site 1 to site 4), each with Data, LAS, and Meta]