Reading HDF family of formats via NetCDF-Java / CDM
John Caron
UCAR/Unidata
NetCDF-Java library
• 100% Java
• Open Source (LGPL, MIT)
• Independent implementation
• Used as a component in other software (partial list)
– Integrated Data Viewer, THREDDS Data Server (Unidata)
– Panoply (NASA)
– ncBrowse (EPIC/NOAA)
– Java NEXRAD Viewer (NCDC/NOAA)
– MyWorld GIS (Northwestern)
– EDC for ArcGIS, ERDDAP (SFSC/NOAA)
– Live Access Server (PMEL/NOAA)
– ncWMS (Reading)
– Matlab plug-in (USGS)
NetCDF-Java/CDM architecture
[Diagram: layered architecture. Labels: Application; Scientific Feature Types; Datatype Adapter; NetcdfDataset; CoordSystem Builder; NcML; NetcdfFile; I/O service provider (NetCDF-3, NetCDF-4, HDF5, GRIB, GINI, NIDS, Nexrad, DMSP, …); OPeNDAP; THREDDS; Catalog.xml]
Format Readers (IOSP)
• General: NetCDF, HDF5, HDF4, OPeNDAP
• Gridded: GRIB-1, GRIB-2, GEMPAK
• Radar: NEXRAD 2&3, DORADE, CINRAD, Universal Format
• Point: BUFR, ASCII
• Satellite: DMSP, GINI, McIDAS AREA
• Misc: GTOPO, Lightning, etc
• Others in development (partial list):
– AVHRR, GPCP, GACP, SRB, SSMI, HIRS (NCDC)
Lines of Code (est)
          LOC    semicolons  ratio LOC  ratio semi
netcdf3   1977   846         1          1
hdf4      3151   1405        1.6        1.7
hdf-eos   3737   1695        1.9        2.0
hdf5      5735   2672        2.9        3.2
common    28121  9267
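The two ratio columns are each reader's count divided by the netcdf3 count. A quick check of that arithmetic, using only the numbers in the table (the variable names here are ours):

```python
# Lines of code and semicolon counts per reader, copied from the table.
counts = {
    "netcdf3": (1977, 846),
    "hdf4": (3151, 1405),
    "hdf-eos": (3737, 1695),
    "hdf5": (5735, 2672),
}

base_loc, base_semi = counts["netcdf3"]

# Ratio of each reader to the netcdf3 baseline, rounded to one decimal.
ratios = {
    name: (round(loc / base_loc, 1), round(semi / base_semi, 1))
    for name, (loc, semi) in counts.items()
}

for name, (loc_ratio, semi_ratio) in ratios.items():
    print(f"{name:8s} LOC x{loc_ratio}  semicolons x{semi_ratio}")
```

By either measure the HDF5 reader is roughly three times the size of the NetCDF-3 reader.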
Why all the trouble?
• ~20-40% of C/C++ time spent on portability issues
• Platform Independence
– Linux, Solaris, Windows (Sun)
– Mac OS X (Apple)
– AIX, Linux, Windows, z/OS (IBM)
– HP-UX (Hewlett-Packard)
• Programmer productivity
– Object-Oriented
– Garbage collected: no memory leaks
– Rich libraries
– Open source
• Faster than C for some applications
Independent implementation
• Written entirely from reading HDF4, HDF5 file specifications
• Helped debug (HDF5), validate file specs
• File format spec is what will be needed in 100 years to read legacy data
– OTOH, semantics not always obvious
• Don’t confuse reference implementation with the file/protocol specification
HDF family of formats
• HDF5/NetCDF-4
• HDF4
• HDF-EOS
• Note: read-only, no parallel I/O, etc
HDF5/NetCDF4
• Goal is to read all HDF5
– Can read all HDF5 files for which we have examples
– Including references, soft links
– Complete coverage difficult to guarantee: combinatoric explosion
• Some esoteric features we are skipping
– File drivers, external files, szip compression
• Working on a comprehensive test harness
– JNI interface to NetCDF-4/HDF5 library
– Read every byte and compare
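The "read every byte and compare" idea can be sketched independently of either library. In this sketch the two readers are stand-ins (plain dicts mapping a variable name to raw bytes), not the NetCDF-Java or HDF5 APIs; a real harness would presumably fill them from the Java reader and from the C library through JNI.

```python
def compare_readers(reference, candidate):
    """Return a list of human-readable mismatches between two readers.

    Both arguments are hypothetical stand-ins: dicts mapping a variable
    name to the raw bytes that reader returned for it.
    """
    problems = []
    for name in sorted(set(reference) | set(candidate)):
        if name not in candidate:
            problems.append(f"{name}: missing from candidate")
        elif name not in reference:
            problems.append(f"{name}: missing from reference")
        elif len(reference[name]) != len(candidate[name]):
            problems.append(
                f"{name}: length {len(reference[name])} != {len(candidate[name])}"
            )
        else:
            # Same length: report only the first differing byte offset.
            pairs = zip(reference[name], candidate[name])
            for offset, (ref_byte, cand_byte) in enumerate(pairs):
                if ref_byte != cand_byte:
                    problems.append(f"{name}: first difference at byte {offset}")
                    break
    return problems


# Made-up data standing in for what each library would return.
c_library = {"lat": b"\x00\x01\x02\x03", "lon": b"\x10\x11"}
java_reader = {"lat": b"\x00\x01\xff\x03", "lon": b"\x10\x11"}

report = compare_readers(c_library, java_reader)
print(report)
```

An empty report means the two readers agree on every variable they were asked for; it says nothing about features that no test file exercises.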
HDF4 / HDF-EOS
• Complete, works against all examples
• Tested against 400 sample files (27 GB)
– Thanks to Ruth Duerr (NSIDC)
• Spot checked against HDFView
• Need systematic test to compare reading against the HDF4 C Library
Geolocation Primer
Swath
Float lat(245, 33477);
Float lon(245, 33477);
Float time(33477);
Float data(245, 33477);
Just know that it's swath data
• 245 points cross track
• 33477 along the track
• Each scan has a time coordinate
Swath
Float lat(33477, 245);
Float lon(33477, 245);
Float time(33477);
Float data(245, 33477);
Swath
Float lat(999,999);
Float lon(999,999);
Float time(999);
Float data(999,999);
Swath
Float v1(999, 999);
Float v2(999, 999);
Float v3(999);
Float v4(999,999);
If you write data
• Don’t rely on variable name conventions
• Don’t rely on index ordering
• Don’t rely on matching index sizes
• Minimize “you just have to know that…”
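Why matching index sizes cannot be relied on, as a minimal sketch: a reader that only has array lengths to go on can pair axes with coordinates when the lengths differ, and has nothing to go on when they are equal. The function and the names cross_track and along_track are ours, for illustration only.

```python
def guess_axes_by_size(data_shape, coord_sizes):
    """Try to assign each data axis to a coordinate using only lengths.

    coord_sizes maps a coordinate name to its length. The result has one
    entry per data axis: the sorted names whose length matches that axis.
    An axis with more than one candidate is ambiguous.
    """
    return [
        sorted(name for name, size in coord_sizes.items() if size == axis_len)
        for axis_len in data_shape
    ]


# 245 x 33477 swath: the sizes differ, so length alone happens to work.
easy = guess_axes_by_size((245, 33477), {"cross_track": 245, "along_track": 33477})
print(easy)

# 999 x 999 swath: every axis matches both coordinates.
hard = guess_axes_by_size((999, 999), {"cross_track": 999, "along_track": 999})
print(hard)
```

In the second case the guess returns both names for both axes, which is the "you just have to know that…" situation.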
Dimensions
Dimensions:
d1=999;
d2=999;
Variables:
float v1(d1=999, d2=999);
float v2(d1=999, d2=999);
float v3(d2=999);
float v4(d2=999,d1=999);
Good
Variables:
float v1(d1=999, d2=999);
  v1:standard_name = “Latitude”;
float v2(d1=999, d2=999);
  v2:standard_name = “Longitude”;
float v3(d2=999);
  v3:standard_name = “Time”;
float v4(d2=999,d1=999);

Data_type = “Swath”;
Conventions = “My unique name”;
If you write data
• Unique signature
• Specify dimensions
• Identify georeferencing coordinates
• Identify data type
• Units are not optional
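With named dimensions and standard_name attributes, a reader no longer has to guess. A minimal sketch using a toy in-memory model of the variables v1 to v4; this is plain Python data, not a NetCDF-Java structure, and the helper names are ours.

```python
# Toy model: each variable lists its dimension names and attributes.
dimensions = {"d1": 999, "d2": 999}
variables = {
    "v1": {"dims": ("d1", "d2"), "attrs": {"standard_name": "Latitude"}},
    "v2": {"dims": ("d1", "d2"), "attrs": {"standard_name": "Longitude"}},
    "v3": {"dims": ("d2",), "attrs": {"standard_name": "Time"}},
    "v4": {"dims": ("d2", "d1"), "attrs": {}},
}


def find_by_standard_name(variables, standard_name):
    """Return the name of the first variable tagged with standard_name."""
    for name, var in variables.items():
        if var["attrs"].get("standard_name") == standard_name:
            return name
    return None


def coordinates_for(variables, data_name):
    """Tagged variables whose dimensions are all used by the data variable."""
    data_dims = set(variables[data_name]["dims"])
    return sorted(
        name
        for name, var in variables.items()
        if name != data_name
        and "standard_name" in var["attrs"]
        and set(var["dims"]) <= data_dims
    )


time_var = find_by_standard_name(variables, "Time")
# Which axis of v4 is time? Match on dimension name, not on size.
time_axis = variables["v4"]["dims"].index(variables[time_var]["dims"][0])
shape = tuple(dimensions[d] for d in variables["v4"]["dims"])

print(time_var, time_axis, shape)
print(coordinates_for(variables, "v4"))
```

Even though both dimensions have length 999, the time axis of v4 is found unambiguously because the match is on the name d2.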
HDF-EOS, HDF-EOS2
• Read “structural metadata” field to obtain more semantics
• Parse text in “ODL”
– Data type: Swath, Grid, Point
– Dimensions
– Geolocation coordinate variable types: Latitude, Longitude, Time
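Parsing the ODL text can be sketched as a small nesting parser. This is a simplified illustration, not the NetCDF-Java implementation: it assumes one KEY=VALUE statement per line, and the sample text and its keyword names (SwathStructure, DimensionName, Size) are invented for the example and should be checked against real structural metadata.

```python
def parse_odl(text):
    """Parse GROUP/OBJECT blocks of KEY=VALUE lines into nested dicts.

    Simplified sketch: one statement per line, no multi-line values.
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "END":
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key in ("GROUP", "OBJECT"):
            # Open a nested block named by the value.
            child = {}
            stack[-1][value] = child
            stack.append(child)
        elif key in ("END_GROUP", "END_OBJECT"):
            stack.pop()
        else:
            stack[-1][key] = value.strip('"')
    return root


# Invented sample, for illustration only.
sample = """
GROUP=SwathStructure
  GROUP=SWATH_1
    SwathName="ExampleSwath"
    GROUP=Dimension
      OBJECT=Dimension_1
        DimensionName="d1"
        Size=999
      END_OBJECT=Dimension_1
      OBJECT=Dimension_2
        DimensionName="d2"
        Size=999
      END_OBJECT=Dimension_2
    END_GROUP=Dimension
  END_GROUP=SWATH_1
END_GROUP=SwathStructure
END
"""

meta = parse_odl(sample)
swath = meta["SwathStructure"]["SWATH_1"]
dims = {d["DimensionName"]: int(d["Size"]) for d in swath["Dimension"].values()}
print(swath["SwathName"], dims)
```

The top-level group name is what tells the reader the data type (here, a swath); the dimension objects supply the names that the arrays themselves do not carry.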
HDF-EOS, HDF-EOS2
• Good
– Unique signature, identify coordinates and data type
• Not so good
– ODL
– Not using hdf4/5 constructs
• Bad
– No data units
– No time coordinate units!
Better EOS
Variables:
float v1(999, 999);
  v1:standard_name = “Latitude”;
  v1:dims = “d1 d2”;
float v2(999, 999);
  v2:standard_name = “Longitude”;
  v2:dims = “d1 d2”;
float v3(999);
  v3:standard_name = “Time”;
  v3:dims = “d2”;
float v4(999,999);
  v4:dims = “d2 d1”;
NPP (i1.4.0.3_NPP_QUAL)
• Good
– XML better than ODL
• Not so good
– Not using hdf4/5 constructs
• Bad
– No data units
– No time coordinate units!
• Fatal Error: please reboot
– Metadata not in the same file
Summary
• NetCDF-Java reads the entire HDFx family
• Good for Java-philes
• Needs more testing
– Send example files, $
• Dimensions are not optional
• Keep structural and georeferencing metadata in the same file as the data
– Can also have specialized external files
NetCDF-4 and Common Data Model (Data Access Layer)
Dimension primer
Float lat(180);
Float lon(360);
Float alt(20);
Float time(1200);
Float data(1200,20,180,360);
Unique Name!
Float lfip(lfip=180);
Float lflop(lflop=180);
Float zorg(zorg=20);
Float skdf(skdf=1200);
Float dglot(skdf=1200,zorg=20,lfip=180,lflop=180);
Float lfip(180);
Float lflop(180);
Float zorg(20);
Float freebish(1200);
Float dglot(1200,20,180,180);
Float lat(180);
Float lon(180);
Float alt(20);
Float time(1200);
Float data(1200,20,180,180);