Slide 1: Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
A Python Framework for Staging of Geo-referenced Data on the Collaborative Climate Community Grid (C3-Grid)
Henning Bergmeyer (http://tr.im/bergmeyer)
German Aerospace Center, Simulation and Software Technology (V.09-06-30)
Most up-to-date version of these slides available at: http://tr.im/ep09_pymodest
Slide 2
DLR: German Aerospace Center
Research Institution, Space Agency, Project Management Agency
Slide 3
Locations and employees
6000 employees across 29 research institutes and facilities at 13 sites.
Offices in Brussels, Paris and Washington.
Sites shown on the map: Koeln, Lampoldshausen, Stuttgart, Oberpfaffenhofen, Braunschweig, Goettingen, Berlin-, Bonn, Trauen, Hamburg, Neustrelitz, Weilheim, Bremen-
Slide 4
Talk Outline
Addressing the Heterogeneity Problem of Climate Data
PyModESt: Modular Extendable Stager in Python
Example snippets from data provider scripts for DLR data sets
Slide 5
The Heterogeneity Problem with Data in Climate Research
Distributed Data Archives throughout Germany
Large Data Quantities (Petabytes)
• Storage at Data Sources (Sensors, Institutes)
Many Data Formats
• Due to nature and purpose of data
• Historic reasons
Slide 6
Scientific Workflow and Use of Data
[Figure: workflow diagram with the nodes Start, Pre-Processing, Simulation Monitoring, Post-Processing, Visualization, End, the decisions "Problems?" and "Optimum reached?" (Yes / No), and the action "Stop Simulation"]
Workflows consist of dependent tasks
Tasks are carried out
• manually
• as local Applications
• as Jobs on compute resources
Scheduler / Batch Systems
• plan execution time of jobs
• on required resources (time, space)
Tasks consume and produce data
• Staging: job data retrieval and storage of result
• access should be standardized
Slide 7
The Collaborative Climate Community Grid C3-Grid (2005-2009)
Transparent Grid Infrastructure for the German Climate Research Community
Uniform access to heterogeneous climate data stores
Publishing
Discovery
Citation
Download
Data Processing Tools
Data Visualization
Standard Tools / Workflows
User Interface: Web Portal
Slide 8
Working with Data on the Portal
Advanced Search Browse by Data Set
Slide 9
C3-Grid Layered Architecture (simplified)
[Figure: layer stack with the boxes Portal, Workflow Management, Data Management, Data Information System, Archive; side labels: Users View, Grid Middleware, Resources]
Archives: WDC Climate, WDC Mare, WDC RSAT, DWD, DKRZ, PIK, GKSS, AWI, MPI-M, IFM-Geomar, FU Berlin, Uni Cologne
Storage Solutions: dCache, OGSA-DAI, Flat Files, … heterogeneous!
Abstraction: Data Request Service + Metadata
Scheduled Tasks:
• Computation
• Data Download
=> custom staging scripts
Slide 12
Stage Request Constraints: Data Selection
Data service receives selection constraints according to metadata
Data Set (Object ID)
Variables as CF Names (Climate and Forecast Metadata Convention), e.g. log_surface_pressure, mole_fraction_of_ozone_in_air, …
Regional Bounds (Longitude, Latitude)
Vertical Bounds and Vertical Coordinate Reference System
Time Period
Data Set Specific Constraints
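To make the constraint list concrete, here is a minimal sketch of how one stage request might look once it reaches a stager as plain Python values. The class name StageRequest and every field name except object_ids (which appears in a later snippet of the talk) are assumptions made for this illustration, not the PyModESt API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class StageRequest:
    """Hypothetical container for the selection constraints of one request."""
    object_ids: list            # data set identifiers (Object IDs)
    variables: list             # variable names as CF names
    lon_min: float              # regional bounds
    lon_max: float
    lat_min: float
    lat_max: float
    time_start: datetime        # time period
    time_end: datetime
    vertical: dict = field(default_factory=dict)   # vertical bounds and CRS
    extra: dict = field(default_factory=dict)      # data set specific constraints

    def validate(self):
        """Reject obviously inconsistent bounds before any data is touched."""
        if not self.object_ids:
            raise ValueError("at least one object id is required")
        if not -90.0 <= self.lat_min <= self.lat_max <= 90.0:
            raise ValueError("latitude bounds out of order or out of range")
        if self.time_start > self.time_end:
            raise ValueError("time period ends before it starts")
        # lon_min > lon_max is deliberately allowed: the region may cross
        # the periodic longitude edge.
        return self
```

Longitude bounds are not order-checked here because a region may wrap around the periodic edge, a case the talk handles explicitly later on.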
Slide 13
Problem solved…
Solved in Middleware: Abstraction of heterogeneous storage access
Standardized Metadata Format (ISO 19115/19139)
Common Set of Request Constraints over Grid Service
…another one brought to the surface
Computer Science Exercises for Meteorologists
Adaptation of storage access to the standard service interface
extend Java service or write external stager script
Implementing Web/Grid Service Features
Concurrency Issues
Error handling
CF Name Translation
Metadata update requires XML processing & Schema knowledge
…
Slide 14
PyModESt helps the data provider
Hide unnecessary complexity
e.g. XML processing
Give Recipes
Guided Template for Data Processors in Python
Make implementation straightforward (e.g. reduce interfaces)
1 Input Channel (Constraints as Python primitives)
1 Output Channel
Settings and Environment Access
Avoid common mistakes
Concurrency of Requests
Let Users stay in their domain
Let them use their own tools
Provide additional tools for common issues
e.g. data grid calculations
Slide 15
Staging Process Skeleton
initializeEnvironment
readStageRequest
selectDataProcessor
chooseRequestMode
handleCancelRequest
handleStageRequest
• (authorize)
• prepare work space
• retrieve xml metadata
• adjust constraints
• retrieve data
• update metadata
• transfer files to target
handleEstimationRequest
• authorize
• retrieve metadata
• adjust constraints
• estimate: Stage Time, File Size
handleExceptions
tidyWorkSpace
• find abandoned temporary files
writeResponses
Necessary implementation effort for DP when using PyModESt (highlighted in the diagram): selectDataProcessor and the steps of handleEstimationRequest and handleStageRequest
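The skeleton reads like a template method: the framework owns the control flow, the data provider only fills in hooks. The sketch below shows that idea under stated assumptions. The hook names are the ones used on the following slides; the function run_stager, the dict-based request, the mode strings and the log format are hypothetical simplifications, not PyModESt code.

```python
class DataProcessor:
    """Hypothetical base class: data providers override the four hooks."""

    def __init__(self, c3env, stage_request):
        self.c3env = c3env
        self.stage_request = stage_request

    def retrieveAndFilterDataFiles(self):
        raise NotImplementedError

    def updateMetaData(self, c3_metadata):
        raise NotImplementedError

    def estimateFileSize(self):
        raise NotImplementedError

    def estimateStageTime(self, stage_instant):
        raise NotImplementedError


def run_stager(processor_types, c3env, stage_request, mode):
    """Run one request through a much simplified version of the skeleton."""
    log = []
    try:
        # selectDataProcessor: look up the class registered for the data set.
        processor_cls = processor_types[stage_request["object_id"]]
        processor = processor_cls(c3env, stage_request)
        # chooseRequestMode
        if mode == "estimate":
            log.append(("size", processor.estimateFileSize()))
            log.append(("time", processor.estimateStageTime(
                stage_request.get("instant"))))
        elif mode == "stage":
            processor.retrieveAndFilterDataFiles()
            metadata = {}
            processor.updateMetaData(metadata)
            log.append(("metadata", metadata))
        else:
            log.append(("cancelled", None))
    except Exception as exc:          # handleExceptions
        log.append(("error", str(exc)))
    finally:
        log.append(("tidied", None))  # tidyWorkSpace
    return log                        # writeResponses
```

Keeping error handling and clean-up in the driver is what lets a provider-specific subclass stay small.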
Slide 16
Implement a Data Processor
Initialize data processor: __init__(c3env, stage_request)
perform common steps for stage request and estimation
e.g. determine data mesh properties
keep references to c3env and stage_request in the data processor context
do not touch base data
Hook 1: retrieveAndFilterDataFiles()
Hook 2: updateMetaData(c3_metadata)
Slide 17
Hook 1: retrieveAndFilterDataFiles()
1. determine required base data files
Data provider knowledge about data store
2. download a base data file to a temporary file
    dld_datafile_name = self.c3env.reserveTempFilePath()
    fln, headers = urllib.urlretrieve(
        url=src_url, filename=dld_datafile_name)
3. extract constraint-fulfilling data and append it to the result file
Use specific tools or libraries
4. update result properties
remember metadata-relevant attributes of the result
5. remove downloaded file to save temporary disk space
    os.remove(fln)
    self.c3env.releaseTempFilePath(dld_datafile_name)
6. continue with step 2 until finished
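The six steps amount to a loop that keeps at most one downloaded base file on disk at a time. A minimal, self-contained sketch of that loop follows. It uses the standard tempfile module in place of c3env.reserveTempFilePath(), and the download and extract callables are hypothetical parameters standing in for the provider-specific parts.

```python
import os
import tempfile


def retrieve_and_filter(source_urls, download, extract, result_path):
    """Process base files one at a time so only one download is on disk."""
    processed = 0
    for src_url in source_urls:
        # Step 2: fetch one base file into a temporary location.
        fd, tmp_path = tempfile.mkstemp(suffix=".dl")
        os.close(fd)
        try:
            download(src_url, tmp_path)
            # Steps 3 and 4: append the matching part to the result file.
            extract(tmp_path, result_path)
            processed += 1
        finally:
            # Step 5: free temporary disk space even if extraction failed.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
    return processed
```

In Python 3, urllib.request.urlretrieve(url, filename) could be passed as the download callable for HTTP sources; the try/finally makes sure a failed extraction does not leave abandoned temporary files behind.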
Slide 18
Hook 2: updateMetadata(md)
Metadata object is automatically initialized from metadata server
Just pass updated attributes to methods of the metadata object
md.removeQuicklook()
md.filterContentInfo()
md.setHorizontalBounds()
md.setVerticalExtent()
md.updateTimePeriod()
md.setObjectId()
md.addLineageProcessStep()
Slide 19
C3Thesaurus - CF Names Translation (based on Guido’s recipe for dict inversion)
Climate and Forecast Convention used in C3-Grid to name variables
Data providers internally often follow other conventions
historic reasons
data file formats (e.g. GRB only supports numeric indexes)
Translates variable names between representations in different scopes
c3t.translateFromC3(self, attributes, tgt_scope)
c3t.translateToC3(self, attributes, src_scope)
{ SCOPE1 : { CF1 : DP1, CF2 : DP2 }, SCOPE2 : … }
SCOPE_MAP = { "g2.de.dlr.wdc.ERS2.GOME.L3.VCD.MONTHLYMEAN.O3" : { "mean_ozone_VCD" : "mean", "standard_deviation_ozone_VCD" : "strd_dev" } }
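A two-way translation over such a nested dict can be sketched as follows. The inversion shown is one common idiom and not necessarily the recipe the slide title refers to; the function names are hypothetical stand-ins for the translateFromC3 / translateToC3 methods, whose real signatures take more arguments.

```python
SCOPE_MAP = {
    "g2.de.dlr.wdc.ERS2.GOME.L3.VCD.MONTHLYMEAN.O3": {
        "mean_ozone_VCD": "mean",
        "standard_deviation_ozone_VCD": "strd_dev",
    },
}


def invert(mapping):
    """Swap keys and values; assumes the values are unique and hashable."""
    return dict(zip(mapping.values(), mapping.keys()))


def translate_from_c3(names, tgt_scope, scope_map=SCOPE_MAP):
    """Map C3-Grid variable names to the provider's names in tgt_scope."""
    table = scope_map[tgt_scope]
    return [table[name] for name in names]


def translate_to_c3(names, src_scope, scope_map=SCOPE_MAP):
    """Map provider names from src_scope back to C3-Grid variable names."""
    table = invert(scope_map[src_scope])
    return [table[name] for name in names]
```

Because the reverse table is derived from the forward one, a data provider only has to maintain a single dictionary per data set.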
Slide 20
Implement the Estimation Request (Scheduling Questions)
1. Verify request constraints
Is the request fulfillable?
2. Estimate result file size
How much storage space does the extracted data need?
3. Estimate staging time
When can I start a process that depends on the data?
4. Offer a contract to the Scheduling System
Do NOT process any base data files for this
Hook 3: estimateFileSize() returns long
Hook 4: estimateStageTime(stage_instant) returns timedelta
Slide 21
Hook 3: estimateFileSize()
Return a long value (Size in Bytes)
For regular Raster / Table Data simply multiply and sum up
Use “GaussianGridHelper” to calculate table index ranges on Gaussian Grids
Handle regions overlapping the longitude periodic edge
    gauss_grid_hlp = RegularGaussianGridHelper(
        src_lat_min, src_lat_max, lat_delta, lat_len,
        src_lon_min, src_lon_max, lon_delta, lon_len)
    lat_idx_min, lat_idx_max, lat_idx_len, lon_idx_min, lon_idx_max, lon_idx_len = \
        gauss_grid_hlp.calculateRegionIndices(lat_min, lat_max, lon_min, lon_max)
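The "multiply and sum up" rule, including the wrap past the periodic longitude edge, can be illustrated without touching any base data. This is a sketch under assumptions: it does not reproduce RegularGaussianGridHelper, and the function names, the 4-byte default value size and the header_bytes parameter are hypothetical.

```python
def lon_index_count(lon_idx_min, lon_idx_max, lon_len):
    """Number of longitude columns selected, allowing a wrap past the edge."""
    if lon_idx_min <= lon_idx_max:
        return lon_idx_max - lon_idx_min + 1
    # Region crosses the periodic edge: tail of the row plus its head.
    return (lon_len - lon_idx_min) + (lon_idx_max + 1)


def estimate_file_size(n_variables, n_time_steps, lat_idx_len, lon_idx_len,
                       bytes_per_value=4, header_bytes=0):
    """Multiply the table dimensions and sum up over all variables."""
    cells = n_time_steps * lat_idx_len * lon_idx_len
    return header_bytes + n_variables * cells * bytes_per_value
```

For example, two variables over 12 monthly steps on a 61 x 20 cell region at 4 bytes per value come to 2 * 12 * 61 * 20 * 4 = 117120 bytes, computed from index ranges alone.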
Slide 22
Hook 4: estimateStageTime(stage_instant)
Given an instant for a Stage Request, when will the requested data be available?
Return a datetime.timedelta value
It is practically impossible to be precise in a complex dynamic system
theoretically statistics over past requests could be used
heuristically compare with sensor data of current server and network loads
Currently DLR implementations generously over-estimate with a constant: timedelta(seconds=60)
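The constant over-estimate and the "statistics over past requests" idea can be combined in a few lines. This is only a sketch of the theoretical option the slide mentions: the function name, the list of past durations and the safety factor are assumptions, and nothing here reflects what the DLR stagers actually do beyond the 60-second fallback.

```python
from datetime import timedelta
from statistics import mean

# Constant used when nothing is known about earlier requests.
DEFAULT_ESTIMATE = timedelta(seconds=60)


def estimate_stage_time(past_durations, safety_factor=2.0,
                        default=DEFAULT_ESTIMATE):
    """Return a deliberately generous estimate as a timedelta."""
    if not past_durations:
        return default
    avg_seconds = mean([d.total_seconds() for d in past_durations])
    # Over-estimate on purpose: a contract that is kept beats a tight one.
    return timedelta(seconds=avg_seconds * safety_factor)
```

With no history the function falls back to the constant, so it can replace a fixed return value without changing behaviour on day one.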
Slide 23
Associate Data Processors with Data Set
Modularity of Stagers
Data Processor: Hooks + CF Translation for new data sets
The starter script
Contains all configuration settings
Associates data sets with data processors
{ ObjectID : DataProcessor }
Specific DataProcessor sub-class is plugged into the skeleton implementation
    PROCESSOR_TYPES = {
        "g2.de.dlr.wdc.CWF": netcdf_extraction.IPA_NCDF_Processor,
        "g2.de.dlr.wdc.ERS…": wdc_hdf_processing.WDCHDFProcessor }
    PROCESSOR_TYPES[stageRequest.object_ids[0]].retrieveAndFilter…
[Diagram labels: Starter, Config, DA (four times)]
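The { ObjectID : DataProcessor } association is a plain registry lookup. The sketch below adds one thing the slide does not show, a readable error for unregistered data sets; the placeholder classes, the example object IDs and the function name select_data_processor are hypothetical.

```python
class NetCDFProcessor:
    """Placeholder standing in for a provider-specific data processor."""


class HDFProcessor:
    """Placeholder standing in for a second data processor."""


# Hypothetical registry; real keys would be C3-Grid object IDs.
PROCESSOR_TYPES = {
    "example.dataset.netcdf": NetCDFProcessor,
    "example.dataset.hdf": HDFProcessor,
}


def select_data_processor(object_ids, processor_types=PROCESSOR_TYPES):
    """Return the processor class registered for the first object ID."""
    object_id = object_ids[0]
    try:
        return processor_types[object_id]
    except KeyError:
        known = ", ".join(sorted(processor_types))
        raise LookupError(
            "no data processor for %r (known: %s)" % (object_id, known))
```

Adding a new data set then means writing one class and one dictionary entry, with no change to the skeleton.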
Slide 24
ERS2.GOME.L3.VCD.MONTHLYMEAN.O3 (95–05) (Example Data Set 1)
WDC-RSAT: World Data Center for Remote Sensing of the Atmosphere
File Structure
• Base Data: 1 file / month
• File format HDF4
• HTTP download
Retrieval and Processing using PyHDF library
• create new HDF file
• iterate over months covering requested time period
• adjust data describing attributes
Slide 25
PyHDF: Copying Parts of a Table to Another Table
HDF Scientific Data Set
Each variable is in a 2D table (longitude, latitude)
Result file contains 3D tables (date, longitude, latitude)
NumPy (Numerical Python) integration enables easy table operations
Select a 2D part of a table
Insert it into a specific range in a 3D table
tgt_ds[time_idx, :lat_idx_len, :(lon_len-lon_idx_min)] = src_ds[lat_idx_min:(lat_idx_max+1), lon_idx_min:]
tgt_ds[time_idx, :lat_idx_len, (lon_idx_len-lon_idx_max-1):] = src_ds[lat_idx_min:(lat_idx_max+1), :(lon_idx_max+1)]
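The two assignments above are the wrap-around case: the tail of each source row is copied first, then its head. A runnable sketch with plain NumPy arrays (no HDF file involved) shows the same slicing. The function name copy_region and the array shapes are assumptions; the index order [lat, lon] follows the snippet.

```python
import numpy as np


def copy_region(src, tgt, time_idx, lat_idx_min, lat_idx_max,
                lon_idx_min, lon_idx_max):
    """Copy a lat/lon window of a 2D table into one time slice of a 3D table.

    src has shape (lat, lon); tgt has shape (time, lat_idx_len, lon_idx_len).
    """
    lat = slice(lat_idx_min, lat_idx_max + 1)
    if lon_idx_min <= lon_idx_max:
        # Ordinary window: one contiguous block.
        tgt[time_idx, :, :] = src[lat, lon_idx_min:lon_idx_max + 1]
    else:
        # Window crosses the periodic longitude edge: tail first, then head.
        tail = src.shape[1] - lon_idx_min
        tgt[time_idx, :, :tail] = src[lat, lon_idx_min:]
        tgt[time_idx, :, tail:] = src[lat, :lon_idx_max + 1]
    return tgt
```

The split point tail equals lon_len - lon_idx_min, which is the same boundary the slide expresses as lon_idx_len - lon_idx_max - 1.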
Slide 26
Chemical Weather Forecast Demo Data Set (2005)
Institute of Atmospheric Physics
File Structure
1 file / day with 8 time steps / day
file format NetCDF
local file system
Retrieval and Processing using external command line tool CDO
iterate over files corresponding to time period
adjust data describing attributes by analyzing the result file
Slide 27
Reading Structured Output of a Command Line Tool
Climate Data Operators (CDO)
supports a huge set of specific operations for climate data
command line tool
getRegionalBoundsFromNetCDF(filename) returns (xsize, ysize, …)
    sproc = subprocess.Popen("cdo griddes %s" % filename,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    output, err = sproc.communicate()
    retcode = sproc.wait()
    ncdf_props = str(output).split()
    xsize = int(ncdf_props[ncdf_props.index("xsize") + 2])
This part is provided as a re-usable function: callTool(cmd_ln) returns str
Seems trivial (GOOD!) but is incredibly useful
Saved a lot of implementation time
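A present-day take on such a helper might look like the sketch below. It is not the callTool from the slides: it takes an argument list instead of a shell string (which avoids quoting problems with file names), raises on a non-zero exit code, and parses "key = value" lines, the output shape that the index arithmetic in the snippet implies. Both function names are hypothetical.

```python
import subprocess


def call_tool(args):
    """Run a command given as an argument list and return its stdout as text."""
    proc = subprocess.run(args, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            "command failed (%d): %s" % (proc.returncode, proc.stderr.strip()))
    return proc.stdout


def parse_key_values(text):
    """Parse 'key = value' lines into a dict of strings; skip other lines."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props
```

A caller could then write something like int(parse_key_values(call_tool(["cdo", "griddes", filename]))["xsize"]), assuming the tool is installed and prints its grid description in that form.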
Slide 28
Benefits of Using Python for Implementation of the Stager Lib
Easiness
• Comparatively easy to understand and modify by non computer scientists
• Intuitive configuration using dictionaries
• Add new data sets by Copy and Customize DataProcessors (documented DataProcessor template is provided)
Python: Powerful Language
• Access to all constraints as Python types (float, str, dict, ...)
• Python standard libraries (datetime, timedelta, math, …)
Extensibility
• No compilation and re-deployment necessary
• Easy integration of scriptable tools and libraries
• Rich set of useful libraries available (HDF, Numeric, iso8601, PyParsing…)
• Integration of Java and C/C++ Libraries with JPype, ctypes, BOOST
Slide 29
Conclusion
C3-Grid is a grid infrastructure for collaboration of climate researchers
Uniform ISO-annotated Data Access
Tools and complex workflows in Portal
Python can ease the life of both middleware developers and service implementers
PyModESt makes becoming a data provider easy
Modular development
DP can exploit the strengths of Python
Just implement 4 documented hooks for each data set kind
Provide CF Name translations (dict in dict)
Slide 30
Summary
The C3-Grid is a collaboration infrastructure for the climate science community. Its main aim is to make the infrastructure transparent to users: heterogeneous data resources are abstracted for easy data discovery and access, and users can execute basic manipulation and analysis tasks as well as complex distributed workflows.
PyModESt is a Python framework for the comfortable, modularized implementation of staging scripts for the C3-Grid. It frees meteorological data providers from doing the work of system admins and vice versa.
Slide 31
References
C3-Grid: http://www.c3grid.de
DLR Simulation and Software Technology: http://dlr.de/sc
DLR Institute of Atmospheric Physics: http://www.dlr.de/pa/en/
World Data Center for Remote Sensing of the Atmosphere: http://wdc.dlr.de/
Check out the slideshow of beautiful current satellite images and animations: http://wdc.dlr.de/show/show.php [Sorry, it’s PHP ;-)]
PyHDF: http://pysclint.sourceforge.net/pyhdf/
NumPy: http://numpy.scipy.org/
Climate Data Operators: http://www.mpimet.mpg.de/fileadmin/software/cdo/
JPype (Dynamically use Java in Python): http://jpype.sourceforge.net/
ctypes (Integrate C/C++ libraries in Python): http://docs.python.org/library/ctypes.html
These slides on slideshare (always up-to-date): http://tr.im/ep09_pymodest