The Parallel System for Integrating Impact Models and Sectors (pSIMS)
Joshua Elliott, Computation Institute, University of Chicago and Argonne National Laboratory
David Kelly, Computation Institute, University of Chicago and Argonne National Laboratory
Neil Best, Computation Institute, University of Chicago and Argonne National Laboratory
Michael Wilde, Computation Institute, University of Chicago and Argonne National Laboratory
Michael Glotter, Department of Geophysical Sciences, University of Chicago
Ian Foster, Computation Institute, University of Chicago and Argonne National Laboratory
ABSTRACT
We present a framework for massively parallel simulations of
climate impact models in agriculture and forestry: the parallel
System for Integrating Impact Models and Sectors (pSIMS). This
framework comprises a) tools for ingesting large amounts of data
from various sources and standardizing them to a versatile and
compact data type; b) tools for translating this standard data type
into the custom formats required for point-based impact models in
agriculture and forestry; c) a scalable parallel framework for
performing large ensemble simulations on various computer
systems, from small local clusters to supercomputers and even
distributed grids and clouds; d) tools and data standards for
reformatting outputs for easy analysis and visualization; and e) a
methodology and tools for aggregating simulated measures to
arbitrary spatial scales such as administrative districts (counties,
states, nations) or relevant environmental demarcations such as
watersheds and river-basins. We present the technical elements of
this framework and the results of an example climate impact
assessment and validation exercise that involved large parallel
computations on XSEDE.
Categories and Subject Descriptors
I.6.8 [Simulation and Modeling]: Parallel – parallel simulation
of climate vulnerabilities, impacts, and adaptations.
General Terms
Algorithms, Performance, Design, Languages.
Keywords
Climate change Vulnerabilities, Impacts, and Adaptation (VIA);
Parallel Computing; Data processing and standardization; Crop
modeling; Multi-model; Ensemble; Uncertainty
1. INTRODUCTION AND DESIGN
Understanding the vulnerability and response of human society to
climate change is necessary for sound decision-making in climate
policy. However, progress on these important research questions
is made difficult by the fact that science and information products
must be integrated across vastly different spatial and temporal
scales. Biophysical and environmental responses to global change
generally depend strongly on environmental (e.g., soil type),
socioeconomic (e.g., farm management), and climatic factors that
can vary substantially over regions at high spatial resolution.
Global Gridded Crop Models (GGCMs) are designed to capture
this spatial heterogeneity and simulate crop yields and climate
impacts at continental or global scales. Site-based GGCMs, like
those described here, aggregate local high-resolution simulations,
and are often limited by data availability and quality at the scales
required by a large-scale campaign. Obtaining the data inputs
necessary for a comprehensive high-resolution assessment of crop
yields and climate impacts typically requires tremendous effort by
researchers, who must catalog, assimilate, test, and process
multiple data sources with vastly different spatial and temporal
scales and extents. Accessing, understanding, scaling, and
integrating diverse data typically involves a labor-intensive and
error-prone process that creates a custom data processing pipeline.
A comparably complex set of transformations must often be
performed once impact simulations have been completed to
produce information products for a wide range of stakeholders,
including farmers, policy-makers, markets, and agro-business
interests. Thus, simulation outputs, like inputs, must be available
at all scales and in familiar and easy to use formats.
To address these challenges and thus facilitate access to high-
resolution climate impact modeling we are developing a suite of
tools, data, and models called the parallel System for Integrating Impact Models and Sectors (pSIMS). Using an integrated multi-
model multi-sector simulation approach, pSIMS leverages
automated data ingest and transformation pipelines and high-
performance computing to enable researchers to address key
challenges. We present in this paper the pSIMS structure, data,
and methodology; describe a prototype use case, validation
methodology, and key input datasets; and summarize features of
the software and computational architecture that enable large-
scale simulations.
Our goals in developing this framework are fourfold, to: 1)
develop tools to assimilate relevant data from arbitrary sources; 2)
enable large ensemble simulations, at high resolution and with
continental/global extent, of the impacts of a changing climate on
primary industries; 3) produce tools that can aggregate simulation
output consistently to arbitrary boundaries; and 4) characterize
uncertainties inherent in the estimates by using multiple models
and simulating with large ensembles of inputs.
Figure 1 shows the principal elements of pSIMS including both
automated steps (the focus of this paper) and those that require an
interactive component (such as specification and aggregation).
The framework is separated into four major components: data
ingest and standardization, user initiated campaign specification,
parallel implementation of specified simulations, and aggregation
to arbitrary decision-relevant scales.
pSIMS is designed to support integration of any point-based
climate impact model that requires standard daily weather data as
inputs. We developed the framework prototype with versions 4.0
and 4.5 of the Decision Support System for Agrotechnology
Transfer (DSSAT [15]). DSSAT 4.5 supports models of 28
distinct crops using two underlying biophysical models (CERES
and CropGRO). Additionally, we have prototyped and tested a
parallel version of the CenW [16] forest growth simulation model
and are actively integrating additional crop yield and climate
impact models starting with the widely used Agricultural
Production Systems Simulator (APSIM) [23]. As of this writing,
continental- to global-scale simulation experiments ranging in
resolution from 3 to 30 arcminutes have been conducted on four crop species and one tree species (Pinus radiata) using a variety of
weather/climate inputs and scenarios.
2. THE pSIMS DATA INPUT PIPELINE
The minimum weather data requirements for crop and climate
impact models such as DSSAT, APSIM, and CenW are typically:
- Daily maximum temperature (degrees C at 2 m above the ground surface)
- Daily minimum temperature (degrees C at 2 m)
- Daily average downward shortwave radiation flux (W/m2, measured at the ground surface)
- Total daily precipitation (mm/day at the ground surface)
It is also sometimes desirable to include daily average wind speed and surface humidity when available.
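To make these requirements concrete, the following sketch (illustrative Python, not part of the pSIMS code base; the variable names are our own shorthand, not pSIMS identifiers) records the minimum and optional daily inputs and checks an input dataset against them.

# Illustrative shorthand names and units; pSIMS itself does not mandate these identifiers.
REQUIRED_DAILY_VARS = {
    "tmax": "degrees C at 2 m",                # daily maximum temperature
    "tmin": "degrees C at 2 m",                # daily minimum temperature
    "srad": "W/m2 at the ground surface",      # daily average downward shortwave radiation
    "precip": "mm/day at the ground surface",  # total daily precipitation
}
OPTIONAL_DAILY_VARS = {
    "wind": "m/s, daily average wind speed",
    "rhum": "%, daily average surface relative humidity",
}

def missing_variables(available):
    """Return the required variables absent from an input dataset's variable list."""
    return sorted(set(REQUIRED_DAILY_VARS) - set(available))

# e.g. missing_variables(["tmax", "tmin", "precip"]) -> ["srad"]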
Hundreds of observational datasets, model-based reanalyses, and
multi-decadal climate model simulation outputs at regional and
global scales exist that can be used to drive climate impact
simulations. For some use-cases, one or another data product may
be demonstrably superior, but most often a clear understanding of
the range of outcomes and uncertainty requires simulations with
an ensemble of inputs. For this reason, input data requirements for
high-resolution climate impact experiments can be large and data
processing and management can be challenging.
One major challenge in dealing with long daily time-series of
high-resolution weather data is that input products are typically
stored one or more variables at a time in large spatial raster files
segmented into annual, monthly, or sometimes even daily or sub-
daily time-slices. Point-based impact models on the other hand,
typically require custom ASCII data types that encode long time-
series of daily data for all required variables for a single point into
one file. The process of accessing hundreds or thousands of
spatial raster files O(10^5–10^6) times to extract the time-series for
each point is time-consuming and expensive [20]. To ameliorate
this challenge, we have established a NetCDF4-based data-type
for the pSIMS framework—identified by a .psims.nc4
extension—and tools for translating existing archives into this
format (Figure 2).
Each .psims.nc4 file represents a single grid cell or simulation site
within the study area. Its contents are one or more 1 x 1 x T
arrays, one per variable, where T is the number of time steps. The
time coordinates of each array are explicit in the definition of their
dimensions, a strategy that facilitates near-real-time updates from
upstream sources. The spatial coordinates of each array are also
explicit in the definition of their dimensions, which facilitates
downstream spatial merging. Variables are grouped and named so
as to avoid namespace collisions with variables from other
sources in downstream merging; e.g., “narr/tmax” refers to the daily maximum temperature variable extracted from the North American Regional Reanalysis (NARR) dataset and “cfsrr/tmax” to the daily maximum temperature from the NOAA Climate Forecast System Reanalysis and Reforecast (CFSRR) dataset.
Figure 1: Schematic of the pSIMS workflow. 1) Data ingest takes data from numerous publicly available archives in arbitrary file formats and data types. 2) The standardization step reconstitutes each of these datasets into the highly portable point-based .psims.nc4 format. 3) The specification step defines a simulation campaign for pSIMS by choosing a .psims.nc4 dataset or subset and one or more climate impact models (and the requisite custom input and output data translators) from the code library. 4) In the translation step, individual .psims.nc4 files (each representing one grid cell in the spatial simulation) are converted into the custom file formats required by the models. 5) In the simulation step, individual simulation calls within the campaign are managed by Swift on the available high-performance resource. 6) In the output reformatting step, model outputs (dozens or even hundreds of time-series variables from each run) are extracted from model-specific custom output formats and translated into a standard .psims.out format. 7) Finally, in the aggregation step, output variables are masked as necessary and aggregated to arbitrary decision-relevant spatial or temporal scales.

Because a given .psims.nc4 file contains information about a single point location, the longitude and latitude vectors used as array dimensions are both of length one. Therefore the typical
weather variable is represented as a 1x1xT array where T is the
number of time steps in the source data. In contrast, arrays
containing forecast variables have two time dimensions, the time
the forecast is made and the time of the prediction. This use of
two time dimensions makes it possible to follow the evolution of a
series of forecasts made over a period of time for a particular
future date, as for example when forecasts of crop yields at a
particular location are refined over time as more information
becomes available.
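As a minimal illustration of this layout, the following Python sketch (using the netCDF4 library; the dimension names, time origin, and placeholder values are assumptions for illustration, not pSIMS specifications) writes a single-site file containing a grouped 1 x 1 x T variable narr/tmax.

from netCDF4 import Dataset
import numpy as np

lat, lon = 41.88, -87.63             # hypothetical site coordinates
ndays = 10957                        # e.g. daily records spanning 1979-2008

nc = Dataset("361_1168.psims.nc4", "w", format="NETCDF4")
nc.createDimension("latitude", 1)    # length-one spatial dimensions: one site per file
nc.createDimension("longitude", 1)
nc.createDimension("time", None)     # unlimited, so near-real-time updates can be appended

nc.createVariable("latitude", "f8", ("latitude",))[:] = [lat]
nc.createVariable("longitude", "f8", ("longitude",))[:] = [lon]
time = nc.createVariable("time", "f8", ("time",))
time.units = "days since 1979-01-01 00:00:00"            # assumed time origin
time[:] = np.arange(ndays)

narr = nc.createGroup("narr")        # one group per source avoids name collisions on merge
tmax = narr.createVariable("tmax", "f4", ("latitude", "longitude", "time"))
tmax.units = "degrees C"
tmax[0, 0, :] = np.random.uniform(-10.0, 35.0, ndays)    # placeholder values
nc.close()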
pSIMS input files are named according to their (row, col) tuple in the global grid and organized in a matching directory structure (e.g.,
/narr/361/1168/361_1168.psims.nc4) so that the terminal
directories hold data for a single site. This improves the
performance of parallel reads and writes on shared filesystems
like GPFS and minimizes clutter while browsing the tree. Because each file represents a single point, the grid resolution is implied only by the organization of the enclosing directory tree; it is recorded in an archive metadata file, as well as in the metadata of each .psims.nc4 file.
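For illustration, a helper along the following lines can map a point to its file in the archive tree. The 1-based row/column indexing convention shown here (rows counted south from 90N, columns east from 180W) is an assumption; the actual convention is recorded in the archive metadata.

import os

def site_path(archive_root, lat, lon, resolution_arcmin):
    """Return the .psims.nc4 path for a point, under an assumed indexing convention."""
    step = resolution_arcmin / 60.0
    row = int((90.0 - lat) / step) + 1
    col = int((lon + 180.0) / step) + 1
    return os.path.join(archive_root, str(row), str(col),
                        "%d_%d.psims.nc4" % (row, col))

# e.g. site_path("/narr", 41.9, -87.6, 30) -> "/narr/97/185/97_185.psims.nc4"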
Climate data from observations and simulations is widely available in open, easily accessible archives, such as those served by the Earth System Grid (ESG) [30], NOAA’s NOMADS servers [26], and NASA’s Modeling and Assimilation Data and Information Services Center (MDISC). These datasets have
substantial metadata—often based on the Climate and Forecast (CF) conventions [8]—and are frequently identified by a unique
Digital Object Identifier (DOI) [21], making provenance and
tracking relatively straightforward. Such data is often archived on uniform grids, but resolutions can vary from a few kilometers to one or more degrees and frequencies from an hour to a month. Furthermore, different map projections mean that coordinate transformations are sometimes necessary before the data can be used.
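Such coordinate transformations can be scripted with standard libraries. The sketch below (using pyproj; the source projection shown, CONUS Albers, is only a stand-in for whatever projection a given dataset actually uses) converts projected grid-cell centers to the geographic coordinates of the pSIMS grid.

from pyproj import Transformer

# EPSG:5070 (CONUS Albers) stands in for the source dataset's native projection.
to_geographic = Transformer.from_crs("EPSG:5070", "EPSG:4326", always_xy=True)

x, y = 192000.0, 1932000.0                   # projected cell center in meters (illustrative)
lon, lat = to_geographic.transform(x, y)     # longitude/latitude on the regular pSIMS grid
print("cell center at lon %.3f, lat %.3f" % (lon, lat))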
3. AN EXAMPLE ASSESSMENT STUDY
We now present an example of a maize yield assessment and
validation exercise conducted in the conterminous US at five-
arcminute spatial resolution. For each grid cell in the study region,
we simulate eight management configurations (detailing fertilizer
and irrigation application) from 1980-2009 using the CERES-
maize model (Figure 3).
Simulations are driven by observation- and reanalysis-based data products comprising NCEP Climate Forecast System Reanalysis (CFSR) temperatures [13], NOAA Climate Prediction Center (CPC) US Unified Precipitation [23], and the NASA Surface Radiation Budget (SRB) solar radiation dataset [27].
Figure 3: Time-series yields for a single site for four fertilizer
application rates w/ (solid) and w/o (dashed) irrigation.
The gridded yield for crop i in grid cell x at time t takes the form

  Y_i(x,t) = f_i( Q(x,t), S(x), M_i(x,t) ),

where we denote explicitly the dependence of yield on local climate (Q; including temperature, precipitation, solar radiation, and CO2), soils (S), and farm management (M; including planting dates, cultivars, fertilizers, and irrigation). Figure 4 shows
summary measures (median and standard deviation) over the
historical simulation period for a single management
configuration (rainfed with high nitrogen fertilizer application
rates) from this campaign.
High-resolution yield maps allow researchers to visualize the
changing patterns of crop yields. However, such measures must
typically be aggregated at some decision-relevant environmental
or political scale, such as county, watershed, state, river basin,
nation, or continent, for use in models or decision-maker analyses.
To this end, we are developing tools to aggregate yields and
climate impact measures to arbitrary spatial scales. For a given
individual region R, crop i, and time t, the basic problem of re-
scaling local yields (Y) to regional production (P) takes the form

  P_i(R,t) = \sum_{x \in R} W_i(x,t) Y_i(x,t),

where the weighting function for aggregation, W_i(x,t), is the area in location x at time t that is engaged in the production of crop i.
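In gridded form this aggregation is simply a weighted sum over the cells of the region. A minimal NumPy sketch (array shapes and units are illustrative, not a pSIMS interface) is:

import numpy as np

def regional_production(yields, crop_area, region_mask):
    """Aggregate gridded yields to regional production P_i(R,t), per the formula above.

    yields:      (T, ny, nx) simulated yield for one crop (e.g. t/ha)
    crop_area:   (T, ny, nx) area producing that crop (e.g. ha) -- the weights W_i(x,t)
    region_mask: (ny, nx) boolean, True for cells inside region R
    Returns per-year production and the area-weighted average yield.
    """
    w = crop_area * region_mask                  # cells outside R contribute nothing
    production = (yields * w).sum(axis=(1, 2))
    return production, production / w.sum(axis=(1, 2))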
Figure 2: Expanded schematic of data types and processing pipeline from Figure 1. The steps in the pipeline are as follows. 1) Data is ingested from arbitrary sources in various file formats and data types. 2) If necessary, the data is transformed to a geographic projection. 3) For each land grid cell, the full time series of daily data is extracted (in parallel) and converted to the .psims.nc4 format. 4) Each set of .psims.nc4 files extracted from a dataset is then organized into an archive for long-term storage. 5) If the input dataset is still being updated, we ingest and process the updates at regular intervals and 6) append the updates to the existing .psims.nc4 archive.
A further use of aggregation is to produce measures of yield that
can be compared with survey data at various spatial scales for
model validation and uncertainty quantification. For example,
yield measures are aggregated to county and state level and
compared with survey data from the US National Agricultural
Statistics Service (NASS) [3] by calculating time-series
correlations between simulated and observed values (Figure 5).
Next we consider RMSE measures for the simulated vs. observed
yields at the various scales. Figure 6 plots the simulated vs.
observed yields at the county and state level: in total, 60423
observations over 30 years and 2373 counties and 1225
observations over 41 states, respectively. The root mean square
error (RMSE) between observed yields and prototype simulated
yields is 25% of the sample mean, and an unconstrained linear fit
of the simulation vs. observation results has an R2 of 0.6. At the state level, the RMSE is 15.7% of the sample mean, and at the national level it is reduced to 10.8% (13.6 bushels/acre) over the test period. Given the simplicity of the experiment, these results
compare favorably with other recent spatial simulation/validation
exercises at the state level.
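The validation metrics reported above (time-series correlation, RMSE as a fraction of the observed mean, and the R2 of an unconstrained linear fit) can be computed for any aggregation level with a few lines of NumPy; a sketch, not the actual pSIMS analysis code, follows.

import numpy as np

def validation_metrics(sim, obs):
    """Pearson correlation, RMSE as a fraction of the observed mean, and the R2 of
    an unconstrained linear fit, for paired simulated/observed yield series."""
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    corr = np.corrcoef(sim, obs)[0, 1]
    rmse_frac = np.sqrt(np.mean((sim - obs) ** 2)) / obs.mean()
    slope, intercept = np.polyfit(sim, obs, 1)
    r2 = 1.0 - np.var(obs - (slope * sim + intercept)) / np.var(obs)
    return corr, rmse_frac, r2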
Figure 4: Simulation outputs of a single pDSSAT campaign. Maps of
the median (top) and standard deviation (bottom) of simulated
potential rainfed/high-input maize yield.
4. USE OF Swift PARALLEL SCRIPTING
Each pSIMS model run requires the execution of 10,000 to
120,000 fairly small serial jobs, one per geographic grid cell, and
each using one CPU core for a few seconds to a few minutes. The
Swift [29] parallel scripting language has made it straightforward
to write and execute pSIMS runs, using highly portable and
system-independent scripts such as the example in Listing 1. Space
does not permit a detailed description of this program, but in brief,
it defines a machine-independent interface to the DSSAT
executable, loads a list of grids on which that executable is to be
run, and then defines a set of DSSAT tasks, one per grid.
The Swift language is implicitly parallel, high-level, and
functional. It automates the difficult and science-distracting tasks
of distributing tasks and data across multiple remote systems, and
of retrying failing tasks and restarting failing workflow runs. Its
runtime automatically manages the execution of tens of thousands
of small single-core or multi-core jobs, and dynamically packs
those jobs tightly onto multiple nodes of the allocated computing
resources to maximize system utilization. Swift automates the
acquisition of nodes; inter-job data dependencies; throttling;
scheduling and dispatch of work to cores; and retry of failing jobs.
Swift is used in many science domains, including earth systems,
energy economics, theoretical chemistry, protein science and
graph analysis. It is simple to install and use, enabling new classes of users to become productive in a remarkably short time.
Figure 5: Time-series correlations between county level aggregates of
simulated maize yields and NASS statistics from 1980-2009 for
pDSSAT simulations using CFSR temperatures, CPC precipitation,
and SRB solar radiation. Only counties for which NASS records a minimum of six years of data, with an average of more than 500 cultivated maize hectares per year, are shown.
The key to Swift’s ability to execute large numbers of small tasks
efficiently on large parallel computers is its use of a two-level
scheduling strategy. Internally, Swift launches a pilot job called a
“coaster” on each node within a resource pool [12]. The Swift
runtime then manages the dispatch of application invocations,
plus any data that these tasks require, to those coasters, which
manage their execution on compute nodes. As tasks finish, Swift
schedules more work to those nodes, achieving high CPU
utilization even for fine-grained workloads. Individual tasks can
be serial, OpenMP, MPI, or other parallel applications.
Swift makes computing location independent, allowing us to run
pSIMS on a variety of grids, supercomputers, clouds, and clusters,
with the same pSIMS script used on multiple distributed sites and
diverse computing resources. Figure 7 shows a typical execution
scenario, in which pSIMS is run across two University of Chicago
campus resources: the UChicago Campus Computing Cooperative
(UC3) [6] and the UChicago Research Computing Center cluster
“Midway.” Over the past two years the RDCEP team has
performed a large amount of impact modeling using this
framework. In the past 10 months alone, more than 80 pSIMS
impact runs have been performed, totaling over 5.6 million
DSSAT runs and a growing number of runs with other models such as CenW: see Table 1.
Figure 6: Simulated county-level (top) and state-level (bottom)
average corn yields plotted against USDA survey data. Over 1980-
2009, there are 60423 data points from 2373 counties and 1225 data points from 41 states. The red line is y = x, with an RMSE between
the simulated and observed yields of 26 bu/A (25% of sample mean)
at the county level and 18.6 bu/A (15.7% of the sample mean) at the
state level. The black line is the best-fit linear model with R2 = 0.60 at
county and 0.73 at state level.
We have run pSIMS on the UC3 Campus Computing Cooperative
[6], Open Science Grid [22], and UChicago Midway research
cluster; the XSEDE clusters Ranger and its successor Stampede;
and, in the past, on Beagle, PADS, and Raven. Through Swift’s
interface to the Nimbus cloudinit.d [5], we have run on the NSF
FutureGrid and can readily run on commercial cloud services.
Figure 7: Typical Swift configuration for pSIMS execution.
Listing 1: Swift script for the pSIMS ensemble simulation study
Table 1: Summary of campaign execution by project, including the total number of jobs in each campaign, the total number of simulation units (jobs x scenarios x years), the total model CPU time, and the total size of the outputs generated.
Project            Campaigns   Sim Units (Billion)   CPU Hours (K)   Jobs (M)   Output data (TBytes)
NARCCAP USA            16              1.3                 13           1.9            0.47
ISI-MIP Global         80             11.8                216           4.38           4.14
Prediction 2012         2              0.2                  2           0.24           0.5
Qualitatively, our experience with the pSIMS framework and
Swift has been a striking confirmation that portable scripts can indeed enable the use of diverse and distributed HPC
resources for many-task computing by scientists with no prior
experience in HPC and with little time to learn and apply such
skills. After the scripts were developed and solidified initially by
developers on the Swift team, execution and extension of the
framework was managed by the project science team, who had no
prior experience with Swift and only modest prior experience in
shell and Perl scripting.
Whether due to bugs, human error, machine interruption, or other
issues, science workflows must often be restarted or re-run. Swift
enables a simple and intuitive re-execution mechanism that picks up interrupted simulations where they left off and completes the simulation campaign. pSIMS has proven its value in enabling entire projects (millions of jobs) to be re-executed when application code errors were discovered.
// 1) Define RunDSSAT function as an interface to the DSSAT program
// a) Specify function prototype
app (file tar_out, file part_out)
RunDSSAT (string x_file,
file scenario[],
file weather[],
file common[],
file exe[],
file perl[],
file wrapper)
// b) Specify how to invoke the DSSAT executable
{
bash "-c"
@strcats(" ","chmod +x ./RunDSSAT.sh ; ./RunDSSAT.sh ",
x_file,
@tar_out, @arg("years"), @arg("num_scenarios"),
@arg("variables"), @arg("crop"),
@arg("executable"), @scenarios, @weather,
@comdata, @exe, @perl);
}
// 2) Read the list of grids over which computation is to be performed
string gridLists[] = readData("gridList.txt");
string xfile = @arg("xfile");
// 3) Invoke RunDSSAT once for each grid; different calls are concurrent
foreach g,i in gridLists {
// a) Map the various input and output files to Swift variables
file tar_output <single_file_mapper; file=@strcat("output/",
gridLists[i], "output.tar.gz")>;
file part_output <single_file_mapper; file=@strcat("parts/",
gridLists[i], ".part")>;
file scenario[] <filesys_mapper;
location=@strcat(@arg("scenarios"), "/",
gridLists[i]), pattern="*">;
file weather[] <filesys_mapper;
location=@strcat(@arg("weather"), "/",
gridLists[i]), pattern="*">;
file common[] <filesys_mapper; location=@arg("refdata"),
pattern="*">;
file exe[] <filesys_mapper; location=@arg("bindata"),
pattern="*.EXE">;
file perl[] <filesys_mapper; location=@arg("bindata"),
pattern="*.pl">;
file wrapper <single_file_mapper; file="RunDSSAT.sh">;
// b) Call RunDSSAT
(tar_output, part_output) = RunDSSAT(xfile, scenario, weather,
common, exe, perl, wrapper);
}
Swift enables us to take advantage of new systems and systems
with idle opportunities (e.g., over holidays). We recently
demonstrated such powerful multi-system runs using a similar
geospatially gridded application for analyzing MODIS land use
data. The user was able to dispatch 3,000 jobs from a single
submit host (on Midway) to five different computer systems
(UC3, Midway, Beagle, MWT2, and OSG). The user initially
validated correct execution by processing 10 files on a basic login
host, and then scaled up to a 3,000-file dataset, changing only the
dataset name and a site-specification list to get to the resources.
Swift’s transparent multi-site capability expanded the scope of their computations from one node to hundreds or thousands of cores. (About 500 cores were used for most production runs, and tests of full-scale runs were executed at up to 4,096 cores on the XSEDE "Ranger" system.) The user did not need to look at what sites were busy or adjust arcane scripts to get to these resources.
The throughput achieved by Swift varies according to the nature
of the system on which it is run. For example, Figure 8 shows an
execution profile for a heavily loaded University of Chicago UC3
system. The “load” (number of concurrently running tasks) varies
greatly over time, due to other users’ activity. At the other end of
the scale, on systems such as the (former) TACC Ranger and its
successor Stampede, where local login hosts are not suited for
running even modest multi-core jobs such as the Swift Java client,
Swift provides the flexibility of running both the client and all its
workers within a single multi-node HPC job. The performance of
such a run is shown in Figure 9 below. This performance profile
reflects the absence of competition for cluster resources at the
time of execution. Because the allocated partition was almost
completely idle, Swift was able to quickly ramp up the system and
keep it fully utilized for a significant portion of the run. Similar
flexibility exists to run Swift from an interactive compute node
and to send application invocation tasks to the compute nodes
composed from one or more separate HPC jobs.
Swift has also been instrumental in the preparation of input data
for our DSSAT simulation campaigns. In the case of the NARR
data it was necessary to process hundreds of thousands of six-hour
weather reports in individual gridded binary (GRB) files. Swift
managed the format conversion, spatial resampling, aggregation
to daily statistics, and collation into annual netCDF files as precursors to the individual point-based time series used in the simulation. By exposing generic command-line utilities for manipulating the various file types as Swift functions, it is possible to express the required transformations in an intuitive fashion.
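As a rough Python equivalent of the six-hourly-to-daily reduction step (the pipeline itself uses command-line utilities driven by Swift), one could write the following; the file pattern, the GRIB-capable backend, and the variable name t2m are assumptions about the source files, not the actual pSIMS configuration.

import xarray as xr

# Assumed file pattern, backend, and variable name for illustration only.
sixhourly = xr.open_mfdataset("narr_t2m_1980_*.grb", engine="cfgrib",
                              combine="by_coords")

daily = xr.Dataset({
    "tmax": sixhourly["t2m"].resample(time="1D").max() - 273.15,   # K -> degrees C
    "tmin": sixhourly["t2m"].resample(time="1D").min() - 273.15,
})
daily.to_netcdf("narr_t2m_daily_1980.nc")   # annual file, later split into per-site series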
5. FUTURE OBJECTIVES
Ongoing efforts related to the development of this framework fall into two general themes: facilitating the application of other biophysical crop production models and enabling the management of simulation campaigns through web-based portals.
The process of transforming a new weather input dataset is a
painstaking one that has been difficult to generalize so far, so our
approach has been to decouple the data preparation phase of the
processing pipeline from the simulation and summarizing phases.
The point of contact between these activities is the .psims.nc4
format. As an intermediate representation of the data that drives our simulations, it gives us a target for implementing translations both from upstream data archives and into the formats expected by the various crop modelling packages.
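To indicate the shape of such a translation, the sketch below reads one .psims.nc4 site file and writes a simplified DSSAT-style daily weather table. The group and variable names, time origin, and column layout are illustrative only; the actual pSIMS translators handle each model's exact input format.

from datetime import date, timedelta
from netCDF4 import Dataset

def psims_to_dssat_weather(psims_file, wth_file, origin=date(1979, 1, 1)):
    """Write a simplified DSSAT-style daily weather table from one .psims.nc4 file.
    Group/variable names and the time origin are assumptions for this sketch."""
    nc = Dataset(psims_file)
    grp = nc.groups["narr"]
    days = nc.variables["time"][:].astype(int)        # assumed "days since origin"
    with open(wth_file, "w") as out:
        out.write("@DATE  SRAD  TMAX  TMIN  RAIN\n")
        for i, d in enumerate(days):
            day = origin + timedelta(days=int(d))
            out.write("%02d%03d %5.1f %5.1f %5.1f %5.1f\n" % (
                day.year % 100, day.timetuple().tm_yday,
                grp.variables["srad"][0, 0, i], grp.variables["tmax"][0, 0, i],
                grp.variables["tmin"][0, 0, i], grp.variables["precip"][0, 0, i]))
    nc.close()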
Figure 8: Task execution progress for the ~120,000-task simulation
campaign described in Section 3. This particular run was performed
on the University of Chicago’s UC3 system. Time is in seconds.
Figure 9: Results from a large scale pSIMS run on XSEDE’s Ranger
supercomputer. Using 256 nodes (4096 cores), the full run completes
in 16 minutes.
We are integrating pSIMS into two different web-based eScience
portals to improve usability and expand access to this powerful
tool. The Framework to Advance Climate, Economics, and
Impacts Investigations with Information Technology (FACE-IT)
is a prototype eScience portal and workflow engine based on
Galaxy [11]. As a first step towards integration of pSIMS with
FACE-IT, we have prototyped a pSIMS workflow in Galaxy: see
Figure 10. Working with researchers from iPlant and TACC, we
prototyped pSIMS as an application in the iPlant Discovery
Environment [18]: see Figure 11. The latter application has been
run on Stampede and Ranger.
6. DISCUSSION
The parallel System for Integrating Impact Models and Sectors
(pSIMS) is a new framework for efficient implementation of
large-scale assessments of climate vulnerabilities, impacts, and
adaptations across multiple sectors and at unprecedented scales.
pSIMS delivers two advances in software and applications for
climate impact modeling: 1) a high-performance data ingest and
processing pipeline that generates a standardized and evolving
input dataset of socio-economic, environmental, and climate
datasets based on a portable and efficient new data format; and 2)
a code base to enable large-scale simulations at high resolution of
the impacts of a changing climate on primary production
(agriculture, livestock, and forestry) using arbitrary point-based
climate impact models.
Figure 10: Prototype Galaxy-based FACE-IT portal
Figure 11: Prototype pSIMS integration in iPlant.
Both advances are enabled by high-performance computing
leveraged with the Swift parallel scripting language. pSIMS also
contains tools for data translation of both the input and output
formats from various models; specifications for integrating
translators developed in AgMIP; and tools for aggregation and
scaling of the resulting simulation outputs to arbitrary spatial and
temporal scales relevant for decision-support, validation, and
downstream model coupling. This framework has been used for
high-resolution crop yield and climate impact assessments at the
US [10] and global levels [9, 25].
ACKNOWLEDGMENTS
We thank Pierre Riteau and Kate Keahey for help running pSIMS
on cloud resources with Nimbus; Stephen Welch of AgMIP, Dan
Stanzione of iPlant and the Texas Advanced Computing Center
(TACC), and John Fonner and Matthew Vaughn of TACC for
help executing Swift workflows on Ranger and Stampede, and
under the iPlant portal; and Ravi Madduri of Argonne and
UChicago for help placing pSIMS under the Galaxy portal. This
work was supported in part by the National Science Foundation
under grants SBE-0951576 and GEO-1215910. Swift is supported
in part by NSF grant OCI-1148443. Computing for this project is
provided by a number of sources including the XSEDE Stampede
machine at TACC, University of Chicago Computing
Cooperative, the University of Chicago Research Computing
Center, and through the NIH with resources provided by the
Computation Institute and the Biological Sciences Division of the
University of Chicago and Argonne National Laboratory under
grant S10 RR029030-01. NASA SRB data was obtained from the
NASA Langley Research Center Atmospheric Sciences Data
Center NASA/GEWEX SRB Project. CPC US Unified
Precipitation data obtained from the NOAA/OAR/ESRL PSD,
Boulder, Colorado, USA [2].
REFERENCES
1. AgMIP source code repository. [Accessed April 1, 2013];
Available from: https://github.com/agmip.
2. Earth System Research Laboratory, Physical Sciences
Division. [Accessed April 1, 2013]; Available from:
http://www.esrl.noaa.gov/psd/.
3. National Agricultural Statistics Service, County Data Release
Schedule. [Accessed April 1, 2013]; Available from:
http://1.usa.gov/bJ4VQ6.
4. Bondeau, A., Smith, P.C., Zaehle, S., Schaphoff, S., Lucht,
W., Cramer, W., Gerten, D., Lotze-Campen, H., Müller, C.,
Reichstein, M. and Smith, B. Modelling the role of agriculture
for the 20th century global terrestrial carbon balance. Global
Change Biology, 13(3):679-706, 2007.
5. Bresnahan, J., Freeman, T., LaBissoniere, D. and Keahey, K.,
Managing Appliance Launches in Infrastructure Clouds.
TG'11: 2011 TeraGrid Conference: Extreme Digital
Discovery, Salt Lake City, UT, USA, 2011.
6. Bryant, L., UC3: A Framework for Cooperative Computing at
the University of Chicago Open Science Grid Computing
Infrastructures Community Workshop,
http://1.usa.gov/ZXum6q, University of California Santa
Cruz, 2012.
7. Deryng, D., Sacks, W.J., Barford, C.C. and Ramankutty, N.
Simulating the effects of climate and agricultural management
practices on global crop yield. Global Biogeochemical Cycles,
25(2), 2011.
8. Eaton, B., Gregory, J., Drach, B., Taylor, K., Hankin, S.,
Caron, J., Signell, R., Bentley, P., Rappa, G., Höck, H.,
Pamment, A. and Juckes, M. NetCDF Climate and Forecast
(CF) Metadata Conventions, version 1.5. Lawrence Livermore
National Laboratory, http://cf-pcmdi.llnl.gov/documents/cf-
conventions/1.5/cf-conventions.html, 2010.
9. Elliott, J., Deryng, D., Muller, C., Frieler, K., Konzmann, M.,
Gerten, D., Glotter, M., Florke, M., Wada, Y., Eisner, S.,
Folberth, C., Foster, I., S. Gosling, Haddeland, I., Khabarov,
N., Ludwig, F., Masaki, Y., Olin, S., Rosenzweig, C., Ruane,
A., Satoh, Y., Schmid, E., Stacke, T., Tang, Q. and Wisser, D.
Constraints and potentials of future irrigation water
availability on global agricultural production under climate
change. Proceedings of the National Academy of Sciences,
Submitted to ISI-MIP Special Issue, 2013.
10. Elliott, J., Glotter, M., Best, N., Boote, K.J., Jones, J.W.,
Hatfield, J.L., Rosenzweig, C., Smith, L.A. and Foster, I.
Predicting Agricultural Impacts of Large-Scale Drought: 2012
and the Case for Better Modeling. RDCEP Working Paper
No. 13-01, http://ssrn.com/abstract=2222269, 2013.
11. Goecks, J., Nekrutenko, A., Taylor, J. and The Galaxy Team
Galaxy: a comprehensive approach for supporting accessible,
reproducible, and transparent computational research in the
life sciences. Genome Biol, 11(8):R86, 2010.
12. Hategan, M., Wozniak, J. and Maheshwari, K., Coasters:
Uniform Resource Provisioning and Access for Clouds and
Grids. Fourth IEEE International Conference on Utility and
Cloud Computing (UCC '11), Washington, DC, USA, 2011,
IEEE Computer Society, 114-121.
13. Higgins, R.W. Improved US Precipitation Quality Control
System and Analysis. NCEP/Climate Prediction Center
ATLAS No. 6 (in preparation), 2000.
14. Izaurralde, R.C., Williams, J.R., McGill, W.B., Rosenberg,
N.J. and Jakas, M.C.Q. Simulating soil C dynamics with
EPIC: Model description and testing against long-term data.
Ecological Modelling, 192(3–4):362-384, 2006.
15. Jones, J.W., Hoogenboom, G., Porter, C.H., Boote, K.J.,
Batchelor, W.D., Hunt, L.A., Wilkens, P.W., Singh, U.,
Gijsman, A.J. and Ritchie, J.T. The DSSAT cropping system
model. European Journal of Agronomy, 18(3–4):235-265,
2003.
16. Kirschbaum, M.U.F. CenW, a forest growth model with
linked carbon, energy, nutrient and water cycles. Ecological
Modelling, 118(1):17-59, 1999.
17. Leemans, R. and Solomon, A.M. Modeling the potential
change in yield and distribution of the earth's crops under a
warmed climate. Environmental Protection Agency, Corvallis,
OR (United States). Environmental Research Lab., 1993.
18. Lenards, A., Merchant, N. and Stanzione, D., Building an
environment to facilitate discoveries for plant sciences. 2011
ACM Workshop on Gateway Computing Environments (GCE
'11), 2011, ACM, 51-58.
19. Liu, J., Williams, J.R., Zehnder, A.J.B. and Yang, H. GEPIC –
modelling wheat yield and crop water productivity with high
resolution on a global scale. Agricultural Systems, 94(2):478-
493, 2007.
20. Malik, T., Best, N., Elliott, J., Madduri, R. and Foster, I.,
Improving the efficiency of subset queries on raster images.
HPDGIS '11: Second International Workshop on High
Performance and Distributed Geographic Information
Systems, Chicago, Illinois, USA, 2011, ACM, 34-37.
21. Paskin, N. Digital Object Identifiers for scientific data. Data
Science Journal, 4:12-20, 2005.
22. Pordes, R., Petravick, D., Kramer, B., Olson, D., Livny, M.,
Roy, A., Avery, P., Blackburn, K., Wenaus, T., Würthwein,
F., Foster, I., Gardner, R., Wilde, M., Blatecky, A., McGee, J.
and Quick, R., The Open Science Grid. Scientific Discovery
through Advanced Computing (SciDAC) Conference, 2007.
23. Probert, M.E., Dimes, J.P., Keating, B.A., Dalal, R.C. and
Strong, W.M. APSIM's water and nitrogen modules and
simulation of the dynamics of water and nitrogen in fallow
systems. Agricultural Systems, 56(1):1-28, 1998.
24. Rosenzweig, C., Jones, J.W., Hatfield, J.L., Ruane, A.C.,
Boote, K.J., Thorburn, P., Antle, J.M., Nelson, G.C., Porter,
C., Janssen, S., Asseng, S., Basso, B., Ewert, F., Wallach, D.,
Baigorria, G. and Winter, J.M. The Agricultural Model
Intercomparison and Improvement Project (AgMIP):
Protocols and pilot studies. Agricultural and Forest
Meteorology, 170(0):166-182, 2013.
25. Rosenzweig, C. and others Assessing agricultural risks of
climate change in the 21st century in a global gridded crop
model intercomparison. Proceedings of the National Academy
of Sciences, Submitted to ISI-MIP Special Issue, 2013.
26. Rutledge, G.K., Alpert, J. and Ebisuzaki, W. NOMADS: A
Climate and Weather Model Archive at the National Oceanic
and Atmospheric Administration. Bulletin of the American
Meteorological Society, 87(3):327-341, 2006.
27. Stackhouse, P.W., Gupta, S.K., Cox, S.J., Mikovitz, J.C.,
Zhang, T. and Hinkelman, L.M. The NASA/GEWEX Surface
Radiation Budget Release 3.0: 24.5-Year Dataset. GEWEX
News, 21(1):10-12, 2011.
28. Waha, K., van Bussel, L.G.J., Müller, C. and Bondeau, A.
Climate-driven simulation of global crop sowing dates.
Global Ecology and Biogeography, 21(2):247-259, 2012.
29. Wilde, M., Foster, I., Iskra, K., Beckman, P., Zhang, Z.,
Espinosa, A., Hategan, M., Clifford, B. and Raicu, I. Parallel
Scripting for Applications at the Petascale and Beyond. IEEE
Computer, 42(11):50-60, 2009.
30. Williams, D.N., Ananthakrishnan, R., Bernholdt, D.E.,
Bharathi, S., Brown, D., Chen, M., Chervenak, A.L.,
Cinquini, L., Drach, R., Foster, I.T., Fox, P., Fraser, D.,
Garcia, J., Hankin, S., Jones, P., Middleton, D.E., Schwidder,
J., Schweitzer, R., Schuler, R., Shoshani, A., Siebenlist, F.,
Sim, A., Strand, W.G., Su, M. and Wilhelmi, N. The Earth
System Grid: Enabling Access to Multi-Model Climate
Simulation Data. Bulletin of the American Meteorological
Society, 90(2):195-205, 2009.