Quality Assurance Project Plan
Project 13‐TN2
Development of an IDL‐based geospatial data processing framework for
meteorology and air quality modeling
Daniel Tong, Cooperative Institute for Climate and Satellites, University of Maryland, College Park, Maryland
Hyun Cheol Kim and Fantine Ngan, Cooperative Institute for Climate and Satellites, University of Maryland, College Park, Maryland
and Pius Lee, Air Resources Laboratory, NOAA, College Park, Maryland
Summary of Project
QAPP Category Number: III
Type of Project: Research or Development (Modeling)
QAPP Requirements: This QAPP includes descriptions of the project and
objectives; organization and responsibilities; geospatial data processing
framework; computational algorithms; GIS and satellite data; quality metrics;
reporting; and references.
QA Requirements: Audits of Data Quality: Cat III = 10% Required
Report of QA Findings: In final report
Distribution List
Gary McGaughey, Project Manager, Texas Air Quality Research Program
Cyril Durrenberger, Quality Assurance Project Plan Officer, Texas Air Quality Research Program
Bright Dornblaser, Project Liaison, Texas Commission on Environmental Quality
Chris Owen, Quality Assurance Project Plan Officer, Texas Commission on Environmental Quality
Maria Stanzione, Program Grant Manager, Texas Air Quality Research Program
1. PROJECT DESCRIPTION AND OBJECTIVES
1.1 Problem Statement
Fast and accurate handling of Geographic Information System (GIS) data and satellite data is
essential in regional meteorological and chemical modeling and for data analysis. Accurate land
use information is particularly important in meteorological simulations for land surface
exchanges. It is also crucial in air quality simulations for spatial allocation of emission sources.
There has been increasing demand for geospatial data processing tools as finer resolution air
quality simulations become more commonplace.
The Texas Commission on Environmental Quality (TCEQ) has been considering utilizing fine
resolution land use land cover (LULC) data and satellite‐based remote sensing data for
meteorology and air quality simulations. Under contract to TCEQ, Byun et al. (2007, 2008)
have incorporated a high resolution LULC dataset from the University of Texas Center for Space
Research (UTCSR) (Wells, 2006) and a Texas Forest Service (TFS) dataset (Cheng and Byun, 2008)
into the Fifth-Generation Penn State/NCAR Mesoscale Model (MM5) (Grell et al., 1994).
A similar approach using fine resolution LULC data (30-m Texas LULC generated by Texas A&M)
has also been applied to the Weather Research and Forecasting (WRF) model (Skamarock et al.,
2008), the successor to MM5 (Byun et al., 2011). Satellite data are also an important source of
model inputs, and they form one of the bases for model performance evaluation. Precedents for
such applications are evident in several previous TCEQ projects, e.g., the implementations of
Sea Surface Temperature (SST) (Byun et al., 2008) and soil moisture (Byun et al., 2011).
Many of these previous approaches were successful. However, there was some concern
regarding the general geospatial data processing procedures they used: each focused on
developing a project-specific data processing tool, and there was no integrated effort to
handle a variety of GIS and/or satellite data sets in a unified framework. Such a framework
would facilitate speedy incorporation of new data formats. As finer resolution data have
become increasingly available, current data processing tools are too slow or run into memory
problems when processing very large data sets. For example, the current National Land Cover
Database (NLCD, http://www.mrlc.gov/index.php) provides 30-m LULC data covering the
Contiguous United States (CONUS); at 161,190 x 104,424 (roughly 16.8 billion) pixels, it
requires a large computer with long run times and substantial memory to process. Usually, it is
impossible to load the whole data set into computer memory, so random access and extraction
of local subsets are fundamental capabilities of a successful data processing tool. The CONUS
road network data in vectorized polyline format is another example; it typically contains more
than a million entities. Efficient tools to handle such data are therefore necessary.
1.2 Project Objectives
This project investigates basic computational algorithms to handle GIS data and satellite data. It
develops a set of generalized libraries within a geospatial data processing framework aiming for
more efficient and accurate processing of geospatial data. We will utilize the Interactive Data
Language (IDL), by EXELIS Visual Information Solutions, to build geospatial processing libraries.
An IDL-based Geospatial Data Processor (IGDP) has been created by the Air Resources
Laboratory, National Oceanic and Atmospheric Administration (ARL/NOAA). It can process GIS
data in both vector formats (e.g., ESRI shapefiles (.SHP)) and raster formats (e.g., Geo Tagged
Image File Format (GeoTIFF) and ERDAS IMAGINE (.IMG)) for any given domain. Processing
speeds will be improved through selective use of polygon-clipping routines and other
algorithms optimized for particular applications. The raster tool will be developed using a
histogram reverse-indexing method that enables fast access to grouped pixels; it generates
statistics of pixel values within each grid cell with improved speed and enhanced control of
memory usage. The spatial allocation tool, which uses the polygon clipping algorithms, requires
substantial computational power to calculate the fractional weighting between GIS polygons
(and/or polylines) and grid cells. To address both processing speed and computational accuracy,
an efficient polygon/polyline clipping algorithm is crucial, and a key to faster spatial allocation
is to optimize the computational iterations in both the polygon clipping and the map projection
calculations.
The project has the following specific objectives:
1. To conduct a literature search for summarizing and comparing available GIS data
processing algorithms. Advantages and constraints from each algorithm will be
described.
2. To develop an optimized geospatial data processing tool that can handle raster data
formats (e.g., pixels) and vector data formats (polylines and polygons) with improved
processing speed and accuracy, for any given target domain.
3. To collect and to process sample GIS and satellite data. Applications will include a spatial
regridding method on emissions and satellite data, such as the Moderate Resolution
Imaging Spectroradiometer (MODIS) Aerosol Optical Depth (AOD), the Ozone
Monitoring Instrument (OMI), and the Global Ozone Monitoring Experiment
(GOME)‐2 NO2 column data.
4. To perform an engineering test with processed fine resolution LULC data.
5. To draft a final report that documents all work performed in support of the project.
2. ORGANIZATION AND RESPONSIBILITIES
2.1 Personnel and Responsibilities
This project is a collaborative effort among Drs. Daniel Tong, Hyun Cheol Kim, and Fantine Ngan
of the University of Maryland at College Park (UMD) and Dr. Pius Lee of the Air Resources
Laboratory (ARL) of the National Oceanic and Atmospheric Administration (NOAA). Dr. Daniel
Tong, Research Associate Professor in the Cooperative Institute for Climate and Satellites, UMD
at College Park, is the Principal Investigator for the project. Drs. Hyun Cheol Kim and Fantine
Ngan of the Cooperative Institute for Climate and Satellites, UMD at College Park, will serve as
Co-Principal Investigators. Dr. Pius Lee, ARL National Air Quality Forecasting Capability Project
Leader, will also serve as a Co-Principal Investigator for the project. Project participants and
their responsibilities are provided in Table 1 below. Drs. Daniel Tong and Pius Lee will have
overall oversight of quality assurance.
Table 1. Project participants and their affiliations and key responsibilities.
Participant (Organization): Key Responsibilities

Daniel Tong (UMD): Principal Investigator with overall responsibility for preparation of the
emissions test runs, including the quality assurance and quality control activities.

Hyun Cheol Kim (UMD): Co-Principal Investigator with overall responsibility for the GIS data
processing algorithm review; development of the raster and vector data processing tools;
development of the spatial data regridding method and the satellite data processing; and
documentation and training for the newly developed tools, including quality assurance and
quality control activities.

Fantine Ngan (UMD): Co-Principal Investigator with overall responsibility for model simulation
and result evaluation, including quality assurance and quality control activities.

Pius Lee (ARL/NOAA): Co-Principal Investigator with overall responsibility for quality assurance
and quality control activities.

Gary McGaughey (University of Texas at Austin): AQRP Project Manager who oversees that the
grantees achieve satisfactory completion of the project.
2.2 Schedule
The schedule for specific tasks is listed in Table 2.
Table 2. Schedule of project activities.
ID Task 2/13 3/13 4/13 5/13 6/13 8/13 9/13 10/13 11/13
1 Literature Review for basic algorithms X X X
2 Collection of GIS and satellite data X X
3 Development of raster tool X X X X
4 Development of vector tool X X X X
5 Applications (e.g. spatial regridding of emissions and satellite data) X X X X
6 Documentation and training (manuals on IDL library and packages) X X X X X X
7 Test run (engineering test) X X X
8 Reporting X X X X X X X X X
3. SCIENTIFIC APPROACH: Computational algorithms and data
Appropriate handling of GIS data is crucial in air quality simulations, especially in preparation of
emissions data. However, tools for GIS data have scarcely been developed in the air quality
scientific community. In most cases, current solutions for GIS data processing include the
PC-based ArcGIS tools and/or the U.S. EPA Spatial Allocator. Both tools have clear limitations
for the seamless processing of GIS data: compatibility across platforms; flexibility with various
data formats; and, most importantly, processing speed for fine resolution data. In order to overcome
these limitations, we will design a generalized GIS data processing library and package to
process not only GIS data, but also emissions and satellite data.
Most GIS data come in three types: point (pixel), line (polyline), and area (polygon). Pixel data
are usually called a raster dataset, a point-based dataset without effective area: the value of
each pixel represents the value at the center point of the pixel's footprint. Polyline and/or
polygon data have an effective length or area, and because a polygon is simply a closed
polyline, similar routines can be used for both. In order to convert polygon or pixel data to a
gridded format, we need to know how these irregular polygons overlap each grid cell, and how
many pixels fall inside each grid cell. A polygon clipping algorithm and a pixel grouping
algorithm are therefore key components of the new GIS data processing tool that we will develop.
3.1 Polygon clipping algorithms
Traditionally, polygon clipping algorithms have been used in computer graphics to clip out the portions of a polygon that lie outside the window of the output device, preventing undesirable display effects. More recently, advanced computer graphics has used polygon clipping to render 3D images through hidden-surface removal and to produce high-quality surface detail using techniques such as beam tracing. It is also used to distribute the objects of a scene to appropriate processors in multiprocessor ray-tracing systems to improve rendering speed.
Because clipping an arbitrary polygon against an arbitrary polygon is a basic routine in computer graphics and may be applied thousands of times, the efficiency of these routines is extremely important. To achieve good results, several polygon clipping algorithms have been developed, ranging from simplified algorithms that can clip only regular (e.g., convex) polygons to more complicated algorithms that can handle more general polygons (e.g., concave or self-intersecting polygons). Each algorithm has its own advantages and disadvantages in processing efficiency and flexibility. We describe two such algorithms:
(1) Sutherland-Hodgman algorithm
The Sutherland-Hodgman polygon clipping algorithm (Sutherland and Hodgman, 1974) works by extending each edge of the convex clip polygon in turn and selecting only those vertices of the subject polygon that lie on the visible side. The algorithm
begins with an input list of all vertices in the subject polygon. One side of the clip polygon is extended infinitely in both directions, and the path of the subject polygon is traversed. Vertices from the input list are inserted into an output list if they lie on the visible side of the extended clip polygon line, and new vertices are added to the output list where the subject polygon path crosses the extended clip polygon line. This process is repeated iteratively for each clip polygon side, using the output list from one stage as the input list for the next. Once all sides of the clip polygon have been processed, the final list of vertices defines a new, entirely visible polygon. These steps are shown in Fig. 1. This is a very fast and efficient algorithm, but it applies only when the clip polygon is convex.
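To make the procedure concrete, the following is a minimal IDL sketch of Sutherland-Hodgman clipping. This is our own illustrative code with hypothetical routine names (SH_INSIDE, SH_INTERSECT, SH_CLIP), not the actual IGDP routine; it assumes the clip polygon is convex with vertices ordered counter-clockwise:

FUNCTION SH_INSIDE, p, a, b
  ; TRUE if point p lies on or to the left of the directed clip edge a->b
  ; (clip polygon vertices are assumed ordered counter-clockwise)
  RETURN, (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) GE 0d
END

FUNCTION SH_INTERSECT, p, q, a, b
  ; Intersection of segment p-q with the infinite line through a and b
  ; (assumes p and q lie on opposite sides, so d1-d2 is nonzero)
  d1 = (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0])
  d2 = (b[0]-a[0])*(q[1]-a[1]) - (b[1]-a[1])*(q[0]-a[0])
  RETURN, p + (d1/(d1-d2))*(q-p)
END

FUNCTION SH_CLIP, subject, clip
  ; Clip a subject polygon against a convex clip polygon
  ; (Sutherland and Hodgman, 1974). Both inputs are [2,n] vertex arrays.
  out = subject
  nc  = N_ELEMENTS(clip[0,*])
  FOR i = 0L, nc-1 DO BEGIN                  ; one pass per clip edge
    a   = clip[*,i]
    b   = clip[*,(i+1) MOD nc]
    inp = out                                ; output of last pass feeds the next
    out = !NULL
    ni  = N_ELEMENTS(inp)/2
    IF ni EQ 0 THEN BREAK                    ; subject entirely clipped away
    FOR k = 0L, ni-1 DO BEGIN
      p = inp[*,k]
      q = inp[*,(k+1) MOD ni]
      IF SH_INSIDE(p, a, b) THEN BEGIN
        out = [[out], [p]]                   ; keep visible vertex
        IF ~SH_INSIDE(q, a, b) THEN $
          out = [[out], [SH_INTERSECT(p, q, a, b)]]   ; edge leaves visible side
      ENDIF ELSE IF SH_INSIDE(q, a, b) THEN $
        out = [[out], [SH_INTERSECT(p, q, a, b)]]     ; edge enters visible side
    ENDFOR
  ENDFOR
  RETURN, out
END

Because each pass needs only one inside test and one line intersection per vertex, the cost grows linearly with the number of subject vertices times the number of clip edges, which is why the algorithm is fast when the clip polygon (e.g., a rectangular grid cell) is convex.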
(2) Vatti clipping algorithm
The Vatti clipping algorithm (Vatti, 1992) is a more complicated and generalized polygon clipping algorithm. It allows clipping of any number of arbitrarily shaped subject polygons by any number of arbitrarily shaped clip polygons. Unlike the Sutherland-Hodgman algorithm, the Vatti algorithm does not restrict the types of polygons that can be used as subjects or clips; even complex (e.g., self-intersecting) polygons, and polygons with holes, can be processed. The algorithm supports the Boolean clipping operations "intersection", "difference", "union", and "exclusive or". It is generally applicable only in 2D space.
Compared to the Sutherland-Hodgman algorithm, the Vatti clipping algorithm is a more complete polygon clipping algorithm with more functionality, but its extra features are sometimes beyond our scope and cause an inevitable loss of processing efficiency. Therefore, for the development of the IGDP, we utilize both polygon clipping algorithms, the simple Sutherland-Hodgman algorithm and the complex Vatti algorithm, selecting between them according to the features required so as to optimize the processing time of GIS data.
Figure 1. Steps of Sutherland‐Hodgman polygon clipping algorithm (http://www.cs.helsinki.fi/group/goa/viewing/leikkaus/intro2.html)
3.2 Raster data handling procedures
The raster data processing tool uses the histogram reverse-indexing method of the IDL
HISTOGRAM function and is capable of fast access to grouped pixels. For each grid cell, the
raster tool provides a histogram and statistics of the pixels inside that cell. Figure 2 shows an
example of 30-m NLCD LULC data in a 4-km grid cell near the Houston region. Typically, around
18,000 pixels fall within a 4-km grid cell, and the histogram in the right panel shows the pixel
count distribution over LULC types in a single grid cell.
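As an illustration of this idiom, the following minimal IDL sketch (with hypothetical variable names: lulc is an array of integer LULC class codes and cell_id is a same-sized array mapping each pixel to a target grid-cell index) groups all pixels by grid cell in a single pass and then tabulates the per-cell LULC histogram:

; Minimal sketch of the histogram reverse-indexing idiom.
; Assumed (hypothetical) inputs:
;   lulc    - array of integer LULC class codes (e.g., NLCD)
;   cell_id - same-sized array giving the target grid-cell index of each pixel
n_cells = MAX(cell_id) + 1L

; One pass groups every pixel by its grid cell; REVERSE_INDICES (ri)
; records which pixels fall into each histogram bin.
h = HISTOGRAM(cell_id, MIN=0, MAX=n_cells-1, BINSIZE=1, REVERSE_INDICES=ri)

FOR j = 0L, n_cells-1 DO BEGIN
  IF h[j] EQ 0 THEN CONTINUE                 ; empty grid cell
  pix = ri[ri[j]:ri[j+1]-1]                  ; indices of all pixels in cell j
  lulc_hist = HISTOGRAM(lulc[pix], MIN=0)    ; per-cell LULC class counts
  ; ... compute statistics (dominant class, class fractions, etc.) here
ENDFOR

Because the reverse-index array is built once for the whole raster, each grid cell's pixels are retrieved by simple index slicing rather than by repeatedly scanning the full array, which is what provides the speed and memory control described above.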
Figure 2. 30-m NLCD land use land cover data set near Houston (left), and an example pixel
distribution from a 4-km grid cell (right).
3.3 Application of IGDP ‐ Satellite data regridding
Regridding of model output or satellite data across different map projection settings is very
important for inter-comparisons of modeled results and/or satellite products. Simple
interpolation may generate an approximate result, but it fails to conserve mass. For more
accurate remapping of spatial data, we need to know the exact overlap fractions between the
initial data cells and the target grid cells. The spatial allocator (i.e., the vector tool) of the IGDP
can provide these exact fractions using the polygon clipping algorithm. Figure 3 shows an
example of this "conservative remapping" method. To regrid 4-km data onto 12-km grid cells
exactly, we need to know the overlapping fraction of each original cell with the target cell.
Using the spatial allocator from the IGDP vector tool, one can perform regridding calculations
with the necessary accuracy.
Figure 3. Example of "conservative spatial regridding". In order to calculate the exact total
value in a 12-km cell (j), we need to know the overlapping fraction of every 4-km cell (i), and
then sum the weighted contributions.
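Written out, the conservative remapping of Figure 3 is an overlap-weighted sum. Using $f_{ij}$ for the fraction of 4-km cell $i$'s area that falls inside 12-km cell $j$ (our illustrative notation, not the report's), an extensive quantity such as an emission total $E$ remaps as

$$E_{j} = \sum_{i} f_{ij}\,E_{i},$$

which conserves the total because $\sum_{j} f_{ij} = 1$ for every source cell fully covered by the target grid, while an intensive quantity such as a concentration $C$ remaps as the area-weighted mean

$$C_{j} = \frac{\sum_{i} f_{ij}\,A_{i}\,C_{i}}{\sum_{i} f_{ij}\,A_{i}},$$

where $A_{i}$ is the area of source cell $i$. The polygon clipping algorithm supplies the exact $f_{ij}$.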
4. QUALITY METRICS
Tool evaluation: The goal of this project is to develop an accurate and efficient tool to process
geospatial and satellite data, which are commonly used for model input and model evaluation.
The performance of the geospatial data processing tool will therefore be evaluated (1) by
comparison with geospatial data output processed using traditional tools, such as ArcGIS, the
EPA Spatial Allocator, and the terrain processor of the WRF Preprocessing System (WPS), and
(2) by estimating the tool's processing speed for varying model domain configurations.
Processed data will be compared with geographical spatial maps: comparing processed water
or ocean fractions against GIS coastline data and/or hydrology GIS data (e.g., rivers and lakes)
is one of the simplest and most effective ways to check the accuracy of the new tool, especially
its ability to handle map projections properly. Fine resolution LULC data, 30-m or finer (if
available), will be processed as model inputs for a fine resolution simulation (i.e., a 4-km
simulation) and compared with the default dataset currently used in the WRF-ARW model.
Spatial areas with significant differences will be identified and described through both
graphical and statistical comparisons. Four spatial graphics will be compared in regions with
complicated coastlines (e.g., the Houston-Galveston region): data processed with the new tool,
data processed with the model terrain processor (e.g., WRF WPS), unprocessed fine resolution
raw data, and coastline GIS data (e.g., shapefile (.SHP)). Statistics of the data processed by the
new tool and by the traditional tool will be compared using the histogram distributions of the
LULC type indices.
Engineering test: An engineering test will be performed for a short time period to ensure the
quality of the newly developed IGDP GIS data processing tool. Meteorology and/or chemistry
model simulations using high resolution LULC, and additional fine resolution data if available,
will be performed, and a general evaluation against observations will be conducted.
Comparisons between old/new or coarse/fine resolution inputs will be presented, but detailed
investigation of any scientific findings is beyond the scope of this project. Table 3 summarizes
the configuration of the WRF model for the meteorology simulation, which follows TCEQ SIP
modeling settings. We will choose domain settings identical to those of previous studies to
minimize model setup effort and to allow the previous model results to serve as the baseline
for this evaluation. Model simulations with old and new inputs will be compared by computing
the mean deviation and root mean square deviation between the two model outputs. We will
also compute statistics, including mean bias (MB) and root mean square error (RMSE), against
surface observational data.
Mean Deviation (MD):

$$MD = \frac{1}{N}\sum_{i=1}^{N}\left(M1_{i} - M2_{i}\right)$$

Root Mean Square Deviation (RMSD):

$$E_{RMSD} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(M1_{i} - M2_{i}\right)^{2}\right]^{1/2}$$

Mean Bias (MB):

$$MB = \frac{1}{N}\sum_{i=1}^{N}\left(M_{i} - O_{i}\right)$$

Root Mean Square Error (RMSE):

$$E_{RMSE} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(M_{i} - O_{i}\right)^{2}\right]^{1/2}$$
where M is the model value, O is the measured value, and N is the number of data points; M1
and M2 denote simulations with the new and old inputs, respectively.
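As a simple illustration, these four statistics reduce to a few array operations in IDL. The sketch below uses hypothetical variable names: m1 and m2 are co-located model values from the new-input and old-input runs, and obs holds the corresponding observations, all 1-D arrays of equal length:

; Minimal sketch of the four evaluation statistics.
; Assumed (hypothetical) inputs: m1, m2 = model values from runs with
; new and old inputs; obs = co-located observations.
n    = N_ELEMENTS(m1)
md   = TOTAL(m1 - m2) / n                    ; Mean Deviation
rmsd = SQRT(TOTAL((m1 - m2)^2) / n)          ; Root Mean Square Deviation
mb   = TOTAL(m1 - obs) / n                   ; Mean Bias
rmse = SQRT(TOTAL((m1 - obs)^2) / n)         ; Root Mean Square Error
PRINT, 'MD=', md, ' RMSD=', rmsd, ' MB=', mb, ' RMSE=', rmse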
Table 3. Model configuration and domain nesting of the WRF model.

Domain name: NA36 | SUS12 | TX04
Resolution: 36 km | 12 km | 4 km
Domain coverage: Continental US | Texas & adjoining states | Eastern Texas
Horizontal grid: 162 x 128 | 174 x 138 | 216 x 288
Initialization: NAM + NCEP daily SST | Run in 2-way nesting | Nest-down of SUS12
Microphysics: WSM5a | WSM6b
Cloud scheme: KFc | None
Radiation scheme: RRTMd for longwave radiation; MM5 (Dudhia)e for shortwave radiation
PBL scheme: YSUf scheme
Land surface model: 5-layer slab modelg
Nudging: 3D grid nudging (no nudging of mass fields within PBL)

a WRF Single-Moment 5-class (Hong et al., 2004). b WRF Single-Moment 6-class (Hong and Lim, 2006). c Kain-
Fritsch scheme (Kain, 2004). d Rapid Radiative Transfer Model scheme (Mlawer et al., 1997). e Dudhia (1989). f
Yonsei University scheme (Hong et al., 2006). g 5-layer soil temperature model (Grell et al., 1994).
5. DATA ANALYSIS, INTERPRETATION, AND MANAGEMENT
In addition to the development of a geospatial data processing tool, three types of data will be
collected and archived for the tool's operational tests and performance evaluations. (1) Various
GIS data will be collected and utilized, in both vector and raster formats: GIS shapefiles for
population, census tracts, road networks, railroads, etc., will be used to evaluate the polygon
and/or polyline clipping capability, and various fine resolution land use land cover datasets will
be used to test the tool's raster data handling capability. (2) Various satellite data, such as
MODIS AOD, OMI/GOME-2 NO2 column data, and/or geostationary Sea Surface Temperature
(SST) data, will be collected and utilized to investigate the spatial regridding capability of the
newly developed geospatial data tool. (3) A short-term engineering model run and its outputs
will be produced and archived. This engineering run is intended to evaluate the basic
performance of the geospatial data processing tool and to generate an example of the model's
input data. The run is not intended to produce a best-effort simulation with scientific meaning,
but it will be evaluated with reasonable methods and discussed with respect to the overall
quality of the model inputs and potential future project topics. Simulation data will be
evaluated by comparison with a base run (e.g., with LULC data processed by a traditional tool)
and with observational data, as described in the previous section.
6. REPORTING
A technical work plan (statement of work, quality assurance project plan, budget, and budget
justification) will be submitted by February 15, 2013. Monthly technical reports will be
prepared and submitted by the 8th of each month, with accompanying financial reports
submitted by the 12th of each month, throughout the duration of the project. The literature
review of basic GIS data processing algorithms (e.g., polygon clipping algorithms) and the
inter-comparison of their performance will be described in detail in the final report. Manuals
for the IDL routine library will also be included in the final report. Engineering test run results
with a simple evaluation will be documented in the final project report. A final technical report
will be submitted by November 30, 2013, preceded by a draft final report on October 21, 2013.
During or after completion of the project, the investigators anticipate the preparation of
conference presentations and manuscripts for submission to appropriate peer‐reviewed
journals in the field. Drs. Daniel Tong and Pius Lee will supervise the completion of all reports,
presentations, and manuscripts, which will be collaborative efforts between the UMD and the
ARL/NOAA team.
7. REFERENCES
Byun, Daewon, S. Kim, F.-Y. Cheng, H.-C. Kim, and F. Ngan, 2007: Improved Modeling Inputs:
Land Use and Sea-Surface Temperature. Final Report, Texas Commission on
Environmental Quality, August 2007, 33 pp.
Byun, Daewon, F. Ngan, F.-Y. Cheng, H.-C. Kim, and S. Kim, 2008: Improvement of MM5 Surface
Characteristics. Final Report, Texas Commission on Environmental Quality, August 2008,
44 pp.
Byun, D. W., F. Ngan, and H. C. Kim, 2011: Improvement of Meteorological Modeling by
Accurate Prediction of Soil Moisture in the Weather Research and Forecasting (WRF)
Model. Final Report, Texas Commission on Environmental Quality, March 2011, 46 pp.
Cheng, F.-Y., and D. W. Byun, 2008: Application of high resolution land use and land cover data
for atmospheric modeling in the Houston-Galveston Metropolitan Area: Part I,
meteorological simulation results. Atmos. Environ., 42, 7795-7810.
Grell, G. A., J. Dudhia, and D. Stauffer, 1994: A description of the fifth-generation Penn
State/NCAR mesoscale model (MM5). NCAR Technical Note NCAR/TN-398+STR.
Nielsen-Gammon, J. W., 2001: Initial modeling of the August 2000 Houston-Galveston ozone
episode. Report to the Technical Analysis Division, Texas Natural Resource Conservation
Commission, December 2001.
Sutherland, I. E., and G. W. Hodgman, 1974: Reentrant polygon clipping. Comm. of the ACM,
17, 32-42, doi:10.1145/360767.360802.
Vatti, B. R., 1992: A generic solution to polygon clipping. Comm. of the ACM, 35(7), 56-63,
doi:10.1145/129902.129906.
Wells, G., 2006: The New Eastern Texas Land Use Land Cover Classification Project. HARC
Project Contract number H-46-T28-2004-T2, UT-Austin Center for Space Research,
Austin, Texas.