A remote sensing and GIS-assisted landscape epidemiology approach to West Nile virus

Sean G. Young a,*, Jason A. Tullis b, Jackson Cothren c

a Department of Geosciences and Center for Advanced Spatial Technologies, 331 JBHT, University of Arkansas, Fayetteville, AR 72701, USA
b Department of Geosciences and Center for Advanced Spatial Technologies, 321 JBHT, University of Arkansas, Fayetteville, AR 72701, USA
c Department of Geosciences and Center for Advanced Spatial Technologies, 304 JBHT, University of Arkansas, Fayetteville, AR 72701, USA

Applied Geography 45 (2013) 241–249

Keywords: West Nile virus; Remote sensing; Landscape epidemiology; Disease prediction; Machine learning

Abstract

Landscape epidemiology, sometimes called spatial epidemiology, is a sub-discipline of medical geography that uses environmental conditions as explanatory variables in the study of disease or other health phenomena. Machine learning is a sub-discipline of artificial intelligence that can be used to create predictive models from large and complex datasets. West Nile virus (WNV) is a relatively new infectious disease in the United States, and has a fairly well-understood transmission cycle that is believed to be highly dependent on environmental conditions. This study takes a remote sensing and GIS approach to the study of WNV, using both landscape epidemiology and machine learning techniques. A combination of remotely sensed and in situ variables is used to predict WNV incidence with a correlation coefficient as high as 0.86. A novel method of mitigating the effect of limited case numbers is also tested and ultimately discarded. A consistent spatial pattern of model errors is identified, indicating the chosen variables are capable of predicting WNV disease risk across most of the United States, but are inadequate in the northern Great Plains region of the US.

© 2013 Elsevier Ltd. All rights reserved.

Abbreviations: WNV, West Nile virus; NLCD, National Land Cover Database; NDVI, Normalized Difference Vegetation Index; CDC, Centers for Disease Control and Prevention; MAUP, Modifiable Areal Unit Problem; NGP, Northern Great Plains.
* Corresponding author. Present address: 316 Jessup Hall, University of Iowa, Iowa City, IA 52242, USA. Tel.: 1 435730 7688.
E-mail addresses: [email protected], [email protected] (S.G. Young), [email protected] (J.A. Tullis), [email protected] (J. Cothren).
0143-6228/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.apgeog.2013.09.022

Introduction

The relationships between the landscape and human health have been recognized since Hippocrates, but it wasn't until the 1900s that scientists began to understand why such relationships existed. Russian parasitologist Evgeny Nikanorovich Pavlovsky coined the term landscape epidemiology to describe the theory that pathogens and the animals that sometimes transmit them are subject to environmental influences, and that if the conditions necessary for them to survive can be identified on the landscape, then the areas at risk of disease can be delineated. Landscape epidemiology is thus inherently geographic in nature and has proven very effective at predicting both spatial and temporal distributions of certain diseases. Vector-borne diseases (transmitted by mosquitoes, ticks, or other animals) are prime candidates for applied landscape epidemiology research.

West Nile virus (WNV) was introduced to North America in 1999. Its transmission cycle is well understood, with birds acting as the primary hosts, and mosquito vectors transmitting the virus to other birds. There are over 160 bird species and at least 36 mosquito species involved in the viral transmission cycle, but the most common culprits include corvids (crows and jays) and Culex mosquitoes (CDC, 2003). Humans and other mammals are considered incidental, or dead-end, infections and are not a part of the transmission cycle since they are generally not capable of passing on the disease (see Fig. 1). There are, however, numerous species of birds and mosquitoes involved in WNV transmission, so identifying suitable environmental conditions across large areas containing multiple species of potential hosts and vectors can be difficult.

Machine learning is a branch of artificial intelligence concerned with computer systems capable of learning, or using logical inductive inference to improve performance (Quinlan, 1986). In geospatial studies, machine learning is sometimes used in place of simple statistical techniques like linear regression in an attempt to better model complex relationships with multiple interacting variables, such as the relationship between disease and the environment. One very common machine learning technique involves the use of hierarchical decision trees to discriminate among classes of objects (Carbonell, Michalski, & Mitchell, 1983). Cubist is an inductive machine learning program created by RuleQuest that develops decision trees for predicting numerical values (RuleQuest Research, 2012). It can also convert those trees into production rules consisting of if-then statements, which are easier to understand and interpret (Jensen, Hodgson, Garcia-Quijano, Im, & Tullis, 2009). Cubist has been successfully incorporated in a number of published studies in order to increase regression performance in a data mining environment (Jensen et al., 2009; Riggins, Tullis, & Stephen, 2009; Walton, 2008). In this study Cubist was used to model the impact of complex environmental variables on WNV disease incidence.
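To make the idea of production rules concrete, the following minimal Python sketch shows how a rule-based regression model of this general kind could be evaluated: each rule pairs an if-then condition with a linear model, and a case is scored by the rules whose conditions it satisfies. This is an illustration only and not Cubist itself. The `Rule` class, the variable names (`precip_mm`, `ndvi`), the thresholds, and the coefficients are all hypothetical, and averaging the predictions of multiple matching rules is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One if-then rule with an attached linear model (hypothetical structure)."""
    condition: Callable[[Dict[str, float]], bool]
    intercept: float
    coefficients: Dict[str, float]

    def linear_prediction(self, row: Dict[str, float]) -> float:
        # intercept + sum of coefficient * variable value
        return self.intercept + sum(
            coef * row[name] for name, coef in self.coefficients.items()
        )


def predict(rules: List[Rule], row: Dict[str, float], default: float) -> float:
    """Average the linear models of all matching rules (assumed behaviour);
    fall back to `default` when no rule matches."""
    matches = [r.linear_prediction(row) for r in rules if r.condition(row)]
    return sum(matches) / len(matches) if matches else default


# Invented rules for illustration; these are NOT rules produced in the study.
rules = [
    Rule(lambda r: r["precip_mm"] <= 500, 1.0, {"precip_mm": 0.01}),
    Rule(lambda r: r["precip_mm"] > 500, 8.0, {"ndvi": -5.0}),
]

dry_county = {"precip_mm": 400.0, "ndvi": 0.5}
wet_county = {"precip_mm": 800.0, "ndvi": 0.4}
dry_prediction = predict(rules, dry_county, default=0.0)
wet_prediction = predict(rules, wet_county, default=0.0)
```

Because each prediction traces back to a named condition and a short linear equation, a model expressed this way can be read and questioned rule by rule, which is the interpretability advantage noted above.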

The primary research question was how well human WNV risk could be modeled for the entire continental United States using the principles of landscape epidemiology within a GIS and remote sensing framework. While a number of studies characterized WNV risk at regional and local scales, national-scale models have been less successful. This study uniquely combined machine learning with landscape epidemiology and national remote sensor derivatives to achieve an improved national-scale model.

Background

While West Nile virus has spread from coast to coast and is now considered endemic (Peterson, Robbins, Restifo, Howell, & Nasci, 2008), there does appear to be significant spatial clustering, as evidenced by extremely high spatial autocorrelation values measured across the continental United States (Young & Jensen, 2012). Furthermore, for reasons not entirely understood, the states and counties with the highest cumulative incidence and most pronounced incidence rates are primarily clustered together in the northern Great Plains (Lindsey, Kuhn, Campbell, & Hayes, 2008). When analyzing clusters of incidence rates, which are normalized by population, the northern Great Plains, as well as southwest Idaho, stand out as persistent hotspots (Sugumaran, Larson, & DeGroote, 2009). These regional variations pose difficulties for national-scale studies such as this one.

Fig. 1. West Nile virus transmission cycle, adapted from CDC (2012).

The CDC has recommended examination of dead American Crows as an effective surveillance strategy for WNV (CDC, 2003; LaDeau, Calder, Doran, & Marra, 2010). Unfortunately such surveillance is expensive and time consuming, and as a result the geographic and temporal coverage of such data is generally quite poor, highlighting the need for other sources of relevant data. Landscape epidemiology suggests that environmental variables can be used as proxies for host and vector data. For more information on the biology and etiology of WNV, the reader is referred to Lindsey et al. (2008), Sejvar et al. (2003), and the CDC's WNV homepage (CDC, 2012).

A number of studies show the effectiveness of using a landscape epidemiology approach with environmental variables, especially when remote sensing is employed. Cooke, Grala, and Wallis (2006) modeled mosquito habitats using environmental variables to predict WNV risk. Their models indicated that 67% of human cases occurred in areas predicted as high-risk (Cooke et al., 2006). Diuk-Wasser, Brown, Andreadis, and Fish (2006) used remotely sensed environmental variables to predict WNV vector abundance in Connecticut (Diuk-Wasser et al., 2006). Swatantran et al. (2012) successfully used remote sensing and machine learning methods to map migratory bird habitats (Swatantran et al., 2012). Researchers looking at WNV infection in Culex mosquitoes in northeast Illinois found increased air temperature was the strongest predictor of increased infection, and that precipitation and temperature alone could explain up to 79% of the spatial variability (Ruiz et al., 2010). One study in Florida found that droughts can actually amplify the disease in a manner similar to that observed for the related St. Louis encephalitis virus (Shaman, Day, & Stieglitz, 2002). The theory is that drought brings avian hosts and vector mosquitoes into close contact as they are forced to cluster around the less-abundant water sources, which facilitates the epizootic cycling and amplification of the arboviruses within these populations (Shaman, Day, & Stieglitz, 2005, p. 134).

Other studies have highlighted the importance of human–environment interactions on WNV distributions. Bowden, Magori, and Drake (2011) analyzed human WNV incidence and land cover across the US and identified regional variations. In Northeastern regions urban land covers were positively associated with WNV disease, while in the Western US agricultural land covers had the strongest positive association. They theorized the regional differences they observed can be explained by behavioral differences between the prominent mosquito vectors for the respective regions (Bowden et al., 2011). Gates and Boston (2009) identified a very strong relationship between irrigation and both human and equine WNV occurrence at the county level over a three-year period. They suggested artificial irrigation increases available mosquito habitat and therefore increases risk of disease transmission. They found that as irrigation rose as a percentage of total land area by only 0.1%, the WNV incidence rate would increase by 50% for humans and 63% for horses (Gates & Boston, 2009). Liu, Weng, and Gaines (2008) similarly found WNV outbreaks in Indianapolis were influenced by percentages of agriculture and water (Liu et al., 2008). These and other studies demonstrate the influence of land cover and land use variables on WNV, and therefore provide a motivation for remote sensing-assisted predictive modeling of WNV risk.

Materials and methods

Study area and period

As of 2012, human disease cases from WNV have been reported for all 48 contiguous United States, and these were chosen as the study area of interest. While some researchers maintain that national-scale models of WNV risk are likely not useful due to the major differences in environmental conditions across the country and differences in primary host and vector species (DeGroote & Sugumaran, 2012), this study sought to identify variables and conditions that could be applied broadly. Without discounting the usefulness of region-specific models, a national-scale model was developed to determine to what extent the machine learning techniques here employed could be applied across the study area without modification. As will be shown, a single regional model was also created and compared against the national model (Fig. 2).

The disease incidence data provided by the CDC is aggregated to the county level for the sake of confidentiality, so the county was the basic spatial unit in this study. While counties introduce arbitrary and variable spatial aggregation during analysis, their use is necessarily common in the field of medical and health geography. For the sake of consistency and simplicity, and to avoid unnecessary data manipulation, counties were used as the basic unit for all of the input data.

Fig. 2. A simplified flowchart of the materials and methods employed.

The first few years of WNV in the US were atypical due to the spatially restricted nature of the virus as a newly emerging pathogen in a new environment. It wasn't until 2002 that WNV was detected west of the states bordering the Mississippi River, but during that year it spread all the way to California and Washington (see Fig. 3). By 2003 it had occurred in 47 of the lower 48 states, so 2003 was selected as the first year from which incidence data was used in model generation in this study. The study period chosen was the six-year period from 2003 to 2008.

Fig. 3. Year of first reported WNV activity by state.

Remote sensor data

A major goal of this study was to identify data sources for use in WNV risk modeling that did not require extensive field work, as dead bird or mosquito collection surveys generally do. Incorporation of remotely sensed data was therefore key in meeting that goal. While striving to capture as much of the pertinent environmental data as possible, and also minimizing information content overlap and multicollinearity, five primary datasets were selected as explanatory variables: Normalized Difference Vegetation Index (NDVI), elevation, land cover, temperature, and precipitation. Preprocessing included spatial aggregation to the county level using zonal statistics and areal tabulation.

NDVI, when used in combination with topographic data, has shown excellent predictive ability of WNV risk (Peterson et al., 2008); it is especially sensitive to variations in vegetation biomass and leaf area index (LAI) in areas with relatively lower biomass, such as grasslands (Jensen, 2007). The NDVI was obtained from the NASA Land Processes Distributed Active Archive Center (LPDAAC) and was derived from imagery collected by the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument aboard the Terra satellite. The MOD13A3 data provided monthly NDVI values (2003–2008) at a spatial resolution of 1 × 1 km in tiles that were mosaicked together to cover the study area.

Elevation data was obtained from the Global Land Cover Facility SRTM (Shuttle Radar Topography Mission) global mosaic, originally collected from Space Shuttle Endeavour in February 2000 and resampled to a 1 × 1 km spatial resolution. Elevation derivatives of slope and aspect, with the latter reclassified to eight directions, were created using the Spatial Analyst tools in Esri's ArcMap 10.1 software.

Land cover data was obtained from the National Land Cover Database 2006 (NLCD2006), maintained by the Multi-Resolution Land Characteristics Consortium (MRLC) and the US Geological Survey (USGS). This data was created using primarily unsupervised classification of Landsat ETM+ imagery at a spatial resolution of 30 × 30 m. The NLCD2006 covers the United States with 16 land cover categories, not counting Alaska-only categories that were not present in the study area.

In situ and ancillary data

Temperature and precipitation data were both obtained from Oregon State University's PRISM (Parameter-elevation Regressions on Independent Slopes Model) Climate Group database. This database uses advanced interpolation techniques that make use of grids underlying the weather station data such as digital elevation models and 30-year climatological averages. Maximum and minimum temperature data for monthly, annual, and 30-year normals were obtained at a spatial resolution of 30 × 30 arcseconds (approximately 709 × 926 m at 40° north latitude) and measured in degrees Celsius. Annual and 30-year normals for precipitation data were also obtained at a resolution of 30 × 30 arcseconds, measured in millimeters.

Human WNV disease data was obtained from the CDC's ArboNET system. ArboNET was created in 2000 to monitor and track WNV and other arboviral (arthropod-borne viral) diseases. Human WNV disease is reported as either neuroinvasive or non-neuroinvasive, with the former being the far more serious disease consisting of meningitis, encephalitis, and acute flaccid paralysis among other possible forms of the disease. Non-neuroinvasive disease is generally termed West Nile Fever, and while still occasionally deadly, is considered a mild form of the disease. Since the same virus is responsible for both forms of disease and everyone is considered at risk for infection, the two forms were combined as total disease cases for analysis in this study. Asymptomatic infections were not included.

Intercensal population estimates, created by the Federal State Cooperative Program for Population Estimates (FSCPE), were obtained from the US Census Bureau. Polygon county GIS data for the US was obtained from Esri, partially derived from Tele Atlas data, and was provided with the ArcGIS 10.1 software.

Software programs and tools

Esri's ArcGIS for Desktop 10.1 was the GIS of choice for this study. The ArcMap program was used extensively during almost all stages of the analysis, from pre-visualization and data exploration to analysis and final map creation.

The Cubist machine learning decision tree from RuleQuest Research was incorporated to build rule-based predictive models. These models, while created using complex decision trees, are expressed as collections of rules, each with an associated multivariate linear model, to maximize interpretability. When data matches a specific rule's condition, the model associated with that rule is used to calculate a predicted value (RuleQuest Research, 2012). Cubist also supports model testing on independent subsets of the data that can then be imported back into GIS software for further visualization and testing. A single-threaded Linux version of Cubist 2.07 is available under the GNU GPL (general public license) free of charge, and this was the version used in this study.

Microsoft Office Excel and LibreOffice Calc were used to calculate simple ratios such as disease incidence rates and to format data tables for use in Cubist. Python 2.7 and the associated IDLE (Integrated DeveLopment Environment) were used extensively during data preprocessing to create and run geoprocessing scripts using the ArcPy site package of tools available with the ArcGIS software.

Analytical techniques

Addressing the small numbers problem

The small numbers problem is probably the most pervasive problem in disease mapping, occurring when the number of cases of disease in an area is small, or when the population of the area is small (Pringle, 1995, p. 343). This can be due to small areas or rare disease or both, but when such small numbers are used to calculate rates the results can be misleading.

The small numbers problem is apparent in the WNV incidence data, both due to the variable sizes of counties and due to the fact that it is still a relatively new disease and is not fully endemic across the contiguous United States. Another reason for the small numbers with respect to WNV is underreporting. It is estimated that only about 1 in 150 infected persons will develop a severe form of the disease (CDC, 2012). Another 20% or so develop relatively mild symptoms generally termed West Nile Fever, or non-neuroinvasive WNV, and the remaining roughly 80% are completely asymptomatic (Mostashari et al., 2001). Non-neuroinvasive symptoms are often so mild that many cases are likely misdiagnosed and unreported every year (Sejvar et al., 2003). Since only about 20% of infections cause sickness, and many of those sicknesses are likely misdiagnosed, the data we have are only a small sample of total infections. In addition, non-neuroinvasive cases are not included in the list of nationally notifiable diseases (indicating which diseases must by law be reported to the CDC, maintained by the Council of State and Territorial Epidemiologists), so while the CDC strongly encourages reporting, it is not required (CDC, 2003; CSTE, 2012). All of this suggests our case counts are merely a sample of total cases, and a small one at that (see Fig. 4).

There are basically three methods of dealing with the small numbers problem. The first is to use spatial smoothing techniques that compute a location's value based on the values of that location's neighbors, reducing spatial variability (Wang, 2006). The second is to aggregate values to larger geographic areas until sufficiently high values are reached, but this approach introduces the challenges associated with the modifiable areal unit problem or MAUP (Wang, Guo, & McLafferty, 2012). The MAUP can be tested by comparing different levels of aggregation, but this dramatically increases the complexity of the model and in turn diminishes the interpretability of results. Unfortunately there is no simple fix for this problem, although data normalization and the consistent use of the same areal units (e.g. counties) can help mitigate its effects (Openshaw, 1984). The final method commonly used to address the small numbers problem is to aggregate values over time (Wang, 2006). While this method is fairly straightforward and does not exacerbate the MAUP, it does imply all explanatory variables should likewise be aggregated over the same temporal range, and it limits the amount of time-series comparisons that can be made from the data.

In this study the small numbers problem was addressed in two ways. First, incidence data was converted to incidence rates, or in other words, normalized by population. All of the data was then temporally aggregated to obtain more reliable rates. NDVI and climate data that are measured monthly were averaged by corresponding months between years, creating a "mean-year" for the study period. Land cover and elevation data remained unmodified due to a lack of reliable change information for those datasets over the study period.

A second approach to mitigating the small numbers problem with regards to WNV was also evaluated. As mentioned above, neuroinvasive disease cases represent only about 1 out of every 150 human infections. While non-neuroinvasive disease is likely significantly underdiagnosed due to its mild symptoms and clinical similarity to other diseases (CDC, 2003), neuroinvasive cases are generally quite severe and it seems reasonable to assume reporting of neuroinvasive cases is much closer to 100% than reporting for non-neuroinvasive disease. Neuroinvasive cases were treated as 150 infections each, resulting in a theoretical 30 non-neuroinvasive cases per neuroinvasive case (following the estimate that 20% of infections result in non-neuroinvasive disease) per year. These estimated values for WNV disease incidence were then used to calculate estimated incidence rates. These two values for WNV incidence (raw/reported and estimated) were both run through Cubist separately so their relative strengths and weaknesses could be compared.
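The arithmetic behind the estimated rates can be sketched in a few lines of Python. The sketch below is one reading of the description above and not the authors' actual spreadsheet formulas: it assumes the estimated disease count is the reported neuroinvasive count plus 30 implied non-neuroinvasive cases per neuroinvasive case, and it assumes rates are expressed per 100,000 population (the scaling factor is not stated in the text). The function names and the example county are hypothetical.

```python
# Ratios taken from the text: 1 neuroinvasive case per ~150 infections,
# and ~20% of infections producing non-neuroinvasive disease.
INFECTIONS_PER_NEUROINVASIVE = 150
NON_NEUROINVASIVE_PER_NEUROINVASIVE = 30  # 20% of 150 infections


def estimated_cases(neuroinvasive_cases: int) -> int:
    """Neuroinvasive cases plus the implied non-neuroinvasive cases.

    Assumption: asymptomatic infections are excluded, as in the study.
    """
    if neuroinvasive_cases < 0:
        raise ValueError("case counts cannot be negative")
    return neuroinvasive_cases * (1 + NON_NEUROINVASIVE_PER_NEUROINVASIVE)


def incidence_rate(cases: float, population: int, per: int = 100_000) -> float:
    """Cases normalized by population; `per` = 100,000 is an assumed scaling."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population


# Hypothetical county: 10,000 residents, 5 reported cases, 2 of them neuroinvasive.
raw_rate = incidence_rate(5, 10_000)
estimated_rate = incidence_rate(estimated_cases(2), 10_000)
```

In this invented example the estimated rate is more than ten times the raw rate, which illustrates how strongly the multiplier can inflate values in counties with small populations.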

Model generation with machine learning

The above-mentioned datasets were aggregated to a common geographic scale (US counties) and compiled into data tables for input into Cubist. The WNV incidence data served as the dependent variable in the equation, with the environmental data serving as independent, or explanatory, variables in a manner conceptually similar to multiple regression. A total of 2484 counties (80% of the 3105 counties in the study area) were randomly selected to be used as training data, with the remaining counties used for validation or testing of the model.
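A random 80/20 split of this kind, together with the three accuracy measures defined in the Model evaluation section (Average |Error|, Relative |Error|, and the correlation coefficient), can be sketched as follows. This is a minimal illustration written in plain Python and is not the procedure Cubist uses internally. The function names and the fixed random seed are hypothetical, and computing the baseline for Relative |Error| from the mean of the evaluated cases is an assumption.

```python
import math
import random
from statistics import mean


def split_counties(county_ids, train_fraction=0.8, seed=0):
    """Randomly assign counties to training and test sets."""
    shuffled = list(county_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def average_error(predicted, actual):
    """Average |Error|: mean absolute difference between predicted and actual."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def relative_error(predicted, actual):
    """Relative |Error|: average error divided by the average error obtained
    by always predicting the mean (assumed here to be the mean of `actual`)."""
    baseline = average_error([mean(actual)] * len(actual), actual)
    return average_error(predicted, actual) / baseline


def pearson_r(predicted, actual):
    """Pearson's r: covariance divided by the product of standard deviations."""
    mp, ma = mean(predicted), mean(actual)
    cov = sum((p - mp) * (a - ma) for p, a in zip(predicted, actual))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_a = sum((a - ma) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_p * var_a)


# 3105 placeholder county identifiers stand in for real FIPS codes.
train_ids, test_ids = split_counties(range(3105))
```

With 3105 counties the split yields 2484 training and 621 test counties, matching the counts reported above. A Relative |Error| of exactly 1 corresponds to a model no better than predicting the mean everywhere.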

Model evaluationCare was taken to maintain strict separation between training

and test data. Cubist reports statistical accuracy measures for each

Fig. 4. Total reported human cases of WNV in the US from 1999 to 2012, compared to

estimated total cases (not counting asymptomatic infections) derived from serosurveyresults by Mostashari et al. (2001) and others.

the future.The observed spatial pattern of model errors, remarkably

consistent across the 30 different models, is intriguing. Perhaps

Table 1Predictive model results; e indicates the use of estimatedWNV rates during modelgeneration.

Run# Average jErrorj

Relative jErrorj

Correlationcoefcient

Most used variable(s)

R1 18.9e20.7 0.35e0.39 0.84e0.86 PrecipitationR1e 185.2e218 0.56e0.66 0.55e0.56 NLCD-41; PrecipitationR2 28.5 1.03 0.34 TMin(Dec)R2e 176.2 0.93 0.13 NDVI(Nov)R3 11.5 0.35 0.8e0.84 PrecipitationR3e 106.8e112.9 0.55e0.58 0.55e0.58 PrecipitationR4 1.6e1.8 0.64e0.73 0.26 NDVI(Dec); PrecipitationR4e 14.3 0.56 0.18e0.22 NLCD-81; NDVI(June)R5 3.1e3.3 0.61e0.66 0.46e0.56 TMin(Dec); ElevationR5e 40.3e42.5 0.77e0.82 0.12 NLCD-41; TMin(Oct)R6 4.2 0.61e0.62 0.58e0.6 NDVI(Dec)R6e 36.6e37.7 0.59e0.61 0.41e0.48 NDVI(Mar)R7 5.2e5.3 0.52e0.53 0.75 TMin(Oct)R7e 36.7e37.2 0.66e0.67 0.28 NDVI(Jan)R8 1.5 0.72 0.29e0.3 TMin(Apr)R8e 13.2e13.3 0.62 0e0.02 NLCD-71NGP 144.6 0.84 0.5 Precipitation; TMax(June, Sep,

Dec, Yr); TMin(May, July)

Geography 45 (2013) 241e249 245model it creates, consisting of Average jErrorj, Relative jErrorj, and aCorrelation Coefcient. The Average jErrorj, or average errormagnitude, is simply the mean absolute difference between thepredicted values and the actual values. This is simple enough tointerpret, as smaller values would indicate less error and thereforea stronger model, although some datasets could contain largeaverage error numbers and still be relatively strong models due tothe nature and distribution of the input data. The Relative jErrorj, orrelative error magnitude, is the ratio of the average error magni-tude divided by the error magnitude that would result from everypredicted value being equal to the mean value. The relative errormagnitude should be less than one for a model to be considereduseful. This provides a more comparable metric across models.Finally, the correlation coefcient reported by Cubist is the Pear-sons product-moment, or Pearsons r, measure of linear depen-dence, measured by dividing the covariance of the predicted andactual values by the product of their standard deviations. Values forthe correlation coefcient will always fall between 1 and 1, withvalues near 1 indicated a near perfect correlation. So long as thedata sampled is sufciently representative of WNV incidence, thiswould indicate the model ts the real world data very well and isin effect a good predictive model.

Cubist also computes predicted WNV disease incidence for eachcounty in the dataset for each model. These predicted incidencevalues were reimported into ArcGIS and joined back to the UScounties vectordatavia FIPScode formapping. Thiswasdone to allowa visual analysis of the models predictive power and regional effec-tiveness. These prediction maps were created using a standard devi-ation classication technique applied to the difference betweenpredicted values and actual values (i.e. the model residuals or errors)to easily display where the model over- or under-predicted disease.Carewas taken toensure theneutral category in thesemaps (theareaswithin one-half positive or negative standard deviation from themean) contained the true zero value, so that reading of the maps asover- and under-predictive was accurate and intuitive.

Results

Over 30 variations of the predictive model were created, withwidely varying results. Correlation coefcients ranged fromessentially 0 to as high as 0.86 depending on the experimentalsetup. Runs 1 and 2 included all 6 years of data, while 3e8 eachused a single years data, from 2003 to 2008 respectively. In eachsetup tested our estimated WNV rates (indicated with an e inTable 1), used in an attempt tomitigate the small numbers problem,resulted in poorer model performance. The variable or variableswith the most predictive ability for each model are also shown.

Strikingly,whenmodel residualsweremappedalmosteverysetupexhibited a similar spatial pattern of errors wherein the eastern USwas well modeled, the northern great plains, counties bordering theRockyMountains, and a region centered on southern Idaho exhibitedrelatively highmodel errors, and thewest coast returned to relativelylow errors. Fig. 5 is a representative example of this pattern, takenspecically from the results of model R1 (see Table 1 for details).

The Northern Great Plains (NGP) region in particular wasconsistently riddled with model errors. Interestingly, a region-specic Cubist model focused only on the NGP resulted in a corre-lation coefcient of 0.5, compared to 0.7 when these same countiesare run through the national model; the national model predictedthe NGP region more accurately than the regional model did.

Discussion

The primary research question in this study was if remotely

S.G. Young et al. / Appliedsensed environmental variables could be used to predict WNVincidence rates with acceptable accuracy across the US. While thequestion was intentionally vague on what would be consideredacceptable accuracy, the results of the R1 model seem to justifythe conclusion that they can. That being said, there were someobvious shortcomings with a number of the models, both temporaland spatial in nature.

Model strength varied widely by year, indicating a pronouncedsensitivity to temporal changes, and a corresponding lack ofconsistent environmental conditions that can be used as reliablepredictors. Said another way, the fact that the yearly models variedso much in effectiveness implies that there is not a single set ofenvironmental conditions that will always indicate WNV diseasepresence. If there were, the Cubist models would have been ex-pected to identify similar rule sets each year, resulting in modelsthat performed equally well from year to year. Further testing isneeded to see if a well-designed model based on many years ofdata, such as the R1 model, can provide accurate predictions into

NGPe 1097 0.99 0.27 TMin MayFig. 5. A map of WNV predictive model residuals, classied by standard deviations.Note the apparent spatial clustering of model errors.

even more interesting than this apparent regional clustering oferrors is the fact that these regions appear to be subject to bothunder- and over-prediction in close proximity. After initial modelevaluation presented in the results above, most of the subsequentanalysis was devoted to attempting to explain these spatialpatterns.

The regions of poor model performance seem closely related tothe regions previously identied by Young and Jensen (2012) andothers as the areas with the most pronounced clustering of diseaseincidence. Fig. 6 shows the Anselin Local Morans I, a spatial clusterand outlier analysis tool that measures spatial autocorrelation(Anselin,1995), of the disease incidence data for the study period (a)compared to the errors or residuals from predictive model R1 (b).

Fig. 6. Anselin Local Moran's I maps; a) showing clustering of WNV Incidence Rates for the entire 6-year study period, and b) showing clustering of residuals from predictive model R1.

Fig. 7 was developed to investigate the effects of sampling bias. Fig. 7a shows the counties selected by Cubist as training cases for the R1 models, and Fig. 7b shows the remaining counties used for testing. Visual inspection confirms that the sampling is markedly unbiased given the constraint of using counties as the basic spatial units. Parts c and d show the same counties as a and b, this time overlaid with the model errors from R1 to demonstrate that the observed spatial pattern of model errors is apparent in both training (c) and test (d) datasets, indicating spatial sampling bias was not a significant factor in producing the observed pattern.

Fig. 7. a) Training Counties, b) Test Counties, and c–d) Training and Test Counties overlaid with model residuals.

It has been suggested that the low population density of the northern Great Plains and many Western counties is responsible for the clustering observed in these areas, as a result of the small numbers problem exaggerating incidence rates. Fig. 8 offers some support for this theory, showing the lowest population counties in lighter shades, which appear to match fairly well with the regions suffering from the most model errors, with the possible exception of the low-population counties in the eastern US.

Fig. 8. Average 2003–08 population at the county level.

The strongest evidence for this theory, that model errors are associated with low-population counties, is presented in Fig. 9, which shows the standard deviations of the model errors for R1 and R1e plotted against county population. Fig. 10 was created to compare the effects of county size on model effectiveness and the impact of the estimated WNV values. The resulting scatterplots show that both models perform admirably across a large range of population and areal values, with the majority of the errors occurring in counties with relatively lower populations and/or smaller areas. This is most apparent in the top scatterplots.

The lower plots in Figs. 9 and 10 present the same data as the top plots, but use a logarithmic scale on the Y axis to help emphasize the differences between R1 and R1e error distributions. In Fig. 9, R1e errors, while they appear to follow the same pattern as R1 of being most common in low-population counties, can also be seen to shift upward, meaning counties with slightly higher populations are more likely to be over- or under-predicted when using the estimated WNV model. This was an unexpected consequence, as the estimated WNV models were created specifically to attempt to mitigate the small numbers problem, but they appear to have in effect expanded it to a small and not-as-small numbers problem. Fig. 10 shows a slightly different pattern, with very small counties being fairly well modeled using raw WNV rates, and slightly less well modeled using estimated rates. Unlike with population, the estimated rates seem to have shifted errors down to smaller counties. Between the consistently lower correlation coefficients and the scatterplots of Figs. 9 and 10, we have concluded that the attempt to estimate WNV incidence rates to mitigate the small numbers problem was not successful.

With some exceptions, the region centered on southern Idaho appeared in many models to be a secondary cluster of model errors. Unlike the NGP region, however, the southern Idaho cluster usually exhibited primarily under-prediction errors at its core. There is debate as to whether or not the agricultural land use of the region is responsible, this being an area well known for very large cattle farms, but what role such land use might play in WNV transmission is unclear. There is likely some set of environmental conditions, or perhaps behavioral conditions among the population, responsible for this pattern that was not included in this study.

It was an assumption of this study (one well supported by the literature) that a national-scale predictive model would invariably perform more poorly for specific regions than smaller region-specific models would, owing to the significant regional variations in environmental conditions across the US. The results of the NGP model showed that assumption to be unfounded in this case. The national model R1, which boasted an overall correlation coefficient of 0.86, dropped to about 0.7 when looking only at the NGP region, while the region-specific NGP model only mustered a 0.5 correlation coefficient. As much as possible, all other variables were held constant, indicating the region-specific NGP model was a worse predictor of its own region than the national R1 model was. The implication seems to be that this region is subject to some unknown confounding variable(s) that the models were not equipped to predict. Further exploration is needed to determine the exact nature of the interference and what variables might be responsible.

Conclusions

This study sought answers to three main research questions: 1) can remotely sensed environmental variables be used to predict WNV incidence rates across the continental US, 2) is a single national model accurate enough, or is regional variation too strong, requiring smaller region-specific models, and 3) can the small numbers problem be mitigated by estimating WNV incidence from neuroinvasive cases to compensate for underreporting?

With correlation coefficients as high as 0.86, the answer to the first research question is yes: the chosen environmental variables of temperature, precipitation, elevation, NDVI, and land cover are related to WNV incidence strongly enough to allow predictive modeling. The machine learning techniques employed were able to identify complex relationships between the data, and in the case of the R1 model explained up to 86% of the observed real-world data.

The second question, of whether or not a national model is appropriate across the highly diverse study area of the continental US, is harder to answer. Again pointing to model R1, it is tempting to conclude that a national model is effective; however, the repeated pattern of model errors clustering spatially in the Northern Great Plains (NGP) region and elsewhere indicates the model is not appropriate for all regions. That said, the follow-up NGP model showed that region-specific models may not in fact produce better results than the national model. Owing to this last finding, it was deemed prudent to conclude that a national model is appropriate, as long as it is used and interpreted with the knowledge of its regional biases and shortcomings.

Third, as to the novel method of mitigating the small numbers problem with WNV data, this study showed clearly that it was not effective. Not only did the estimated WNV models consistently perform much more poorly than their raw WNV counterparts, but their errors were spread over a larger range of county population values, in effect amplifying the small numbers problem instead of mitigating it.

Fig. 9. Scatterplots of model residuals of R1 and R1e vs. county populations.
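To make the small numbers problem and the residuals-versus-population comparison more concrete, the following sketch shows how a single case inflates a rate in a small county, and how the spread of model residuals can be summarized per order of magnitude of county population. All names and numbers are hypothetical illustrations and are not taken from the study's data or from Cubist output.

```python
import math
from collections import defaultdict
from statistics import pstdev


def incidence_per_100k(cases, population):
    """Crude incidence rate per 100,000 residents."""
    return cases / population * 100_000


def residual_spread_by_population(populations, residuals):
    """Standard deviation of residuals, grouped by floor(log10(population))."""
    bins = defaultdict(list)
    for pop, res in zip(populations, residuals):
        bins[math.floor(math.log10(pop))].append(res)
    return {magnitude: pstdev(vals) for magnitude, vals in sorted(bins.items())}


# One reported case yields very different rates depending on county size.
small_rate = incidence_per_100k(1, 1_000)       # tiny county
large_rate = incidence_per_100k(1, 1_000_000)   # large county

# Toy residuals (hypothetical): larger errors in the smallest counties.
pops = [800, 900, 50_000, 60_000, 1_200_000, 2_000_000]
errs = [12.0, -10.0, 2.0, -2.0, 0.5, -0.5]
spread = residual_spread_by_population(pops, errs)
```

In this toy data the spread shrinks as population grows, which is the pattern one would expect if the small numbers problem drives the errors. Binning by powers of ten is just one convenient choice that mirrors a logarithmic population axis.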

Limitations and areas for future study

This study was subject to a number of limitations, most notably the necessity of using counties as the basic study unit and the small numbers problem. It seems reasonable that finer spatial resolution data could yield better results, but it was not available to the authors at the time of this study. The small numbers problem, as discussed previously, is a persistent problem with studies of this nature, and while the mitigation technique here employed was unsuccessful, it is hoped that other techniques might yet be developed to help lessen its impact.

Fig. 10. Scatterplots of model residuals of R1 and R1e vs. county areas in square miles.

It should also be noted that the model residuals exhibited spatial autocorrelation, which can be an indicator of model misspecification. The spatial autocorrelation was, as discussed earlier, primarily limited to the Northern Great Plains region. It seems likely the clustering of errors in that region is due to confounding variables that were not properly modeled by the selected variables.
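One common way to check residuals for spatial autocorrelation is a global Moran's I with a permutation test. The sketch below is a minimal version under stated assumptions: binary contiguity weights, a hypothetical chain of eight units, and made-up residuals. It is not the procedure or software used in this study.

```python
import random
from statistics import fmean


def morans_i(values, neighbors):
    """Global Moran's I with binary weights given as an adjacency dict."""
    n = len(values)
    mean = fmean(values)
    z = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * z_i * z_j over all neighbor pairs
    w_sum = 0     # total weight (number of directed neighbor links)
    for i, nbrs in neighbors.items():
        for j in nbrs:
            cross += z[i] * z[j]
            w_sum += 1
    return (n / w_sum) * (cross / sum(d * d for d in z))


def permutation_p_value(values, neighbors, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of shuffles with I >= observed."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical residuals along a chain of eight units: positive errors
# cluster at one end and negative errors at the other.
residuals = [3.0, 2.5, 2.0, 0.1, -0.1, -2.0, -2.5, -3.0]
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
observed, p_value = permutation_p_value(residuals, chain)
```

A clearly positive statistic with a small pseudo p-value suggests that the errors are not spatially random, which is the kind of signal that can point to an omitted spatially structured variable.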

Other areas for improvement and future research include incorporating spatial information explicitly into the predictive model. It is possible that simply including latitude and longitude coordinates of county centroids might allow the machine learning techniques to identify simple spatial patterns in the data. More complicated methods of including topological relationships (perhaps a county-neighbor weights matrix) might also yield favorable results. It should also be noted that Cubist is only one machine learning program and many other programs and techniques exist, including neural networks, which might be shown to better model complex relationships between the environment and WNV risk.
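The two ideas above, centroid coordinates and a county-neighbor weights matrix, could be turned into model inputs roughly as follows. This is a sketch under stated assumptions: the record layout, the `ndvi` attribute name, the centroid values, and the adjacency are all hypothetical, and the weights are row-standardized so that the derived feature is simply the neighbor average.

```python
def row_standardized_weights(neighbors, n):
    """Build an n x n weights matrix in which each row with neighbors sums to 1."""
    w = [[0.0] * n for _ in range(n)]
    for i, nbrs in neighbors.items():
        for j in nbrs:
            w[i][j] = 1.0 / len(nbrs)
    return w


def add_spatial_features(records, centroids, neighbors, key="ndvi"):
    """Return copies of records with centroid lon/lat and a spatial lag of `key`."""
    n = len(records)
    w = row_standardized_weights(neighbors, n)
    enriched = []
    for i, rec in enumerate(records):
        # Spatial lag: weighted average of the attribute over neighboring units.
        lag = sum(w[i][j] * records[j][key] for j in range(n))
        lon, lat = centroids[i]
        enriched.append({**rec, "lon": lon, "lat": lat, f"{key}_lag": lag})
    return enriched


# Three hypothetical counties in a row, with made-up NDVI values.
records = [{"ndvi": 0.2}, {"ndvi": 0.4}, {"ndvi": 0.8}]
centroids = [(-100.0, 45.0), (-99.0, 45.0), (-98.0, 45.0)]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = add_spatial_features(records, centroids, adjacency)
```

The enriched records could then be passed to any regression learner as additional predictor columns. Whether such features would actually improve the models discussed here is an open question that this sketch does not answer.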

Finally, there is the obvious need to identify the confounding variable(s) at work in the NGP region and in the western US in general. The literature suggests the most likely culprits include artificial irrigation, population dynamics, human land use, and/or perhaps the regionally diverse mosquito vectors and avian hosts which sometimes prefer specific environmental conditions. The next major step in this research would be to incorporate the results into a spatial decision support system (SDSS) for use by researchers and public health officials with an interest in early warning detection of areas at high risk for WNV disease.

References

Anselin, L. (1995). Local indicators of spatial association–LISA. Geographical Analysis, 27(2), 93–115.

Bowden, S. E., Magori, K., & Drake, J. M. (2011). Regional differences in the association between land cover and West Nile virus disease incidence in humans in the United States. American Journal of Tropical Medicine and Hygiene, 84(2), 234–238. http://dx.doi.org/10.4269/ajtmh.2011.10-0134.

Carbonell, J. G., Michalski, R. S., & Mitchell, T. M. (1983). An overview of machine learning. In R. S. Michalski, J. Carbonell, & T. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 3–23). Palo Alto, California: TIOGA Publishing Co. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/similar?doi=10.1.1.18.5035&type=ab.

CDC. (2003). Epidemic/epizootic West Nile virus in the United States: Guidelines for surveillance, prevention, and control (No. 3rd Revision). Fort Collins, CO: Centers for Disease Control and Prevention.

CDC. (2012, October). CDC West Nile virus homepage. Retrieved October 22, 2012, from http://www.cdc.gov/ncidod/dvbid/westnile/index.htm.

Cooke, W. H., Grala, K., & Wallis, R. C. (2006). Avian GIS models signal human risk for West Nile virus in Mississippi. International Journal of Health Geographics, 5(1), 36.

CSTE. (2012). CSTE list of nationally notifiable conditions. Atlanta, GA: Council of State and Territorial Epidemiologists. Retrieved from http://www.cste.org/webpdfs/CSTENotifiableConditionListAugust2012.pdf.

DeGroote, J. P., & Sugumaran, R. (2012). National and regional associations between human West Nile virus incidence and demographic, landscape, and land use conditions in the coterminous United States. Vector-Borne and Zoonotic Diseases, 12(8), 657–665. http://dx.doi.org/10.1089/vbz.2011.0786.

Diuk-Wasser, M. A., Brown, H. E., Andreadis, T. G., & Fish, D. (2006). Modeling the spatial distribution of mosquito vectors for West Nile virus in Connecticut, USA. Vector-Borne and Zoonotic Diseases, 6(3), 283–295.

Gates, M. C., & Boston, R. C. (2009). Irrigation linked to a greater incidence of human and veterinary West Nile virus cases in the United States from 2004 to 2006. Preventive Veterinary Medicine, 89(1/2), 134–137.

Jensen, J. R. (2007). Remote sensing of the environment: An earth resource perspective. Upper Saddle River, NJ: Pearson Prentice Hall.

Jensen, J. R., Hodgson, M. E., Garcia-Quijano, M., Im, J., & Tullis, J. A. (2009). A remote sensing and GIS-assisted spatial decision support system for hazardous waste site monitoring. Photogrammetric Engineering and Remote Sensing, 75(2), 169–177.

LaDeau, S. L., Calder, C. A., Doran, P. J., & Marra, P. P. (2010). West Nile virus impacts in American crow populations are associated with human land use and climate. Ecological Research, 26, 909–916. http://dx.doi.org/10.1007/s11284-010-0725-z.

Lindsey, N. P., Kuhn, S., Campbell, G. L., & Hayes, E. B. (2008). West Nile virus neuroinvasive disease incidence in the United States, 2002–2006. Vector-Borne and Zoonotic Diseases, 8(1), 35–40. http://dx.doi.org/10.1089/vbz.2007.0137.

Liu, H., Weng, Q., & Gaines, D. (2008). Spatio-temporal analysis of the relationship between WNV dissemination and environmental variables in Indianapolis, USA. International Journal of Health Geographics, 7(1), 66.

Mostashari, F., Bunning, M. L., Kitsutani, P. T., Singer, D. A., Nash, D., Cooper, M. J., et al. (2001). Epidemic West Nile encephalitis, New York, 1999: results of a household-based seroepidemiological survey. The Lancet, 358(9278), 261–264.

Openshaw, S. (1984). Number 38, The modifiable areal unit problem. In Concepts and techniques in modern geography (pp. 1–41). Norwich: Geo Books.

Peterson, A. T., Robbins, A., Restifo, R., Howell, J., & Nasci, R. (2008). Predictable ecology and geography of West Nile virus transmission in the central United States. Journal of Vector Ecology, 33(2), 342–352.

Pringle, D. G. (1995). Mapping disease risk estimates based on small numbers: an assessment of empirical Bayes techniques. The Economic and Social Review, 27(4), 341–363.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

Riggins, J. J., Tullis, J. A., & Stephen, F. M. (2009). Per-segment aboveground forest biomass estimation using LIDAR-derived height percentile statistics. GIScience & Remote Sensing, 46(2), 232–248. http://dx.doi.org/10.2747/1548-1603.46.2.232.

Ruiz, M. O., Chaves, L. F., Hamer, G. L., Sun, T., Brown, W. M., Walker, E. D., et al. (2010). Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasites & Vectors, 3(1), 19.

RuleQuest Research. (2012). RuleQuest Research data mining tools. Retrieved September 7, 2012, from http://www.rulequest.com/.

Sejvar, J. J., Haddad, M. B., Tierney, B. C., Campbell, G. L., Marfin, A. A., Van Gerpen, J. A., et al. (2003). Neurologic manifestations and outcome of West Nile virus infection. JAMA: The Journal of the American Medical Association, 290(4), 511–515. http://dx.doi.org/10.1001/jama.290.4.511.

Shaman, J., Day, J. F., & Stieglitz, M. (2002). Drought-induced amplification of Saint Louis Encephalitis virus, Florida. Emerging Infectious Diseases, 8(6), 575–580. http://dx.doi.org/10.3201/eid0806.010417.

Shaman, J., Day, J. F., & Stieglitz, M. (2005). Drought-induced amplification and epidemic transmission of West Nile virus in southern Florida. Journal of Medical Entomology, 42(2), 134–141. http://dx.doi.org/10.1603/0022-2585(2005)042[0134:DAAETO]2.0.CO;2.

Sugumaran, R., Larson, S. R., & DeGroote, J. P. (2009). Spatio-temporal cluster analysis of county-based human West Nile virus incidence in the continental United States. International Journal of Health Geographics, 8(1), 43.

Swatantran, A., Dubayah, R., Goetz, S., Hofton, M., Betts, M. G., Sun, M., et al. (2012). Mapping migratory bird prevalence using remote sensing data fusion. PLoS ONE, 7(1), e28922. http://dx.doi.org/10.1371/journal.pone.0028922.

Walton, J. T. (2008). Subpixel urban land cover estimation: comparing cubist, random forests, and support vector regression. Photogrammetric Engineering and Remote Sensing, 74(10), 1213–1222.

Wang, F. (2006). Quantitative methods and applications in GIS. Boca Raton, Fla.: CRC/Taylor & Francis.

Wang, F., Guo, D., & McLafferty, S. (2012). Constructing geographic areas for cancer data analysis: a case study on late-stage breast cancer risk in Illinois. Applied Geography, 35(1–2), 1–11. http://dx.doi.org/10.1016/j.apgeog.2012.04.005.

Young, S. G., & Jensen, R. R. (2012). Statistical and visual analysis of human West Nile virus infection in the United States, 1999–2008. Applied Geography, 34(0), 425–431. http://dx.doi.org/10.1016/j.apgeog.2012.01.008.
