Two approaches to species distribution modeling and how climate change is incorporated
Juan M. Barrios
November 1st, 2017
National Commission for the Knowledge and Use of Biodiversity (CONABIO)
Outline
About CONABIO
Species Distribution Modeling
Let’s talk about data
About CONABIO
• CONABIO was founded in 1996. Since then it has been devoted to organizing the existing information on the biota of Mexico.
• At the time, most of that information was recorded by academic institutions and some private individuals. Opportunities to analyze the biota data were being missed because there was no consolidated data facility.
• That primary goal resulted in the birth of the National Biodiversity Information System (SNIB).
SNIB
• There are approximately 8.8 million curated data points.
• Observations date back as far as 1579; those records are inferred from the literature.
• A large part of CONABIO's personnel and current projects are responsible for integrating and curating new information.
[Figure: Cumulative species observation counts in SNIB, by year. x-axis: year (1600 to 2000); y-axis: observations (0 to 7,500,000).]
SNIB is more than species registries
There are about 13,000 digital maps on various subjects: vegetation maps, species distributions, satellite imagery, soil types, burning risk maps, water temperature maps, and much more. Most of these can currently be accessed at http://www.conabio.gob.mx/informacion/gis
Later this year we are going to release an updated tool to explore and download all this information.
There are many more projects at CONABIO
Some of them are...
• Genetic diversity project.
• MADMEX-REDD+, a project to monitor deforestation and forest degradation.
• Environmental suitability protection.
• Algal blooming and sea biodiversity programs.
• Invasive species studies.
• Mangrove monitoring, phenotypic data and chemical variables.
• Biodiversity Monitoring Network.
A common goal:
Try to understand the biological processes with data
Species Distribution Modeling
What is Species Distribution Modeling (SDM)?
• The goal is to determine where a given species is present.
• To achieve this, most studies look at a set of meaningful climate and topographic variables to find suitable conditions based on current species observations (Ecological Niche Modeling).
• But Species Distribution Modeling should also consider whether a suitable niche can really be a habitat for a particular species. [Peterson and Soberón, 2012]
Classical approach: MaxEnt, problem statement
Given an area of interest $D$ on which some environmental variables $x_1, x_2, \ldots, x_m$ are defined, and a set of sites $z_1, z_2, \ldots, z_n$ where individuals of a particular species were observed, we intend to estimate the range of the species' habitat.
MaxEnt: How to do that?
It is assumed that the sites $\{z_i\}$ were selected independently from an unknown probability measure $p$ on $D$.
The principle of maximum entropy states that, absent any other information, this probability is uniform over $D$. The uniform distribution is the "most random" of all.
Constraints
We have to consider that the species might prefer some environmental features.
Then one needs to find the probability distribution $\hat{p}$ that maximizes the entropy subject to the constraint that the expectation of the features $x(z)$ under $\hat{p}$ matches the sample mean of those features:
\[
\frac{1}{n}\sum_{i=1}^{n} x(z_i) = \int_D x(z)\,\hat{p}(z)\,dz = E_{\hat{p}}[x(z)].
\]
MaxEnt
This criterion is equivalent to maximizing the likelihood of the parametric model
\[
p(z) = \frac{e^{\lambda \cdot x(z)}}{\int_D e^{\lambda \cdot x(u)}\,du}.
\]
This likelihood is the same as that of an inhomogeneous Poisson process (IPP) with a log-linear rate function.
MaxEnt is software for modeling species distributions from presence-only data using this approach (https://biodiversityinformatics.amnh.org/open_source/maxent/).
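To make the moment-matching constraint concrete, here is a minimal sketch that fits the log-linear model on a finite set of grid cells, assuming the domain D has been discretized so that the integral becomes a sum. It uses plain gradient ascent on the mean log-likelihood, which is not the algorithm implemented in the MaxEnt software (and it omits regularization); the function name `fit_maxent`, the step size and the iteration count are hypothetical choices made for illustration.

```python
import numpy as np

def fit_maxent(features, presence_idx, lr=0.5, n_iter=5000):
    """Fit p(z) proportional to exp(lambda . x(z)) over a finite set of cells.

    features:     (n_cells, m) array, one row of environmental features per cell
                  (assumed scaled to a comparable range, e.g. [0, 1]).
    presence_idx: indices of the cells where the species was observed.
    """
    x = np.asarray(features, dtype=float)
    sample_mean = x[presence_idx].mean(axis=0)    # (1/n) * sum_i x(z_i)
    lam = np.zeros(x.shape[1])
    for _ in range(n_iter):
        logits = x @ lam
        p = np.exp(logits - logits.max())         # subtract max for numerical stability
        p /= p.sum()                              # discrete analogue of the integral over D
        model_mean = p @ x                        # E_p[x(z)]
        lam += lr * (sample_mean - model_mean)    # gradient of the mean log-likelihood
    logits = x @ lam
    p = np.exp(logits - logits.max())
    return lam, p / p.sum()

# Synthetic example: 400 cells, 2 features, presences drawn from a known model.
rng = np.random.default_rng(0)
x = rng.uniform(size=(400, 2))
true_p = np.exp(x @ np.array([2.0, -1.0]))
true_p /= true_p.sum()
presence = rng.choice(400, size=200, p=true_p)
lam_hat, p_hat = fit_maxent(x, presence)
```

At convergence the gradient vanishes, so the fitted expectation `p_hat @ x` agrees with the sample mean of the features at the presence sites, which is exactly the constraint stated above.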
Practical application
• Clean the data.
• Select the spatial region in a biologically meaningful way.
• Fit the MaxEnt models.
• Given the fitted model as a map, work with experts in the field to produce more meaningful output.
What about climate change?
• From the model's point of view, we only have a new set of values for the environmental variables, so we just evaluate those new values on the fitted model.
• This raises a new problem: what happens with values that were never observed?
• Just delete those regions.
• Evaluate them and deal in some manner with those regions.
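The two options above can be sketched as follows. This is only an illustration under stated assumptions: "never observed" is interpreted here as any future value falling outside the per-variable range seen in the training data, which is one simple choice among many, and the function name `project_suitability` and its parameters are hypothetical.

```python
import numpy as np

def project_suitability(lam, future_features, train_features, drop_novel=True):
    """Evaluate a fitted log-linear model on new environmental values.

    Returns the relative (unnormalized) suitability exp(lambda . x) per cell
    and a boolean mask of cells whose values fall outside the training range.
    """
    x_new = np.asarray(future_features, dtype=float)
    x_train = np.asarray(train_features, dtype=float)
    lo, hi = x_train.min(axis=0), x_train.max(axis=0)
    # A cell is "novel" if any variable leaves the range observed during fitting.
    novel = np.any((x_new < lo) | (x_new > hi), axis=1)
    score = np.exp(x_new @ np.asarray(lam, dtype=float))
    if drop_novel:
        # Option 1: delete those regions (marked here as NaN).
        score = np.where(novel, np.nan, score)
    # Option 2 (drop_novel=False): keep the extrapolated values and let the
    # caller decide how to treat the cells flagged in `novel`.
    return score, novel

train = np.array([[0.0, 0.0], [1.0, 1.0]])
future = np.array([[0.5, 0.5], [2.0, 0.5]])
score, novel = project_suitability([1.0, 0.0], future, train)
```

In this toy call the second future cell has a first variable of 2.0, beyond the training maximum of 1.0, so it is flagged and dropped, while the first cell is evaluated normally.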
A second approach
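Now we try to estimate the probability of an observation given a set of features, $p(\cdot \mid x)$. Instead of estimating it directly, we consider the log odds,
\[
S(c \mid x) = \ln\left(\frac{p(c \mid x)}{p(\bar{c} \mid x)}\right).
\]
Using Bayes' theorem and assuming conditional independence of the features given the species $c$ (and likewise given its absence $\bar{c}$), $p(x \mid c) = \prod_i p(x_i \mid c)$, we have
\[
S(c \mid x) = \sum_i S(c \mid x_i) + \ln\left(\frac{p(c)}{p(\bar{c})}\right).
\]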
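For readers who want the intermediate steps, the decomposition can be worked out as below. This is a sketch of the standard naive Bayes argument, writing $S(c \mid x_i) = \ln\big(p(x_i \mid c)/p(x_i \mid \bar{c})\big)$ for the per-feature term and assuming the factorization holds for both $c$ and $\bar{c}$.

```latex
\begin{align*}
S(c \mid x)
  &= \ln\frac{p(c \mid x)}{p(\bar{c} \mid x)}
   = \ln\frac{p(x \mid c)\,p(c)/p(x)}{p(x \mid \bar{c})\,p(\bar{c})/p(x)}
   && \text{(Bayes' theorem)} \\
  &= \ln\frac{p(x \mid c)}{p(x \mid \bar{c})} + \ln\frac{p(c)}{p(\bar{c})}
   && \text{($p(x)$ cancels)} \\
  &= \ln\frac{\prod_i p(x_i \mid c)}{\prod_i p(x_i \mid \bar{c})} + \ln\frac{p(c)}{p(\bar{c})}
   && \text{(conditional independence)} \\
  &= \sum_i \ln\frac{p(x_i \mid c)}{p(x_i \mid \bar{c})} + \ln\frac{p(c)}{p(\bar{c})}
   = \sum_i S(c \mid x_i) + \ln\frac{p(c)}{p(\bar{c})}.
\end{align*}
```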
Marginal scores
Then we estimate the marginal log odds
\[
S(c \mid x_i) = \ln\left(\frac{p(x_i \mid c)}{p(x_i \mid \bar{c})}\right)
\]
as
\[
p(x_i \mid c) = \frac{|\{c \cap x_i\}|}{|\{c\}|}, \qquad
p(x_i \mid \bar{c}) = \frac{|\{\bar{c} \cap x_i\}|}{N - |\{c\}|}.
\]
To keep these estimates well defined when $|\{c \cap x_i\}| = 0$ or $|\{c\}| = N$, we use Laplace smoothing.
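A minimal sketch of these estimates from raw counts follows. It assumes that N is the total number of grid cells and that the smoothing is the usual add-alpha form over two outcomes (the feature value is present or absent in a cell); the slides do not state the exact variant, so the pseudo-count `alpha` and the function names are hypothetical.

```python
import math

def marginal_score(n_cx, n_c, n_x, n_total, alpha=1.0):
    """Smoothed estimate of S(c | x_i) = ln(p(x_i | c) / p(x_i | c_bar)).

    n_cx:    cells where species c and feature value x_i co-occur, |{c and x_i}|
    n_c:     cells where c is present, |{c}|
    n_x:     cells where x_i holds
    n_total: total number of cells N (assumption: N counts grid cells)
    alpha:   pseudo-count (assumption: add-alpha smoothing over two outcomes)
    """
    n_cbar_x = n_x - n_cx        # |{c_bar and x_i}|
    n_cbar = n_total - n_c       # N - |{c}|
    p_x_given_c = (n_cx + alpha) / (n_c + 2 * alpha)
    p_x_given_cbar = (n_cbar_x + alpha) / (n_cbar + 2 * alpha)
    return math.log(p_x_given_c / p_x_given_cbar)

def total_score(marginals, n_c, n_total, alpha=1.0):
    """S(c | x): sum of the marginal scores plus the smoothed log prior odds."""
    prior = math.log((n_c + alpha) / (n_total - n_c + alpha))
    return sum(marginals) + prior

# No co-occurrence at all: the unsmoothed estimate would be ln(0).
s_never = marginal_score(n_cx=0, n_c=10, n_x=20, n_total=100)
# Species present in every cell: the unsmoothed denominator N - |{c}| would be 0.
s_everywhere = marginal_score(n_cx=5, n_c=100, n_x=5, n_total=100)
```

Both edge cases named in the text now produce finite scores; with `alpha=0` the first would fail with a math domain error and the second with a division by zero.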
Some discussion
• This approach only looks for co-occurrences on a gridded space.
• We can consider a space-time grid to incorporate time into the model.
• Hence it is easy to incorporate some time-dependent climate variability.
• We can also incorporate different data, such as occurrences of other taxa.
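As an illustration of counting co-occurrences on a space-time grid, the sketch below bins records into regular cells and counts the cells shared by two labels. The record layout `(label, lon, lat, year)`, the cell size and the ten-year time bins are all assumptions chosen for the example, not the settings used by the authors; a label can be a species, another taxon, or a discretized climate value.

```python
import math
from collections import defaultdict

def cell_id(lon, lat, year, cell_deg=0.5, years_per_bin=10):
    """Map a record to a (column, row, time-bin) cell of a regular space-time grid."""
    return (math.floor(lon / cell_deg),
            math.floor(lat / cell_deg),
            int(year) // years_per_bin)

def cells_by_label(records, **grid):
    """Collect, for every label, the set of space-time cells where it occurs."""
    cells = defaultdict(set)
    for label, lon, lat, year in records:
        cells[label].add(cell_id(lon, lat, year, **grid))
    return cells

def cooccurrences(cells, a, b):
    """Number of cells in which labels a and b both occur."""
    return len(cells[a] & cells[b])

# Made-up records for illustration only.
records = [
    ("sp_a", -99.1, 19.4, 1995),
    ("sp_b", -99.3, 19.2, 1991),   # same spatial cell and same decade as sp_a
    ("sp_b", -99.1, 19.4, 2005),   # same spatial cell, different decade
]
cells = cells_by_label(records)
shared = cooccurrences(cells, "sp_a", "sp_b")
```

Because the time bin is part of the cell identifier, the 2005 record does not count as a co-occurrence with the 1995 one, which is how time enters the counts.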
SPECIES: Platform to explore ecological data
We developed a platform: http://species.conabio.gob.mx/candidate/
• Fast prototyping.
• Create reproducible experiments; you can ask for a unique link with your setup.
• Incorporates some performance metrics and statistics.
Some technical advantages of SPECIES
• There is an API that exposes all the calculations of the application, so you can use it from Python and R.
• The API also has some endpoints that give you clean data to use in another workflow.
• Our database design is robust enough to perform cross-validation in real time (2 min or less).
• We plan to release all parts of the application as open source.
Let’s talk about data
Species data is messy
• Species can be misidentified.
• We can have atypical data points. In Mexico we have observations of lions, but there is no lion population in Mexico.
• The taxonomic classification of species changes over time.
• There is bias: taxonomic group bias and spatial sampling bias.
[Figure: Observations by year for some representative taxonomic classes (Amphibia, Aves, Liliopsida, Magnoliopsida, Mammalia, Reptilia). x-axis: year (1900 to 2000); y-axis: observations (0 to 6e+06).]
Joint work with
SPECIES team (UNAM-CONABIO)
C. Stephens (C3-UNAM, Phys)
C. González (UAM-Lerma, Bio)
R. Sierra (CONABIO, Math)
J.C. Salazar (CONABIO, Eng)
E. Rovredo (CONABIO, Bio)
CONABIO ecosystems evaluation team
A. Cuervo (UNAM-CONABIO, Bio)
W. Tobon (CONABIO, Bio)
D. Ramirez (CONABIO, Bio)
J. Lopez (CONABIO, Bio)
T. Urquiza (CONABIO, Bio)
Questions?
References
W. Fithian and T. Hastie. Finite-sample equivalence in statistical models for presence-only data. The Annals of Applied Statistics, 7(4):1917, 2013.
C. González-Salazar, C. R. Stephens, and P. A. Marquet. Comparing the relative contributions of biotic and abiotic factors as mediators of species' distributions. Ecological Modelling, 248:57–70, 2013.
A. T. Peterson and J. Soberón. Species distribution modeling and ecological niche modeling: getting the concepts right. Natureza & Conservação, 10(2):102–107, 2012.
S. J. Phillips, M. Dudík, and R. E. Schapire. A maximum entropy approach to species distribution modeling. In Proceedings of the 21st International Conference on Machine Learning, page 83. ACM, 2004.
J. Sarukhán and R. Jiménez. Generating intelligence for decision making and sustainable use of natural capital in Mexico. Current Opinion in Environmental Sustainability, 19:153–159, 2016.