Species Distribution Modeling and Prediction:
A Class Imbalance Problem
Reid A. Johnson¹
Nitesh V. Chawla¹
Jessica J. Hellmann²
¹Department of Computer Science and Engineering
²Department of Biological Sciences
University of Notre Dame
Introduction
• Knowing where species live is important, with applications in ecological conservation and sustainability.
• Modeling techniques that estimate the potential distribution of species are used as proxies for actual observations.
• Species distribution modeling is the process of combining occurrence data with environmental variables to create a model of a species’ niche requirements.
Outline
• Background and problem setup
• Typical approaches
• Our approach
• Evaluation
• Results
Background — Problem Setup
Notice the imbalance between the number of occurrences (green) and the number of non-occurrences (non-green). We term this a “class imbalance”.
• What are the problem attributes?
– Data characterized by two classes
– Severe data imbalance
– Many instances
• How are these attributes typically addressed?
Background — Problem Attributes
MaxEnt Method
• Of all possible species distributions, choose the one that maximizes a goodness measure termed “entropy”.
– The unknown probability distribution is denoted $\pi$.
– The approximation of $\pi$ is $\hat{\pi}$.
– The entropy of $\hat{\pi}$ is defined as
$$H(\hat{\pi}) = -\sum_{x \in X} \hat{\pi}(x) \ln \hat{\pi}(x)$$
where $\ln$ is the natural logarithm.
– No unfounded constraints should be placed on $\hat{\pi}$.
Phillips, Steven J., and Miroslav Dudík. "Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation." Ecography 31.2 (2008): 161-175.
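The entropy definition above can be illustrated with a small sketch. This is not the Maxent software itself; it only evaluates the entropy of two toy distributions over four hypothetical map cells (the names `entropy`, `uniform`, and `peaked` are assumptions made for illustration). The uniform distribution, which imposes the fewest assumptions, has the higher entropy.

```python
import math

def entropy(probs):
    """Entropy (natural log) of a discrete probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Cells with zero probability contribute nothing (0 * ln 0 is taken as 0).
    return -sum(p * math.log(p) for p in probs if p > 0)

# Two toy distributions over four hypothetical map cells.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]

print("uniform:", entropy(uniform))
print("peaked: ", entropy(peaked))
```

In an actual MaxEnt model the maximization is carried out subject to constraints derived from the environmental features, which this sketch leaves out.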
Problem Setup Revisited
• HDDTs are a method of building decision trees using Hellinger distance as the splitting criterion:
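$$\mathrm{HD} = \sqrt{\sum_{i=1}^{p} \left( \sqrt{\frac{N_{+i}}{N_{+}}} - \sqrt{\frac{N_{-i}}{N_{-}}} \right)^{2}}$$
where
– $+, -$ are the classes of interest
– $p$ is the number of data partitions
– $N_{i}$ is the number of samples in partition $i$
– $N_{+}, N_{-}$ are the numbers of samples of class $+$ and $-$, respectively
– $N_{+i}, N_{-i}$ are the numbers of samples of class $+$ and $-$ in partition $i$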
Hellinger Distance Decision Trees
Cieslak, David A., et al. "Hellinger distance decision trees are robust and skew-insensitive." Data Mining and Knowledge Discovery 24.1 (2012): 136-158.
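As a minimal sketch of the splitting criterion, the function below computes the Hellinger distance of one candidate split from per-partition class counts. It is not the authors' HDDT implementation; the function name and the toy counts are assumptions for illustration. Because the distance depends only on the within-class proportions, multiplying the number of positive samples by ten leaves the value unchanged, which is the skew-insensitivity the method relies on.

```python
import math

def hellinger_distance(pos_counts, neg_counts):
    """Hellinger distance of a split.

    pos_counts[i] and neg_counts[i] are the numbers of positive and
    negative samples that fall into partition i of the candidate split.
    """
    n_pos = sum(pos_counts)
    n_neg = sum(neg_counts)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    total = 0.0
    for n_pos_i, n_neg_i in zip(pos_counts, neg_counts):
        total += (math.sqrt(n_pos_i / n_pos) - math.sqrt(n_neg_i / n_neg)) ** 2
    return math.sqrt(total)

# Same within-class proportions, very different class ratios (toy counts).
d_rare = hellinger_distance([9, 1], [100, 900])      # 10 positives vs 1000 negatives
d_common = hellinger_distance([90, 10], [100, 900])  # 100 positives vs 1000 negatives
d_perfect = hellinger_distance([10, 0], [0, 1000])   # split separates the classes

print(d_rare, d_common, d_perfect)
```

A tree builder would evaluate this value for each candidate split and keep the split with the largest distance.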
Experiments — Species Observations
Layer  Variable                             Type
BIO1   Annual Mean Temperature              Temp.
BIO2   Mean Diurnal Range                   Temp.
BIO3   Isothermality                        Temp.
BIO4   Temperature Seasonality              Temp.
BIO5   Max. Temperature of Warmest Month    Temp.
BIO6   Min. Temperature of Coldest Month    Temp.
BIO7   Temperature Annual Range             Temp.
BIO8   Mean Temperature of Wettest Quarter  Temp.
BIO9   Mean Temperature of Driest Quarter   Temp.
BIO10  Mean Temperature of Warmest Quarter  Temp.
BIO11  Mean Temperature of Coldest Quarter  Temp.
BIO12  Annual Precipitation                 Prec.
BIO13  Precipitation of Wettest Month       Prec.
BIO14  Precipitation of Driest Month        Prec.
BIO15  Precipitation Seasonality            Prec.
BIO16  Precipitation of Wettest Quarter     Prec.
BIO17  Precipitation of Driest Quarter      Prec.
BIO18  Precipitation of Warmest Quarter     Prec.
BIO19  Precipitation of Coldest Quarter     Prec.
Experiments — Environmental Features
BIOCLIM http://www.worldclim.org/bioclim
• AUROC:
– True positive rate: fraction of positive instances classified positive
– False positive rate: fraction of negative instances classified positive
• AUPR:
– Precision: fraction of retrieved instances that are relevant
– Recall: fraction of relevant instances that are retrieved
• CORR:
– Correlation between observation and prediction
– Computed as a Pearson correlation coefficient
Model Evaluation
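The quantities behind these metrics can be sketched at a single decision threshold. The labels, scores, and the 0.5 threshold below are toy values assumed for illustration, not data from the experiments; AUROC and AUPR are the areas under the curves traced out by these rates as the threshold varies, which this sketch does not compute.

```python
import math

def threshold_metrics(y_true, scores, threshold=0.5):
    """TPR, FPR, precision, and recall at one threshold.

    Assumes both classes are present and at least one instance is
    predicted positive, so no denominator is zero.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "tpr": tp / (tp + fn),        # positives classified positive
        "fpr": fp / (fp + tn),        # negatives classified positive
        "precision": tp / (tp + fp),  # retrieved instances that are relevant
        "recall": tp / (tp + fn),     # relevant instances that are retrieved
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy imbalanced example: 2 occurrences, 8 non-occurrences.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1, 0.3, 0.2, 0.1, 0.05, 0.15]

metrics = threshold_metrics(y_true, scores)
corr = pearson(y_true, scores)
print(metrics, corr)
```

With few positives, a single false positive moves precision far more than it moves the false positive rate, which is one reason to report AUPR alongside AUROC on imbalanced data.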
Results — Mean AUPR Per Method
Results — Model Distributions
[Figure: modeled distributions of V. olivaceus; panels: MAXENT, HDDT]
• We characterize the problem as one of class imbalance.
• We introduce machine learning models that are robust against imbalance.
• We also introduce performance metrics that help to characterize model performance more fully in this domain.
• Using these and other metrics, we demonstrate the potential of these models and provide recommendations for further investigation.
Conclusion
Acknowledgements
– We gratefully acknowledge support by the National Science Foundation (Award Number 0129584)
Questions?