
Species Distribution Modeling and Prediction:

A Class Imbalance Problem

Reid A. Johnson¹, Nitesh V. Chawla¹, Jessica J. Hellmann²

¹ Department of Computer Science and Engineering

² Department of Biological Sciences

University of Notre Dame


Introduction

• Knowing where species live is important, with applications in ecological conservation and sustainability.

• Modeling techniques that estimate the potential distribution of species are used as proxies for actual observations.

• Species distribution modeling is the process of combining occurrence data with environmental variables to create a model of a species’ niche requirements.


Outline

• Background and problem setup

• Typical approaches

• Our approach

• Evaluation

• Results

Background — Problem Setup

Notice the imbalance between the number of occurrences (green) and the number of non-occurrences (non-green). We term this a “class imbalance”.
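To make the imbalance concrete, here is a minimal Python sketch that summarizes a set of presence/absence labels. The function name `imbalance_summary` and the counts (20 occurrences among 1,000 sites) are hypothetical and chosen only for illustration; they are not taken from the presentation's data.

```python
def imbalance_summary(labels):
    """Summarize class imbalance for binary labels (1 = occurrence, 0 = non-occurrence)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {
        "positives": n_pos,
        "negatives": n_neg,
        # Number of non-occurrences per occurrence.
        "ratio": n_neg / n_pos,
        # Accuracy of a model that always predicts the majority class.
        "majority_baseline_accuracy": max(n_pos, n_neg) / len(labels),
    }


# Hypothetical data: 20 occurrences among 1,000 surveyed sites.
labels = [1] * 20 + [0] * 980
summary = imbalance_summary(labels)
print(summary)
```

With these made-up counts, a model that always predicts "absent" is right 98% of the time while finding no occurrences at all, which is why plain accuracy says little in this setting.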


Background — Problem Setup

Background — Problem Attributes

• What are the problem attributes?

– Data characterized by two classes

– Severe data imbalance

– Many instances

• How are these attributes typically addressed?


MaxEnt Method

• Of all possible species distributions, choose the one that maximizes a goodness measure termed “entropy”.

– The unknown probability distribution is denoted $\pi$.

– The approximation of $\pi$ is $\hat{\pi}$.

– The entropy of $\hat{\pi}$ is defined as

$$H(\hat{\pi}) = -\sum_{x \in X} \hat{\pi}(x) \ln \hat{\pi}(x)$$

where $\ln$ is the natural logarithm.

– No unfounded constraints should be placed on $\hat{\pi}$.

Phillips, Steven J., and Miroslav Dudík. "Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation." Ecography 31.2 (2008): 161-175.
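The entropy definition on this slide can be evaluated directly. The sketch below is only an illustration of the formula, not the Maxent software or its fitting procedure; the function name `entropy` and the two example distributions over four sites are assumptions made for the example.

```python
import math


def entropy(pi_hat):
    """Entropy H = -sum(p * ln p) of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(pi_hat), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (p * ln p tends to 0 as p tends to 0).
    return -sum(p * math.log(p) for p in pi_hat if p > 0)


# Two hypothetical distributions over four sites.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
print(h_uniform, h_peaked)
```

With no constraints at all, the uniform distribution has the largest entropy (here $\ln 4$). In the approach summarized on the slide, constraints derived from the data narrow the candidate distributions, and the one with the highest entropy among those remaining is chosen.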


Problem Setup Revisited

Hellinger Distance Decision Trees

• HDDTs are a method of building decision trees using Hellinger distance as the splitting criterion:

$$HD = \sqrt{\sum_{i=1}^{p} \left( \sqrt{\frac{N_{+}^{i}}{N_{+}}} - \sqrt{\frac{N_{-}^{i}}{N_{-}}} \right)^{2}}$$

where

– $+, -$ are the classes of interest

– $p$ is the number of data partitions

– $N_{i}$ = number of samples in partition $i$

– $N_{+}, N_{-}$ = number of samples of class $+$, $-$, respectively

– $N_{+}^{i}, N_{-}^{i}$ = number of samples of class $+$, $-$ in partition $i$

Cieslak, David A., et al. "Hellinger distance decision trees are robust and skew-insensitive." Data Mining and Knowledge Discovery 24.1 (2012): 136-158.
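As a rough illustration of the splitting criterion, the following Python sketch computes the Hellinger distance for a candidate split from per-partition class counts. It is not the authors' HDDT implementation; the function name `hellinger_distance`, the input layout (a list of `(n_pos_i, n_neg_i)` pairs), and all counts are assumptions made for this example.

```python
import math


def hellinger_distance(partitions):
    """Hellinger distance of a split.

    partitions: list of (n_pos_i, n_neg_i) class counts, one pair per partition.
    """
    n_pos = sum(p for p, _ in partitions)
    n_neg = sum(n for _, n in partitions)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return math.sqrt(sum(
        (math.sqrt(p / n_pos) - math.sqrt(n / n_neg)) ** 2
        for p, n in partitions
    ))


# Hypothetical splits of 10 occurrences and 990 non-occurrences.
perfect = hellinger_distance([(10, 0), (0, 990)])        # classes fully separated
uninformative = hellinger_distance([(5, 495), (5, 495)])  # same class mix on both sides

# Multiplying the negative counts by 10 leaves the distance unchanged.
base = hellinger_distance([(8, 100), (2, 890)])
scaled = hellinger_distance([(8, 1000), (2, 8900)])
print(perfect, uninformative, base, scaled)
```

Because each class count is normalized by its own class total, the value depends only on how each class is distributed across the partitions and not on the ratio between the classes, which matches the skew-insensitivity noted in the cited paper's title. The distance ranges from 0 (no separation) to $\sqrt{2}$ (complete separation).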


Experiments — Species Observations

Experiments — Environmental Features

Layer   Variable                               Type
BIO1    Annual Mean Temperature                Temperature
BIO2    Mean Diurnal Range                     Temperature
BIO3    Isothermality                          Temperature
BIO4    Temperature Seasonality                Temperature
BIO5    Max. Temperature of Warmest Month      Temperature
BIO6    Min. Temperature of Coldest Month      Temperature
BIO7    Temperature Annual Range               Temperature
BIO8    Mean Temperature of Wettest Quarter    Temperature
BIO9    Mean Temperature of Driest Quarter     Temperature
BIO10   Mean Temperature of Warmest Quarter    Temperature
BIO11   Mean Temperature of Coldest Quarter    Temperature
BIO12   Annual Precipitation                   Precipitation
BIO13   Precipitation of Wettest Month         Precipitation
BIO14   Precipitation of Driest Month          Precipitation
BIO15   Precipitation Seasonality              Precipitation
BIO16   Precipitation of Wettest Quarter       Precipitation
BIO17   Precipitation of Driest Quarter        Precipitation
BIO18   Precipitation of Warmest Quarter       Precipitation
BIO19   Precipitation of Coldest Quarter       Precipitation

Source: BIOCLIM, http://www.worldclim.org/bioclim

Model Evaluation

• AUROC (area under the ROC curve):

– True positive rate: fraction of positive instances classified positive

– False positive rate: fraction of negative instances classified positive

• AUPR (area under the precision-recall curve):

– Precision: fraction of retrieved instances that are relevant

– Recall: fraction of relevant instances that are retrieved

• CORR:

– Correlation between observation and prediction

– Computed as a Pearson correlation coefficient
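The three metrics can be sketched in plain Python as follows. This is one possible formulation under stated assumptions, not the evaluation code used for the presentation: AUROC is computed through its pairwise-ranking interpretation, AUPR is approximated by average precision (a common step-wise estimate that assumes no tied scores), and the labels and scores are made up. The function names are hypothetical.

```python
import math


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / sum(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical observations (1 = occurrence) and model scores for eight sites.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
corr = pearson(labels, scores)
print(roc, ap, corr)
```

On this toy data the single highly ranked non-occurrence lowers average precision (5/6) more than AUROC (11/12), because precision is computed only over the instances retrieved so far while the false positive rate is diluted by the many non-occurrences.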


Results — Mean AUPR Per Method

Results — Model Distributions

[Figure: modeled distributions of V. olivaceus, with panels for MAXENT and HDDT]

Conclusion

• We characterize the problem as one of class imbalance.

• We introduce machine learning models that are robust against imbalance.

• We also introduce performance metrics that can help to more fully characterize model performance in this domain.

• Using these and other metrics, we demonstrate the potential of these models and provide recommendations for further investigation.


Acknowledgements

– We gratefully acknowledge support by the National Science Foundation (Award Number 0129584)


Questions?

