Model Evaluation
A. Townsend Peterson, University of Kansas
Generalities
• Calibration data and evaluation data must be independent
• Important to establish whether the observed coincidence between model predictions and testing data is closer than random expectations
• Only once a model is tested (successfully) should the model be interpreted and explored
Threshold-dependent or Not?
Thresholded
• PRO
– Simplicity of test
– Clear interpretation
– Computation is easy
• CON
– Assumptions required in thresholding
– Less well accepted by the community (who cares?)
Continuous
• PRO
– Avoids need for thresholding and its assumptions
– Very well accepted by community
• CON
– Less clear in interpretation
– Known problems with ROC AUC
– Computational challenges
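The thresholding step itself can be sketched in a few lines. This is a minimal illustration, not from the slides: the function name, the example scores, and the use of the "minimum training presence" rule are all assumptions for the sake of the example.

```python
def apply_threshold(suitability, tau):
    """Convert continuous suitability scores to binary present (1) / absent (0)."""
    return [1 if s >= tau else 0 for s in suitability]

# One common thresholding rule (an assumption here, not the only option):
# the minimum suitability observed at any training presence point
# ("minimum training presence" threshold).
train_scores = [0.62, 0.48, 0.71]
tau = min(train_scores)

binary = apply_threshold([0.9, 0.3, 0.5, 0.48], tau)  # -> [1, 0, 1, 1]
```

The choice of rule for `tau` is exactly the "assumptions required in thresholding" listed as a CON above: different rules give different binary maps from the same continuous prediction.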
Binomial Test
• Given a SINGLE threshold
• Proportional area predicted present determines the expected number of points correctly predicted
• Binomial test assesses whether observed number of successes is greater than that expected by chance alone
If predicted suitable area covers 15% of the testing area, then 15% of evaluation points are expected to fall in the predicted suitable area by chance.
• p = proportion of area predicted suitable
• s = number of successes
• n = number of evaluation points
• =1-BINOMDIST(s-1,n,p,TRUE) (Excel; using s-1 gives the probability of s or more successes)
The cumulative binomial distribution gives the probability of obtaining s or more successes out of n trials in a situation in which proportion p of the testing area is predicted present. If this probability is below 0.05, we interpret the situation as indicating that the model's predictions are significantly better than random.
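The test above can be computed directly from the binomial upper tail, without a spreadsheet. This is a minimal sketch using only the Python standard library; the function name and the example numbers (12 of 20 evaluation points falling in a predicted area covering 15% of the testing region) are illustrative assumptions.

```python
from math import comb

def binomial_test_p(s, n, p):
    """One-tailed p-value: P(X >= s) for X ~ Binomial(n, p).
    s = number of successes, n = number of evaluation points,
    p = proportion of the testing area predicted suitable."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

# Illustrative numbers (assumed): 15% of area predicted suitable,
# 12 of 20 evaluation points fall inside the prediction.
p_value = binomial_test_p(12, 20, 0.15)
# p_value is far below 0.05, so the prediction beats chance expectations.
```

This is equivalent to the Excel formula on the slide; summing from s upward is the same as one minus the cumulative probability through s-1.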
Threshold-dependent Approach
Threshold-independent Approaches
[ROC plot: y-axis = correct prediction of presence information (= avoidance of omission error); x-axis = correct prediction of absence information (= avoidance of commission error)]
ROC Problems
• Ignores predicted probability values … just a ranking of suitabilities
• Speaks to regions of ROC space (= predictions) that are not particularly relevant
• Weights omission and commission errors equally
• No information about spatial distribution of model errors
• Study area extent determines outcomes!
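The first problem listed, that ROC AUC uses only the ranking of suitabilities, is easy to demonstrate. The sketch below (not from the slides; function name and scores are illustrative) computes AUC as the probability that a randomly chosen presence outscores a randomly chosen absence, then shows that any order-preserving rescaling of the scores leaves AUC unchanged.

```python
def auc_from_scores(pos, neg):
    """AUC as P(random presence score > random absence score),
    i.e. Mann-Whitney U / (n_pos * n_neg); ties count half."""
    wins = sum((p > a) + 0.5 * (p == a) for p in pos for a in neg)
    return wins / (len(pos) * len(neg))

pos = [0.9, 0.7, 0.4]   # suitability at presence points (illustrative)
neg = [0.6, 0.3, 0.2]   # suitability at background/absence points (illustrative)

auc1 = auc_from_scores(pos, neg)
# Cubing preserves the ranking, so AUC is identical even though the
# predicted values themselves change drastically:
auc2 = auc_from_scores([x**3 for x in pos], [x**3 for x in neg])
```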
Significance vs Performance
• Establishing that predictions are significantly better than random is important, and is a sine qua non for model interpretation
• BUT, it is also important to ensure that the model performs sufficiently well for the intended uses of the output
• Performance measures include omission rate, correct classification rate, etc.
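The performance measures named above can be computed from a binary prediction evaluated at presence and absence points. A minimal sketch, with assumed function and variable names and illustrative data:

```python
def performance(pred_at_presences, pred_at_absences):
    """Threshold-dependent performance measures from binary predictions.
    Inputs are lists of 0/1: 1 = predicted present, 0 = predicted absent.
    Returns (omission rate, commission rate, correct classification rate)."""
    omission = pred_at_presences.count(0) / len(pred_at_presences)   # presences missed
    commission = pred_at_absences.count(1) / len(pred_at_absences)   # absences over-predicted
    n_total = len(pred_at_presences) + len(pred_at_absences)
    n_correct = pred_at_presences.count(1) + pred_at_absences.count(0)
    return omission, commission, n_correct / n_total

# Illustrative data: 4 presence points (one missed), 4 absence points (one over-predicted).
om, com, ccr = performance([1, 1, 1, 0], [0, 0, 1, 0])
```

A model can pass the binomial significance test yet still show an omission rate too high for, say, reserve selection, which is why significance and performance are assessed separately.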