Model Evaluation
A. Townsend Peterson, University of Kansas
Generalities
• Calibration data and evaluation data must be independent
• Important to establish whether the observed coincidence between model predictions and testing data is closer than random expectations
• Only once a model is tested (successfully) should the model be interpreted and explored
Threshold-dependent or Not?
Thresholded
• PRO
– Simplicity of test
– Clear interpretation
– Computation is easy
• CON
– Assumptions required in thresholding
– Less well accepted by the community (who cares?)
Continuous
• PRO
– Avoids need for thresholding and its assumptions
– Very well accepted by community
• CON
– Less clear in interpretation
– Known problems with ROC AUC
– Computational challenges
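The thresholding step itself can be sketched in a few lines. This is a minimal illustration, not from the slides: the function name, the example scores, and the use of the "minimum training presence" rule are all assumptions for the sake of the example.

```python
def apply_threshold(suitability, tau):
    """Convert continuous suitability scores to binary present (1) / absent (0)."""
    return [1 if s >= tau else 0 for s in suitability]

# One common thresholding rule (an assumption here, not the only option):
# the minimum suitability observed at any training presence point
# ("minimum training presence" threshold).
train_scores = [0.62, 0.48, 0.71]
tau = min(train_scores)

binary = apply_threshold([0.9, 0.3, 0.5, 0.48], tau)  # -> [1, 0, 1, 1]
```

The choice of rule for `tau` is exactly the "assumptions required in thresholding" listed as a CON above: different rules give different binary maps from the same continuous prediction.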
Binomial Test
• Given a SINGLE threshold
• Proportional area predicted present determines the expected number of points correctly predicted
• Binomial test assesses whether observed number of successes is greater than that expected by chance alone
If predicted suitable area covers 15% of the testing area, then 15% of evaluation points are expected to fall in the predicted suitable area by chance.
• p = proportion of area predicted suitable
• s = number of successes
• n = number of evaluation points
• =1-BINOMDIST(s-1,n,p,TRUE) (Excel; using s-1 gives the probability of s or more successes)
The cumulative binomial distribution gives the probability of obtaining s or more successes out of n trials in a situation in which proportion p of the testing area is predicted present. If this probability is below 0.05, we interpret the situation as indicating that the model's predictions are significantly better than random.
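The test above can be computed directly from the binomial upper tail, without a spreadsheet. This is a minimal sketch using only the Python standard library; the function name and the example numbers (12 of 20 evaluation points falling in a predicted area covering 15% of the testing region) are illustrative assumptions.

```python
from math import comb

def binomial_test_p(s, n, p):
    """One-tailed p-value: P(X >= s) for X ~ Binomial(n, p).
    s = number of successes, n = number of evaluation points,
    p = proportion of the testing area predicted suitable."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

# Illustrative numbers (assumed): 15% of area predicted suitable,
# 12 of 20 evaluation points fall inside the prediction.
p_value = binomial_test_p(12, 20, 0.15)
# p_value is far below 0.05, so the prediction beats chance expectations.
```

This is equivalent to the Excel formula on the slide; summing from s upward is the same as one minus the cumulative probability through s-1.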
Threshold-dependent Approach
Threshold-independent Approaches
[ROC plot: y-axis = correct prediction of presence information (= avoidance of omission error); x-axis = correct prediction of absence information (= avoidance of commission error)]
ROC Problems
• Ignores predicted probability values … just a ranking of suitabilities
• Speaks to regions of ROC space (= predictions) that are not particularly relevant
• Weights omission and commission errors equally
• No information about spatial distribution of model errors
• Study area extent determines outcomes!
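The first problem listed, that ROC AUC uses only the ranking of suitabilities, is easy to demonstrate. The sketch below (not from the slides; function name and scores are illustrative) computes AUC as the probability that a randomly chosen presence outscores a randomly chosen absence, then shows that any order-preserving rescaling of the scores leaves AUC unchanged.

```python
def auc_from_scores(pos, neg):
    """AUC as P(random presence score > random absence score),
    i.e. Mann-Whitney U / (n_pos * n_neg); ties count half."""
    wins = sum((p > a) + 0.5 * (p == a) for p in pos for a in neg)
    return wins / (len(pos) * len(neg))

pos = [0.9, 0.7, 0.4]   # suitability at presence points (illustrative)
neg = [0.6, 0.3, 0.2]   # suitability at background/absence points (illustrative)

auc1 = auc_from_scores(pos, neg)
# Cubing preserves the ranking, so AUC is identical even though the
# predicted values themselves change drastically:
auc2 = auc_from_scores([x**3 for x in pos], [x**3 for x in neg])
```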
Significance vs Performance
• Establishing that predictions are significantly better than random is important, and is a sine qua non for model interpretation
• BUT, it is also important to ensure that the model performs sufficiently well for the intended uses of the output
• Performance measures include omission rate, correct classification rate, etc.
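The performance measures named above can be computed from a binary prediction evaluated at presence and absence points. A minimal sketch, with assumed function and variable names and illustrative data:

```python
def performance(pred_at_presences, pred_at_absences):
    """Threshold-dependent performance measures from binary predictions.
    Inputs are lists of 0/1: 1 = predicted present, 0 = predicted absent.
    Returns (omission rate, commission rate, correct classification rate)."""
    omission = pred_at_presences.count(0) / len(pred_at_presences)   # presences missed
    commission = pred_at_absences.count(1) / len(pred_at_absences)   # absences over-predicted
    n_total = len(pred_at_presences) + len(pred_at_absences)
    n_correct = pred_at_presences.count(1) + pred_at_absences.count(0)
    return omission, commission, n_correct / n_total

# Illustrative data: 4 presence points (one missed), 4 absence points (one over-predicted).
om, com, ccr = performance([1, 1, 1, 0], [0, 0, 1, 0])
```

A model can pass the binomial significance test yet still show an omission rate too high for, say, reserve selection, which is why significance and performance are assessed separately.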