Forecast Evaluation Concepts
• Forecast evaluation basics – Tressa Fowler
• Evaluation of categorical variables – Tara Jensen
• Evaluation of continuous variables – Tressa Fowler
• Evaluation of probabilistic forecasts – Barbara Brown
• Intro to spatial forecast verification – Barbara Brown
3:40 – 5:10 Exercises using R, a statistical tool
• Overview of tools and useful links
• Introduction to R (Tara Jensen)
Basic Verification Concepts
Tressa L. Fowler
National Center for Atmospheric Research
Boulder Colorado USA
Basic concepts - outline
• What is verification?
• Why verify?
• Identifying verification goals
• Forecast “goodness”
• Designing a verification study
• Types of forecasts and observations
• Matching forecasts and observations
• Verification attributes
• Miscellaneous issues
• Questions to ponder: Who? What? When? Where? Which?
Why?
What is verification?
• Verification is the process of comparing forecasts to relevant observations
– Verification is one aspect of measuring forecast goodness
• Verification measures the quality of forecasts (as opposed to their value)
• For many purposes a more appropriate term is “evaluation”
Why verify?
• Purposes of verification (traditional definition)
– Administrative purpose
• Monitoring performance
• Choice of model or model configuration (has the model improved?)
– Scientific purpose
• Identifying and correcting model flaws
• Forecast improvement
– Economic purpose
• Improved decision making
• “Feeding” decision models or decision support systems
Identifying verification goals
What questions do we want to answer?
• Examples:
– In what locations does the model have the best performance?
– Are there regimes in which the forecasts are better or worse?
– Is the probability forecast well calibrated (i.e., reliable)?
– Do the forecasts correctly capture the natural variability of the weather?
• Other examples?
Identifying verification goals (cont.)
• What forecast performance attribute should be measured?
– Related to the question as well as the type of forecast and observation
• Choices of verification statistics, measures, graphics
– Should match the type of forecast and the attribute of interest
– Should measure the quantity of interest (i.e., the quantity represented in the question)
Forecast “goodness”
• Depends on the quality of the forecast
AND
• The user and his/her application of the forecast information
Seasonal Forecast: Streamflow 15% > normal. Good or bad?
Seasonal Forecast: Streamflow 15% > normal. Good or bad?
Good forecast or Bad forecast?
• Agricultural users: No problem, draw full water rights at leisure.
• Rafting companies: Geared up for a busy rafting season, suffered losses when many sections of river were closed for safety.
Different users have different ideas about
what makes a forecast good
Different verification approaches can measure different types of “goodness”
Basic guide for developing verification studies
Consider the users…
– … of the forecasts
– … of the verification information
• What aspects of forecast quality are of interest for the user?
– Typically (always?) need to consider multiple aspects
• Develop verification questions to evaluate those aspects/attributes
• Exercise: What verification questions and attributes would be of interest to …
– … operators of an electric utility?
– … a city emergency manager?
– … a mesoscale model developer?
– … aviation planners?
Basic guide for developing verification studies
Identify observations that represent the event being forecast, including the
– Element (e.g., temperature, precipitation)
– Temporal resolution
– Spatial resolution and representation
– Thresholds, categories, etc.
Observations are not truth
• We can’t know the complete “truth”.
• Observations generally are more “true” than a model analysis (at least they are relatively more independent).
• Observational uncertainty should be taken into account in whatever way possible. In other words, how well do adjacent observations match each other?
Observations might be garbage if
• Not Independent (of forecast or each other)
• Biased
– Space
– Time
– Instrument
– Sampling
– Reporting
• Measurement errors
• Not enough of them
Basic guide for developing verification studies
Identify multiple verification attributes that can provide answers to the questions of interest
Select measures and graphics that appropriately measure and represent the attributes of interest
Identify a standard of comparison that provides a reference level of skill (e.g., persistence, climatology, old model)
Types of forecasts, observations
• Continuous
– Diurnal Temperature Range
– Rainfall amount
– Annual Snowfall
• Categorical
– Dichotomous
  Rain vs. no rain
  Strong winds vs. no strong winds
  Night frost vs. no frost
  Often formulated as Yes/No
– Multi-category
  Cloud amount category
  Precipitation type
– May result from subsetting continuous variables into categories
  Ex: Temperature categories of 0-10, 11-20, 21-30, etc.
Types of forecasts, observations
• Probabilistic
– Observation can be dichotomous, multi-category, or continuous
  Precipitation occurrence – dichotomous (Yes/No)
  Precipitation type – multi-category
  Temperature distribution – continuous
– Forecast can be
  A single probability value (for dichotomous events)
  Multiple probabilities (discrete probability distribution for multiple categories)
  A continuous distribution
– For dichotomous or multiple categories, probability values may be limited to certain values (e.g., multiples of 0.1)
• Ensemble
– Multiple iterations of a continuous or categorical forecast
  May be transformed into a probability distribution
– Observations may be continuous, dichotomous, or multi-category
[Examples: 2-category precipitation forecast (PoP) for the US; ECMWF 2-m temperature meteogram for Helsinki]
Verification attributes
• Verification attributes measure different aspects of forecast quality
– Represent a range of characteristics that should be considered
– Many can be related to joint, conditional, and marginal distributions of forecasts and observations
Example contingency table (tornado forecasts):

                    Tornado observed
  Tornado forecast   yes     no    Total fcst
    yes               30     70       100
    no                20   2680      2700
    Total obs         50   2750      2800

Joint: The probability of two events in conjunction.
Pr(Tornado forecast AND Tornado observed) = 30/2800 = 0.01

Conditional: The probability of one variable given that the second is already determined.
Pr(Tornado observed | Tornado forecast) = 30/100 = 0.30

Marginal: The probability of one variable without regard to the other.
Pr(Yes forecast) = 100/2800 = 0.04
Pr(Yes obs) = 50/2800 = 0.02
Verification attribute examples
• Bias – (marginal distributions)
• Correlation – overall association (joint distribution)
• Accuracy – differences (joint distribution)
• Calibration – measures conditional bias (conditional distributions)
• Discrimination – degree to which forecasts discriminate between different observations (conditional distribution)
Some key things to think about …
Who…
– … wants to know?
What…
– … does the user care about?
– … kind of parameter are we evaluating? What are its characteristics (e.g., continuous, probabilistic)?
– … thresholds are important (if any)?
– … forecast resolution is relevant (e.g., site-specific, area-average)?
– … are the characteristics of the obs (e.g., quality, uncertainty)?
– … are appropriate methods?
Why…
– … do we need to verify it?
Some key things to think about…
How…
– …do you need/want to present results (e.g., stratification/aggregation)?
Which…
– …methods and metrics are appropriate?
– … methods are required (e.g., bias, event frequency, sample size)
Categorical Verification
Tara Jensen, NCAR/RAL/JNT
Contributions from Matt Pocernich, Eric Gilleland,
Tressa Fowler, Barbara Brown and others
Finley Tornado Data (1884)
Forecast answering the question: Will there be a tornado? (YES / NO)
Observation answering the question: Did a tornado occur? (YES / NO)
Answers fall into 1 of 2 categories: forecasts and obs are binary.
Finley Tornado Data (1884): Contingency Table

                Observed
  Forecast    Yes     No    Total
    Yes        28     72      100
    No         23   2680     2703
    Total      51   2752     2803
A Success?

                Observed
  Forecast    Yes     No    Total
    Yes        28     72      100
    No         23   2680     2703
    Total      51   2752     2803

Percent Correct = (28+2680)/2803 = 96.6% !!!!
What if the forecaster never forecasted a tornado?

                Observed
  Forecast    Yes     No    Total
    Yes         0      0        0
    No         51   2752     2803
    Total      51   2752     2803

Percent Correct = (0+2752)/2803 = 98.2% !!!!
Maybe accuracy is not the most informative statistic.
But the contingency table concept is good…
2 x 2 Contingency Table

                   Observed
  Forecast    Yes            No                  Total
    Yes       Hit            False Alarm         Forecast Yes
    No        Miss           Correct Negative    Forecast No
    Total     Obs. Yes       Obs. No             Total

Example: Accuracy = (Hits + Correct Negatives) / Total
Common Notation (however, not universal notation)

                Observed
  Forecast    Yes     No     Total
    Yes        a       b      a+b
    No         c       d      c+d
    Total     a+c     b+d      n

Example: Accuracy = (a+d)/n
What if data are not binary?
Temperature < 0 C
Precipitation > 1 inch
CAPE > 1000 J/kg
Ozone > 20 µg/m³
Winds at 80 m > 24 m/s
500 mb HGTS < 5520 m
Radar Reflectivity > 40 dBZ
MSLP < 990 hPa
LCL < 1000 ft
Cloud Droplet Concentration > 500/cc
Hint: Pick a threshold
that is meaningful
to your end-user
Contingency Table for Freezing Temps (i.e., T ≤ 0°C)

                 Observed
  Forecast    ≤ 0°C    > 0°C    Total
    ≤ 0°C       a        b       a+b
    > 0°C       c        d       c+d
    Total      a+c      b+d       n

Another example: Base Rate (aka sample climatology) = (a+c)/n
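As a minimal base-R sketch (hypothetical paired data), a continuous variable can be thresholded into this kind of 2 x 2 table and the base rate computed:

```r
# Hypothetical paired temperature forecasts and observations (deg C)
fcst <- c(-2.1, 0.5, -0.3, 3.2, -1.0, 2.5, 0.1, -4.0)
obs  <- c(-1.5, 1.0, -0.8, 2.0, 0.4, 3.1, -0.2, -3.5)

# 2 x 2 contingency table for the event T <= 0 C
tab <- table(Forecast = factor(fcst <= 0, levels = c(TRUE, FALSE)),
             Observed = factor(obs  <= 0, levels = c(TRUE, FALSE)))
tab

# Base rate (sample climatology) = (a + c)/n = fraction of observed events
base_rate <- mean(obs <= 0)
base_rate
```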
Alternative Perspective on Contingency Table
[Diagram: overlapping regions of Forecast = yes and Observed = yes; the overlap contains the Hits, the forecast-only region the False Alarms, the observed-only region the Misses, and everything outside both the Correct Negatives]
Conditioning to form a statistic
• Considers the probability of one event given another event
• Notation: p(X | Y=1) is the probability of X occurring given Y=1, or in other words Y = yes

Conditioning on Fcst provides:
• Info about how your forecast is performing
• Apples-to-oranges comparison if comparing stats from 2 models

Conditioning on Obs provides:
• Info about the ability of the forecast to discriminate between event and non-event – also called Conditional Probability or “Likelihood”
• Apples-to-apples comparison if comparing stats from 2 models
Conditioning on forecasts
(Forecast = yes: f = 1; Observed = yes: x = 1)

p(x=1 | f=1) = a/(a+b) = Fraction of Hits
p(x=0 | f=1) = b/(a+b) = False Alarm Ratio
Conditioning on observations
(Forecast = yes: f = 1; Observed = yes: x = 1)

p(f=1 | x=1) = a/(a+c) = Hit Rate
p(f=0 | x=1) = c/(a+c) = Fraction of Misses
What’s considered good?

Conditioning on Forecast
Fraction of hits – p(x=1 | f=1) = a/(a+b): close to 1
False Alarm Ratio – p(x=0 | f=1) = b/(a+b): close to 0

Conditioning on Observations
Hit Rate – p(f=1 | x=1) = a/(a+c): close to 1
[aka Probability of Detection Yes (PODy)]
Fraction of misses – p(f=0 | x=1) = c/(a+c): close to 0
Examples of categorical scores (most based on conditioning)
• Hit Rate (PODy) = a/(a+c)
• PODn = d/(b+d) = ( 1 – POFD)
• False Alarm Rate (POFD) = b/(b+d)
• False Alarm Ratio (FAR) = b/(a+b)
• (frequency) Bias (FBIAS) = (a+b)/(a+c)
• Threat Score or Critical Success Index = a/(a+b+c)
(POD = Probability of Detection; POFD = Probability of False Detection; CSI = Critical Success Index)
Examples of CTC calculations

                Observed
  Forecast    Yes     No    Total
    Yes        28     72      100
    No         23   2680     2703
    Total      51   2752     2803
Threat Score = 28 / (28 + 72 + 23) = 0.228
Probability of Detection = 28 / (28 + 23) = 0.55
False Alarm Ratio = 72 / (28 + 72) = 0.720
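A minimal base-R sketch of these contingency-table scores, using the Finley counts above (variable names a, b, c, d follow the common notation):

```r
# Finley (1884) tornado contingency table counts
a <- 28    # hits
b <- 72    # false alarms
c <- 23    # misses
d <- 2680  # correct negatives
n <- a + b + c + d

accuracy <- (a + d) / n          # fraction correct
pod      <- a / (a + c)          # hit rate (PODy)
far      <- b / (a + b)          # false alarm ratio
pofd     <- b / (b + d)          # probability of false detection (false alarm rate)
fbias    <- (a + b) / (a + c)    # frequency bias
csi      <- a / (a + b + c)      # threat score / critical success index

round(data.frame(Accuracy = accuracy, POD = pod, FAR = far,
                 POFD = pofd, FBIAS = fbias, CSI = csi), 3)
```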
Skill Scores
How do you compare the skill for easy-to-predict events with that for difficult-to-predict events?
• Provides a single value to summarize performance.
• Reference forecast – best naive guess; persistence; climatology.
• Reference forecast must be comparable.
• Perfect forecast implies that the object can be perfectly observed.
Generic Skill Score

SS = (A − A_ref) / (A_perf − A_ref)

where A = any measure, ref = reference (e.g., climatology, persistence), perf = perfect.

Positively oriented; 1 is optimal.

Example (where MSE = Mean Squared Error):

SS = 1 − MSE / MSE_climo

Climo could be a separate forecast or a gridded forecast sample climatology.
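A minimal sketch (hypothetical numbers) of an MSE-based skill score against a sample-climatology reference:

```r
# Hypothetical forecasts, observations, and a climatological reference
obs   <- c(12.1, 14.3, 15.0, 13.2, 16.8, 11.9)
fcst  <- c(11.5, 15.0, 14.2, 13.9, 17.5, 12.4)
climo <- mean(obs)                 # simple sample climatology used as the reference forecast

mse     <- mean((fcst  - obs)^2)   # error of the forecast
mse_ref <- mean((climo - obs)^2)   # error of the reference

skill_score <- 1 - mse / mse_ref   # 1 = perfect, 0 = no better than the reference
skill_score
```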
Commonly Used Skill Scores
• Gilbert Skill Score – based on the CSI, corrected for the number of hits that would be expected by chance.
• Heidke Skill Score – based on Accuracy, corrected for the number of hits that would be expected by chance.
• Hanssen-Kuipers Discriminant (Peirce Skill Score) – measures the ability of the forecast to discriminate between (or correctly classify) events and non-events. H-K = POD − POFD
• Brier Skill Score for probabilistic forecasts
• Fractions Skill Score for neighborhood methods
• Intensity-Scale Skill Score for wavelet methods
Accounting for Uncertainty
• Observational
• Model
– Model parameters
– Physics
– Verification scores
• Sampling
– Verification statistic is a realization of a random process
– What if the experiment were re-run under identical conditions?
When should sampling variability be considered?
• When you are comparing two forecasts of the same event, evaluate the differences.
• Sampling variability is large and can quickly overwhelm small but meaningful differences.
[Figure: 6-hr accumulated precipitation from Model 1, Model 2, and the observation]
Confidence Intervals (CIs)
“If we re-run the experiment N times, and create N (1−α)100% CIs, then we expect the true value of the parameter to fall inside (1−α)100% of the intervals.”
Confidence intervals can be parametric or non-parametric…
Confidence Intervals (CI’s)
• Parametric
– Assume the observed sample is a realization from a known population distribution with possibly unknown parameters (e.g., normal).
– Normal approximation CI’s are most common.
– Quick and easy.
How to calculate Normal Approximation CIs

Example: Let X1, …, Xn be an independent and identically distributed (iid) sample from a normal distribution with variance σ². Then X̄ = (1/n) Σ Xi is an estimate of the mean of the sample. A (1−α)100% CI for the mean is given by

X̄ ± z_(α/2) · σ / √n

Note: You can find much more about these ideas in any basic statistics textbook.
Uncertainty

                Observed
  Forecast    Yes     No    Total
    Yes        28     72      100
    No         23   2680     2703
    Total      51   2752     2803

Hit rate = 0.55; 95% normal approximation CI ≈ (0.41, 0.69)
FAR = 0.72; 95% normal approximation CI ≈ (0.63, 0.81)
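A minimal sketch of the normal-approximation CI applied to proportion-type statistics such as the hit rate and FAR (using the usual binomial-proportion standard error and the Finley counts above):

```r
# Normal-approximation (1 - alpha) CI for a proportion: p-hat +/- z * sqrt(p(1-p)/n)
prop_ci <- function(p_hat, n, alpha = 0.05) {
  z  <- qnorm(1 - alpha / 2)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Finley table: hit rate = 28/51 (51 observed events), FAR = 72/100 (100 yes forecasts)
round(prop_ci(28 / 51, 51), 2)    # approx (0.41, 0.69)
round(prop_ci(72 / 100, 100), 2)  # approx (0.63, 0.81)
```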
Confidence Intervals (CI’s)
• Nonparametric
– Assume the distribution of the observed sample is representative of the population distribution.
– Bootstrap CI’s are most common.
– Can be computationally intensive, but easy enough.
IID Bootstrap Algorithm – (Nonparametric) Bootstrap CIs

1. Resample with replacement from the sample X1, …, Xn.
2. Calculate the verification statistic(s) of interest from the resample in step 1.
3. Repeat steps 1 and 2 many times, say B times, to obtain a sample of the verification statistic(s) θB.
4. Estimate (1−α)100% CIs from the sample in step 3.
[Figure: empirical distribution (histogram) of the statistic θB calculated on the repeated samples; the 5% tails on each side give the bounds for a 90% CI]
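A minimal base-R sketch of the IID bootstrap algorithm above, applied to the hit rate of hypothetical yes/no forecasts:

```r
set.seed(42)

# Hypothetical paired binary (1 = yes, 0 = no) forecasts and observations
n    <- 200
obs  <- rbinom(n, 1, 0.3)
fcst <- ifelse(runif(n) < 0.8, obs, 1 - obs)   # forecasts that mostly agree with the obs

pod <- function(f, x) sum(f == 1 & x == 1) / sum(x == 1)   # hit rate

# IID bootstrap: resample pairs with replacement and recompute the statistic B times
B <- 1000
boot_pod <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  pod(fcst[idx], obs[idx])
})

# Percentile-based 90% CI from the bootstrap distribution
quantile(boot_pod, c(0.05, 0.95))
```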
Verification of Continuous Forecasts
Presented by Tressa L. Fowler
Adapted from presentations created by Barbara Casati and Barbara Brown
• Exploratory methods
– Scatter plots
– Discrimination plots
– Box plots
• Statistics
– Bias
– Error statistics
– Robustness
– Comparisons
Exploratory methods: joint distribution
Scatter-plot: plot of observation versus forecast values
Perfect forecast: forecast = obs, so points should lie on the 45° diagonal
Provides information on: bias, outliers, error magnitude, linear association, peculiar behaviours in extremes, misses and false alarms (link to contingency table)
Exploratory methods: marginal distribution
Quantile-quantile plots: OBS quantile versus the corresponding FRCS quantile
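A minimal base-R sketch (hypothetical data) producing the scatter-plot and qq-plot described above:

```r
# Hypothetical paired temperature forecasts and observations
set.seed(1)
obs  <- rnorm(200, mean = 15, sd = 5)
fcst <- obs + rnorm(200, mean = -1, sd = 2)   # forecasts with a small negative bias

# Scatter-plot of the joint distribution, with the 45-degree diagonal
plot(fcst, obs, xlab = "Forecast", ylab = "Observation", main = "Scatter-plot")
abline(0, 1)

# Quantile-quantile plot comparing the marginal distributions
qqplot(fcst, obs, xlab = "Forecast quantiles", ylab = "Observed quantiles",
       main = "QQ-plot")
abline(0, 1)
```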
Scatter-plot and qq-plot: example 1
Q: Is there any bias? Positive (over-forecast) or negative (under-forecast)?
Scatter-plot and qq-plot: example 2
Describe the peculiar behaviour of low temperatures
Scatter-plot: example 3
Describe how the error varies as the temperature grows (note the outlier).
Scatter-plot and Contingency Table
Does the forecast correctly detect temperatures above 18 degrees?
Does the forecast correctly detect temperatures below 10 degrees?
Example Box (and Whisker) Plot
Exploratory methods: marginal distributions
Visual comparison: histograms, box-plots, …

Summary statistics:
• Location: mean = X̄ = (1/n) Σ xi; median = q0.5
• Spread: standard deviation = sqrt[ (1/(n−1)) Σ (xi − X̄)² ]; Inter-Quartile Range IQR = q0.75 − q0.25

          MEAN    MEDIAN   STDEV   IQR
  OBS     20.71   20.25    5.18    8.52
  FRCS    18.62   17.00    5.99    9.75
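A minimal base-R sketch (hypothetical data) of these summary statistics and the visual comparison of marginal distributions:

```r
# Hypothetical forecast and observation samples
set.seed(7)
obs  <- rnorm(500, mean = 20.7, sd = 5.2)
frcs <- rnorm(500, mean = 18.6, sd = 6.0)

summ_stats <- function(x) c(MEAN = mean(x), MEDIAN = median(x),
                            STDEV = sd(x), IQR = IQR(x))

round(rbind(OBS = summ_stats(obs), FRCS = summ_stats(frcs)), 2)

# Side-by-side box plots for a visual comparison of the marginal distributions
boxplot(list(OBS = obs, FRCS = frcs), main = "Marginal distributions")
```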
Exploratory methods:
conditional distributions
Conditional histogram and
conditional box-plot
Exploratory methods:
conditional qq-plot
Continuous scores: linear bias

linear bias = Mean Error = (1/n) Σ_{i=1..n} (f_i − o_i) = f̄ − ō

Mean Error = average of the errors = difference between the means
It indicates the average direction of the error: positive bias indicates over-forecast, negative bias indicates under-forecast.
It does not indicate the magnitude of the error (positive and negative errors can cancel out).

Attribute: measures the bias
Mean Absolute Error

MAE = (1/n) Σ_{i=1..n} |f_i − o_i|

Average of the magnitude of the errors
Linear score = each error has the same weight
It does not indicate the direction of the error, just the magnitude

Attribute: measures accuracy
Median Absolute Deviation

MAD = median{ |f_i − o_i| }

Median of the magnitude of the errors
Very robust: extreme errors have no effect

Attribute: measures accuracy
Continuous scores: MSE

MSE = (1/n) Σ_{i=1..n} (f_i − o_i)²

Average of the squared errors: it measures the magnitude of the error, weighted on the squares of the errors
It does not indicate the direction of the error
Quadratic rule, therefore large weight on large errors:
– good if you wish to penalize large errors
– sensitive to large values (e.g., precipitation) and outliers; sensitive to large variance (high-resolution models); encourages conservative forecasts (e.g., climatology)

Attribute: measures accuracy
Continuous scores: RMSE

RMSE = √MSE = sqrt[ (1/n) Σ_{i=1..n} (f_i − o_i)² ]

RMSE is the square root of the MSE: it measures the magnitude of the error while retaining the units of the variable (e.g., °C)
Similar properties to MSE: it does not indicate the direction of the error; it is defined with a quadratic rule = sensitive to large values, etc.
NOTE: RMSE is always larger than or equal to the MAE

Attribute: measures accuracy
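A minimal base-R sketch (hypothetical data) computing the continuous scores above:

```r
# Hypothetical paired forecasts and observations
obs  <- c(10.2, 12.5, 14.8, 9.9, 16.3, 13.1, 11.7, 15.4)
fcst <- c(11.0, 12.0, 16.1, 9.0, 17.2, 12.5, 12.3, 14.9)

err <- fcst - obs

me      <- mean(err)          # mean error (linear bias)
mae     <- mean(abs(err))     # mean absolute error
mad_err <- median(abs(err))   # median absolute deviation of the errors
mse     <- mean(err^2)        # mean squared error
rmse    <- sqrt(mse)          # root mean squared error (same units as the variable)

round(data.frame(ME = me, MAE = mae, MAD = mad_err, MSE = mse, RMSE = rmse), 3)
```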
[Figure: verification score as a function of forecast lead time (24, 48, 72, 96, 120 h) for Model 1 and Model 2]
Continuous scores: linear correlation

r_XY = [ (1/n) Σ_{i=1..n} (y_i − ȳ)(x_i − x̄) ] / sqrt[ (1/n) Σ (y_i − ȳ)² · (1/n) Σ (x_i − x̄)² ] = cov(Y, X) / (s_Y s_X)

Measures the linear association between forecast and observation
Y and X rescaled (non-dimensional) covariance: ranges in [−1, 1]
It is not sensitive to the bias
Not robust = better if data are normally distributed
Not resistant = sensitive to large values and outliers

Attribute: measures association
Scores for continuous forecasts
Simplest overall measure of performance: the correlation coefficient

r_fx = Cov(f, x) / sqrt[ Var(f) Var(x) ] = Σ (f_i − f̄)(x_i − x̄) / [ (n−1) s_f s_x ]
Continuous scores: anomaly correlation
• Correlation calculated on anomalies.
• An anomaly is the difference between what was forecast (observed) and climatology.
• Centered or uncentered versions.
MSE and bias correction
• MSE is the sum of the squared bias (ME²) and the variance of the forecast − observation differences, so the bias-corrected MSE is MSE − ME².

MSE = (f̄ − ō)² + s_f² + s_o² − 2 s_f s_o r_fo

MSE = ME² + var(f − o)
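A minimal sketch (hypothetical data) checking the decomposition MSE = ME² + var(f − o), using the 1/n form of the variance:

```r
obs  <- c(10.2, 12.5, 14.8, 9.9, 16.3, 13.1, 11.7, 15.4)
fcst <- c(11.0, 12.0, 16.1, 9.0, 17.2, 12.5, 12.3, 14.9)

err <- fcst - obs
mse <- mean(err^2)
me  <- mean(err)

var_n <- function(x) mean((x - mean(x))^2)   # variance with 1/n (population form)

mse                  # mean squared error
me^2 + var_n(err)    # equals MSE: squared bias plus error variance
mse - me^2           # bias-corrected MSE
```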
Continuous skill scores: good practice rules
• Use the same climatology for the comparison of different models.
• When evaluating the reduction of variance, the sample climatology always gives a worse skill score than the long-term climatology: always ask which climatology was used to evaluate the skill.
• If the climatology is calculated by pooling together data from many different stations and times of the year, the skill score will be better than if a different climatology is used for each station and month of the year.
• In the former case the model gets credit for correctly forecasting seasonal trends and the climatologies of specific locations.
• In the latter case the specific topographic effects and long-term trends are removed, and the discriminating capability of the forecast is better evaluated. Choose the climatology appropriate for your verification purposes.
• Persistence forecast: use the same time of the day to avoid diurnal cycle effects.
Continuous Scores of Ranks
Problem: Continuous scores are sensitive to large values, i.e., not robust.
Solution: Use the ranks of the variable, rather than its actual values.
The value-to-rank transformation:
• diminishes effects due to large values
• transforms the distribution to a Uniform distribution
• removes bias
Rank correlation is the most common rank-based score.

  Temp (°C)   22.3  24.6  25.5  19.8  23.1  24.2  21.7  27.4
  rank           3     6     7     1     4     5     2     8
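A minimal sketch contrasting ordinary and rank (Spearman) correlation on hypothetical data with an outlier:

```r
# Hypothetical paired observations and forecasts; the last forecast is an outlier
obs  <- c(22.3, 24.6, 25.5, 19.8, 23.1, 24.2, 21.7, 27.4)
fcst <- c(21.0, 25.1, 24.0, 20.5, 22.0, 26.0, 22.5, 45.0)

rank(obs)   # the value-to-rank transformation shown in the table above

cor(fcst, obs)                        # Pearson correlation: affected by the outlier
cor(fcst, obs, method = "spearman")   # rank (Spearman) correlation: more resistant
```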
Evaluation of Probability Forecasts
Barbara Brown, Joint Numerical Testbed
NCAR, Boulder, CO
July 2011
Acknowledgments: Tom Hamill, Laurence Wilson, Tressa Fowler
Questions to ask before beginning?
• How were the probability forecasts constructed?
– Subjective forecasts (i.e., human generated)
– Statistical methods
– Ensemble forecasts (i.e., model based)
• What are the “events” being forecasted?
– Often the “event” is confused with the forecast
• Multi-category or dichotomous?
– Extended methods needed for multi-category
• How are your forecasts used?
– Kinds of decisions and decision makers (decision making “systems”)
Dichotomous variables
• Observations have 2 possible values. Examples:
– Rain / no rain
– Temperature > 40°C vs. ≤ 40°C
• Forecasts can
– Have multiple values (e.g., 0, 0.1, 0.2, …, 1)
– Be continuous between 0 and 1
• Probability forecasts are a special form of continuous or categorical forecast
• Extension to multiple categories: Each observed category is assigned a forecast probability
Verifying a probabilistic forecast
• You cannot verify a probabilistic forecast with a single observation.
• The more data you have for verification, the more certain you are (as is true in general for other statistical measures).
• Rare events (low probability) require more data to verify.
The Brier Score

BS = (1/n) Σ_{k=1..n} (f_k − x_k)²

• Analogous to MSE…
– The observation, x, takes on values of 0 and 1
– The forecast, f, is a probability value
• Measures the average squared error in the probabilities
– Large errors result in large penalties
Brier skill score

BSS = (BS_ref − BS) / (BS_ref − 0) = 1 − BS / BS_ref

The Brier Skill Score (BSS) measures the relative improvement of the forecasts over a reference forecast.
Typically, the reference forecast is the “sample climatology” – i.e., the frequency with which the “event” actually occurred.
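A minimal sketch (simulated probability forecasts) of the Brier score and the Brier skill score with a sample-climatology reference:

```r
set.seed(3)
n   <- 500
obs <- rbinom(n, 1, 0.3)                                   # binary (0/1) observations
# Simulated probability forecasts that carry some information about the outcome
fcst <- pmin(pmax(0.3 + 0.4 * (obs - 0.3) + rnorm(n, 0, 0.15), 0), 1)

bs     <- mean((fcst - obs)^2)        # Brier score
bs_ref <- mean((mean(obs) - obs)^2)   # reference: constant sample-climatology forecast
bss    <- 1 - bs / bs_ref             # Brier skill score

c(BS = bs, BS_ref = bs_ref, BSS = bss)
```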
Decomposition of the Brier Score
• Decomposition is based on “categories” of probability values
• Reliability and Resolution are measures of forecast performance
• Uncertainty depends only on the observations– Measure of forecast “difficulty”
BS = (1/n) Σ_{i=1..I} N_i (f_i − x̄_i)²  −  (1/n) Σ_{i=1..I} N_i (x̄_i − x̄)²  +  x̄ (1 − x̄)
            Reliability                        Resolution                     Uncertainty

(The n forecasts are grouped into I probability categories; N_i is the number of forecasts in category i, f_i the forecast probability of category i, x̄_i the observed relative frequency in category i, and x̄ the overall observed frequency.)
Brier Skill Score

When the reference forecast is the sample climatology, BS_ref equals the Uncertainty term, and the decomposition gives

BSS = (Resolution − Reliability) / Uncertainty = (RES − REL) / UNC
Components of the Brier Score
• Reliability: measures how well the conditional relative frequency of events matches the forecast probabilities
• Resolution: measures how well the forecasts distinguish situations with different frequencies of occurrence
• Uncertainty: measures the variability in the observations (i.e., the difficulty of the forecast situations)
Reliability = (1/n) Σ_{i=1..I} N_i (f_i − x̄_i)²
Resolution = (1/n) Σ_{i=1..I} N_i (x̄_i − x̄)²
Uncertainty = x̄ (1 − x̄)
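A minimal sketch (simulated forecasts binned by probability value) computing the reliability, resolution, and uncertainty terms and recovering the Brier score:

```r
set.seed(11)
n   <- 1000
obs <- rbinom(n, 1, 0.3)
# Simulated forecasts restricted to multiples of 0.1 (the probability categories)
fcst <- round(pmin(pmax(0.3 + 0.4 * (obs - 0.3) + rnorm(n, 0, 0.15), 0), 1), 1)

x_bar <- mean(obs)                   # overall observed relative frequency
cats  <- sort(unique(fcst))          # forecast probability categories i = 1..I

N_i    <- sapply(cats, function(p) sum(fcst == p))         # forecasts in category i
xbar_i <- sapply(cats, function(p) mean(obs[fcst == p]))   # observed frequency in category i

reliability <- sum(N_i * (cats   - xbar_i)^2) / n
resolution  <- sum(N_i * (xbar_i - x_bar)^2) / n
uncertainty <- x_bar * (1 - x_bar)

c(REL = reliability, RES = resolution, UNC = uncertainty,
  BS_from_decomposition = reliability - resolution + uncertainty,
  BS_direct = mean((fcst - obs)^2))
```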
Properties of a perfect probabilistic forecast of a binary event:
[Figure: forecast frequency distributions conditioned on observed events and observed non-events, illustrating sharpness, resolution, and reliability]
Our friend, the scatterplot
Introducing the reliability diagram! (a close relative of the attribute diagram)
• Analogous to the scatter plot – the same intuition holds
• Data must be binned!
• Hides how much data is represented by each category
• Expresses conditional probabilities
• Confidence intervals can illustrate the problems with small sample sizes
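A minimal base-R sketch (simulated data) of the binning behind a reliability diagram:

```r
# Simulated probability forecasts and binary observations
set.seed(5)
obs  <- rbinom(2000, 1, 0.3)
fcst <- pmin(pmax(0.3 + 0.4 * (obs - 0.3) + rnorm(2000, 0, 0.15), 0), 1)

# Bin the forecasts and compute the observed frequency in each bin
bins   <- cut(fcst, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
f_mean <- tapply(fcst, bins, mean)   # mean forecast probability per bin
o_freq <- tapply(obs,  bins, mean)   # observed relative frequency per bin

plot(f_mean, o_freq, xlim = c(0, 1), ylim = c(0, 1), pch = 19,
     xlab = "Forecast probability", ylab = "Observed relative frequency",
     main = "Reliability diagram")
abline(0, 1, lty = 2)                # perfect reliability line
lines(f_mean, o_freq)
```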
Reliability Diagram
[Figure from the Eumetcal module on forecast verification]

Reliability Diagram Characteristics
[Figure from the Eumetcal module on forecast verification]
Reliability Diagram Characteristics
[Figure panels illustrating characteristic reliability-diagram shapes: probabilities under-forecast; no skill; perfect categorical forecast; tends to the mean, some skill; too few samples; relatively reliable, rare event; no resolution; over-resolved forecast; typical categorical forecast]
Sharpness is also important
• “Sharpness” measures the specificity of the probability forecasts
• Given two reliable forecast systems, the one producing the sharper forecasts is preferable.
• Sharpness without reliability implies unrealistic confidence.
• Sharpness ≠ Resolution.
• Sharpness is a function of the forecasts only.
(From the Eumetcal module on forecast verification)
Discrimination
Measures the ability of forecasts to distinguish situations leading to the occurrence and non-occurrence of an event.
Depends on:
• Separation of the means of the conditional distributions
• Variance within the conditional distributions
[Figure: forecast frequency distributions for observed events and observed non-events in three cases, labeled good discrimination, poor discrimination, and good discrimination]
Receiver Operating Characteristic (ROC)
• Another approach for examining discrimination between events and non-events
• Formed by setting multiple thresholds on the forecast value
– For each threshold, treat the forecast as categorical (i.e., Yes/No)
– Analogous to setting “decision thresholds” on the probabilities
• For each threshold compute POD and POFD (often called the “hit rate” and the “false alarm rate”)
• Plot the POD and POFD values for each threshold against each other as a scatter plot
• ROCs do not take into account reliability – they measure “potential skill”
– Need to examine reliability in addition
– Allows comparison of forecasts with different biases
• Typically used for probability forecasts, but can be used for any forecast that can be thresholded
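A minimal base-R sketch (simulated data) of an empirical ROC curve and its area, thresholding the probability forecasts as described above:

```r
# Simulated probability forecasts and binary observations
set.seed(9)
obs  <- rbinom(1000, 1, 0.3)
fcst <- pmin(pmax(0.3 + 0.4 * (obs - 0.3) + rnorm(1000, 0, 0.2), 0), 1)

thresholds <- seq(0, 1, by = 0.05)
pod  <- sapply(thresholds, function(t) sum(fcst >= t & obs == 1) / sum(obs == 1))
pofd <- sapply(thresholds, function(t) sum(fcst >= t & obs == 0) / sum(obs == 0))

plot(pofd, pod, type = "b", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "POFD (false alarm rate)", ylab = "POD (hit rate)",
     main = "Empirical ROC curve")
abline(0, 1, lty = 2)   # no-skill diagonal

# Area under the ROC curve by the trapezoidal rule (perfect = 1, random = 0.5)
o   <- order(pofd)
auc <- sum(diff(pofd[o]) * (pod[o][-1] + pod[o][-length(pod)]) / 2)
auc
```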
Empirical ROC Curve
[Figure: ROC curve; the diagonal line represents no skill (a hit is just as likely as a false alarm), and a perfect forecast reaches the upper-left corner]
If the curve falls under the diagonal, the forecast is worse than a random guess.
The area under the ROC curve (AUC) is a useful summary measure of skill: Perfect = 1, Random = 0.5.
Useful references
• Good overall references for forecast verification:
– (1) Wilks, D.S., 2006: Statistical Methods in the Atmospheric Sciences (2nd Ed). Academic Press, 627 pp.
– (2) WMO Verification working group forecast verification web page, http://www.cawcr.gov.au/projects/verification/
– (3) Jolliffe, I.T., and D.B. Stephenson, 2003: Forecast Verification. A Practitioner's Guide in Atmospheric Science. Wiley and Sons Ltd, 240 pp.
• Rank histograms: Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550-560.
• Spread-skill relationships: Whitaker, J.S., and A.F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292-3302.
• Brier score, continuous ranked probability score, reliability diagrams: Wilks text again.
• Relative operating characteristic: Harvey, L.O., Jr, and others, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863-883.
• Economic value diagrams:
– (1) Richardson, D.S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Royal Meteor. Soc., 126, 649-667.
– (2) Zhu, Y., and others, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73-83.
• Overestimating skill: Hamill, T.M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology? Quart. J. Royal Meteor. Soc., Jan 2007 issue. http://tinyurl.com/kxtct
Spatial Verification Methods
Barbara Brown ([email protected])
National Center for Atmospheric Research (NCAR), Boulder, Colorado
Collaborators: Randy Bullock, John Halley Gotway, David Ahijevych, Eric Gilleland, Beth Ebert,
Barbara Casati
July 2011
Challenge of High Resolution
[Figure: examples of 12-h accumulated precipitation forecasts – THEN: 190-km LFM, 1977; NOW: 3-km WRF, 2009 (Fawcett, BAMS)]
Traditional approach
Consider gridded forecasts and observations of precipitation. Which is better?
[Figure: observed precipitation field (OBS) and five candidate forecasts (1-5)]
Traditional approach
[Figure: the same observed field and forecasts 1-5]

Scores for Examples 1-4:
Correlation Coefficient = -0.02
Probability of Detection = 0.00
False Alarm Ratio = 1.00
Hanssen-Kuipers = -0.03
Gilbert Skill Score (ETS) = -0.01

Scores for Example 5:
Correlation Coefficient = 0.2
Probability of Detection = 0.88
False Alarm Ratio = 0.89
Hanssen-Kuipers = 0.69
Gilbert Skill Score (ETS) = 0.08

Forecast 5 is “best”
Traditional approach
[Figure: the same observed field and forecasts 1-5]

Some problems with the traditional approach:
(1) Non-diagnostic – doesn’t tell us what was wrong with the forecast, or what was right
(2) Ultra-sensitive to small errors in the simulation of localized phenomena
Spatial forecasts
Weather variables (e.g., precipitation) defined over spatial domains have coherent structure and features.
Spatial methods aim to:
• Account for uncertainties in timing and location
• Account for spatial structure
• Provide information on error in physical terms
• Provide information that is
– Diagnostic
– Meaningful to forecast users
Spatial Method Categories
New spatial verification approaches:
• Neighborhood – give credit to "close" forecasts
• Scale separation – measure scale-dependent error
• Field deformation – measure distortion and displacement (phase error) for the whole field; how should the forecast be adjusted to make the best match with the observed field?
• Object- and feature-based – evaluate attributes of identifiable features
Scale separation methods
• Goal:
Examine performance as a function of spatial scale
• Example: Power spectra
– Does it look real?
– Harris et al. (2001) compared multi-scale statistics for model and radar data
[Figure from Harris et al. 2001]
Scale separation methods
Example methods:
• Intensity-scale (Casati et al. 2004)
• Multi-scale variability (Zepeda-Arce et al. 2000; Harris et al. 2001; Mittermaier 2006)
• Variogram (Marzban and Sandgathe 2009)
Neighborhood verification
Goal: Examine forecast performance in a region; don’t require exact matches.
• Also called “fuzzy” verification
• Example: Upscaling (see the sketch below)
– Put observations and/or forecast on a coarser grid
– Calculate traditional metrics
• Provides information about the scales at which the forecasts have skill
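A minimal sketch of the upscaling idea, using assumed toy fields: block-average the forecast and observed grids onto a coarser grid before computing a traditional score (RMSE here):

```r
# Hypothetical 12 x 12 forecast and observed precipitation grids
set.seed(21)
obs  <- matrix(rgamma(144, shape = 0.6, scale = 4), nrow = 12)
fcst <- obs[c(2:12, 1), ]   # the same field shifted by one grid box (a displacement error)

# Upscale by block-averaging k x k boxes onto a coarser grid
upscale <- function(field, k) {
  n   <- nrow(field) / k
  out <- matrix(NA_real_, n, n)
  for (i in 1:n) for (j in 1:n) {
    out[i, j] <- mean(field[((i - 1) * k + 1):(i * k), ((j - 1) * k + 1):(j * k)])
  }
  out
}

# Traditional score on the original and the upscaled grids
rmse <- function(f, o) sqrt(mean((f - o)^2))
rmse(fcst, obs)                            # fine scale: penalized for the displacement
rmse(upscale(fcst, 4), upscale(obs, 4))    # coarser scale: typically a smaller error
```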
Neighborhood methods
Example methods:
• Distribution approach (Marsigli)
• Fractions Skill Score (Roberts 2005; Roberts and Lean 2008; Mittermaier and Roberts 2009)
• Multiple approaches (Ebert 2008, 2009) (e.g., upscaling, multi-event contingency table, practically perfect)
Field deformation
Goal: Examine how much a forecast field needs to be transformed in order to match the observed field.
Field deformation methods
Example methods:
• Forecast Quality Index (Venugopal et al. 2005)
• Forecast Quality Measure / Displacement Amplitude Score (Keil and Craig 2007, 2009)
• Image Warping (Gilleland et al. 2009; Lindström et al. 2009; Engel 2009)
[Figure from Keil and Craig 2008]
Object/Feature-based
Goals:
1. Identify relevant features in the forecast and observed fields
2. Compare attributes of the forecast and observed features
[Figure: MODE example, 2008]
Object/Feature-based
Example methods:
• Cluster analysis (Marzban and Sandgathe 2006a,b)
• Composite (Nachamkin 2005, 2009)
• Contiguous Rain Area (CRA) (Ebert and McBride 2000; Ebert and Gallus 2009)
• MODE (Davis et al. 2006, 2009)
• Procrustes (Micheas et al. 2007; Lack et al. 2009)
• SAL (Wernli et al. 2008, 2009)
[Figures: composite centered on all observed events (Nachamkin); CRA example (Ebert and Gallus)]
Limitations: Filtering (Neighborhood and Scale separation)
• Does not clearly isolate specific errors (e.g., displacement, amplitude, structure)

Limitations: Displacement methods (feature-based, field deformation)
• May have somewhat arbitrary matching criteria
• Often many parameters to be defined
• More research needed on diagnosing mesoscale structure
Strengths – Filtering (Neighborhood & Scale separation)
• Accounts for
– Unpredictable scales
– Uncertainty in observations
• Simple – ready to go
• Evaluates different aspects of a forecast (e.g., texture)
• Scale-dependent skill

Strengths – Displacement
• Feature-based
– Credit for close forecasts
– Measures displacement, structure
• Field deformation
– Distinguishes aspect ratio and orientation angle errors
– Credit for close forecasts
What do the new methods measure?

  Attribute                    Traditional   Feature-based   Neighborhood   Scale        Field deformation
  Perf. at different scales    Indirectly    Indirectly      Yes            Yes          No
  Location errors              No            Yes             Indirectly     Indirectly   Yes
  Intensity errors             Yes           Yes             Yes            Yes          Yes
  Structure errors             No            Yes             No             No           Yes
  Hits, etc.                   Yes           Yes             Yes            Indirectly   Yes
Back to the original example… What can the new methods tell us?
Example:
• MODE “interest” measures the overall ability of forecasts to match obs
• Interest values provide more intuitive estimates of performance than the traditional measure (ETS)
• But note: even for spatial methods, single measures don’t tell the whole story!
Final comments
• Benefits of spatial methods
– Provide potential for greater insight into forecast performance
– Provide more meaningful comparisons of forecast performance
• Limitations
– Require gridded forecasts and observations
– May require setting many parameters
– Somewhat difficult to implement
Information resources
• Many references and other information: http://www.rap.ucar.edu/projects/icp/index.html
• Software is available for many of the methods – see the website above
• MET (see lectures Thursday) includes several methods [MODE, intensity-scale (wavelet), neighborhood]; an R package includes intensity-scale