Traditional Verification Scores Fake forecasts 5 geometric 7 perturbed subjective evaluation ...

Traditional Verification Scores

Fake forecasts: 5 geometric, 7 perturbed

Subjective evaluation: expert scores from last year's workshop, 9 cases x 3 models

Geometric

error/scores

score                      first 4 cases   case 5
correlation coefficient    -0.02           0.2
prob of detection           0.00           0.88
false alarm ratio           1.00           0.89
Hanssen & Kuipers          -0.03           0.69
equitable threat           -0.01           0.08
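The categorical scores listed above all derive from a 2x2 contingency table of hits, misses, false alarms and correct negatives. The following is a minimal sketch using the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the slides.

```python
def categorical_scores(hits, misses, false_alarms, correct_negatives):
    """Return POD, FAR, Hanssen & Kuipers and equitable threat score."""
    n = hits + misses + false_alarms + correct_negatives
    observed = hits + misses                 # points where the event was observed
    forecast = hits + false_alarms           # points where the event was forecast
    non_events = false_alarms + correct_negatives

    def ratio(num, den):
        # Undefined ratios (zero denominator) are returned as NaN.
        return num / den if den else float("nan")

    pod = ratio(hits, observed)              # probability of detection
    far = ratio(false_alarms, forecast)      # false alarm ratio
    pofd = ratio(false_alarms, non_events)   # probability of false detection
    hk = pod - pofd                          # Hanssen & Kuipers discriminant
    hits_random = ratio(observed * forecast, n)   # hits expected by chance
    ets = ratio(hits - hits_random,
                hits + misses + false_alarms - hits_random)
    return {"pod": pod, "far": far, "hk": hk, "ets": ets}


# Hypothetical counts: forecast and observed features do not overlap at all.
no_overlap = categorical_scores(hits=0, misses=50, false_alarms=50,
                                correct_negatives=900)
```

With no overlap, POD is 0 and FAR is 1 however close the forecast feature lies to the observed one, and HK and ETS fall slightly below zero. This is the pattern the first 4 geometric cases show.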

THE WINNER

[Figure: geometric cases 1 to 5]

Perturbed fake cases – known errors

1. 3 pts right, 5 pts down





6. 12 pts right, 20 pts down, times 1.5

7. 12 pts right, 20 pts down, minus 0.05”
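A perturbation of this kind (shift, then scale or subtract) could be sketched as follows. This is an illustration only: the function name `perturb` is hypothetical, and the slide does not say how grid edges or negative amounts were handled. The sketch assumes rows increase downward, columns increase to the right, points shifted in from outside the grid are zero, and negative amounts are reset to zero.

```python
def perturb(field, right, down, factor=1.0, offset=0.0):
    """Shift a 2-D field, then apply a multiplicative and/or additive change.

    field is a list of rows. All edge and clipping rules are assumptions.
    """
    n_rows, n_cols = len(field), len(field[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            src_r, src_c = r - down, c - right   # where this point came from
            if 0 <= src_r < n_rows and 0 <= src_c < n_cols:
                value = field[src_r][src_c] * factor + offset
                out[r][c] = max(value, 0.0)      # no negative amounts
    return out


# Tiny hypothetical field with a single wet point at row 1, column 1.
obs = [[0.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0]]
scaled = perturb(obs, right=1, down=2, factor=1.5)
reduced = perturb(obs, right=1, down=2, offset=-0.05)
```

Under these assumptions, case 6 would correspond to `perturb(obs, right=12, down=20, factor=1.5)` and case 7 to `perturb(obs, right=12, down=20, offset=-0.05)` on the full grid.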

Perturbed fake cases 1-3

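[Figure: observation (obs) and perturbed cases 1, 2, 3]

[Figure: multiplicative bias and Gilbert skill score (ETS) for perturbed cases 1 to 7; thresholds >0, >=0.01", >=0.02", >=0.03"; cases 4, 5 and 6, 7 are labeled on the plots]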
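The slides do not define the bias shown per threshold. One common reading is the frequency bias: the number of forecast points exceeding a threshold divided by the number of observed points exceeding it. The sketch below assumes that reading; the function name and the small fields are hypothetical.

```python
def frequency_bias(forecast, observed, threshold, strict=False):
    """Ratio of forecast to observed grid points exceeding a threshold."""
    def exceeds(value):
        return value > threshold if strict else value >= threshold

    n_fcst = sum(exceeds(v) for row in forecast for v in row)
    n_obs = sum(exceeds(v) for row in observed for v in row)
    return n_fcst / n_obs if n_obs else float("nan")


# Hypothetical observed amounts (inches) and a forecast that is 1.5 times larger.
observed = [[0.0, 0.005, 0.015],
            [0.0, 0.025, 0.05],
            [0.0, 0.0, 0.0]]
forecast = [[1.5 * v for v in row] for row in observed]

# Thresholds from the slide: >0, >=0.01", >=0.02", >=0.03"
thresholds = [(0.0, True), (0.01, False), (0.02, False), (0.03, False)]
biases = [frequency_bias(forecast, observed, t, strict) for t, strict in thresholds]
```

In this toy example the bias is 1 at the lowest thresholds and grows at the higher ones. Multiplying amounts by a constant does not change where it rains, only how many points pass each threshold.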

subjective evaluation

[Figure: panels A, B, C]

histograms of expert scores

[Figure: histogram of mean scores (2 trials). x-axis: Score, 1 to 5 in steps of 0.5; y-axis: 0 to 200; bar labels as extracted: 19 25 10495 176 10392 28 6. Legend: 24 first-trial scores, 22 second-trial scores]

mean score from trial 1 and 2 with 95% confidence bars

[Figure: mean score per case for the 1st trial and 2nd trial. x-axis: 26 Apr, 13 May, 14 May, 18 May, 19 May, 25 May, 1 Jun, 3 Jun, 4 Jun; y-axis: 2 to 3.8]

[Figure: two sets of panels, each showing observation (truth), wrf4ncar, wrf2caps, wrf4ncep]

expert scores vs grid stats

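[Figure: expert scores vs. equitable threat score (Gilbert skill score) and vs. forecast area bias (thresh = 0.07"), with 95% confidence intervals]

expert scores vs grid stats

[Figure: expert scores vs. odds ratio and vs. Pearson product-moment correlation coefficient; labels: regular, bootstrap method]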
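The slides name a "bootstrap method" without giving details. A percentile bootstrap for the confidence interval of a Pearson correlation could look like the sketch below. The resampling scheme, the function names and the example numbers are all assumptions for illustration, not the presenter's procedure or data.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_ci(x, y, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the correlation of paired samples."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]   # resample pairs with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:    # skip degenerate resamples
            stats.append(pearson(xs, ys))
    stats.sort()
    lo = stats[round((1 - level) / 2 * n_resamples)]
    hi = stats[round((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


# Hypothetical paired values: expert score and a grid statistic per forecast.
expert = [2.4, 2.9, 3.1, 3.5, 2.7, 3.0, 3.3, 2.6, 3.6]
grid_stat = [0.05, 0.12, 0.10, 0.21, 0.08, 0.15, 0.18, 0.07, 0.20]
r = pearson(expert, grid_stat)
ci_low, ci_high = bootstrap_ci(expert, grid_stat)
```

Resampling whole pairs keeps each expert score attached to its own grid statistic, which matters because the correlation is a property of the pairing.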

do the expert scores show significant differences among the models?

Student's t-test, 2-tail, paired

model pair           2-trial mean p-value
wrf2caps-wrf4ncar    0.04
wrf2caps-wrf4ncep    0.06
wrf4ncar-wrf4ncep    0.003
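For reference, a two-tailed paired t-test can be sketched with the standard library alone. The p-value here comes from numerically integrating the Student t density, a choice made for this sketch so that it needs no external packages. The function names and example scores are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution (small df; math.gamma overflows for very large df)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson integration of the density over [0, |t|]."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)


def paired_t_test(a, b):
    """Return (t, p) for paired samples; fails if all differences are identical."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, two_tailed_p(t, n - 1)


# Hypothetical mean scores of two models on the same five cases.
model_a = [3.0, 3.2, 2.8, 3.5, 3.1]
model_b = [2.8, 3.1, 2.9, 3.2, 2.9]
t_stat, p_value = paired_t_test(model_a, model_b)
```

Pairing by case removes the case-to-case variation in difficulty, so only the differences between models on the same case enter the test.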

mean (2-trial) score for each model with 95% confidence interval

model      mean score
wrf2caps   2.95
wrf4ncar   3.02
wrf4ncep   2.83
all        2.93

[Figure: mean scores with 95% confidence bars; y-axis: 2.65 to 3.15]

p-value: probability of seeing a difference at least this large by chance if the null hypothesis is true (i.e. no difference in means)

do the expert scores show significant differences among the models?

Wilcoxon-Mann-Whitney rank-sum test (Wilks, p. 138), 2-tail

probability that the difference in ranks is due to chance
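A rank-sum test replaces the scores with their ranks in the pooled sample and asks whether one group's rank sum is unusually high or low. The sketch below uses the large-sample normal approximation without a tie correction, which is a simplification chosen for brevity. The function names and the data in the comment are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                               # extend over the run of ties
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1    # average 1-based rank of the run
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Return (U, z, two-tailed p) using the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    r1 = sum(ranks[:n1])                         # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2                  # Mann-Whitney U statistic
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))         # 2 * (1 - Phi(|z|))
    return u1, z, p


# Hypothetical example: every score in the first group is below the second.
u, z, p = rank_sum_test([1, 2, 3], [4, 5, 6])
```

For samples this small, exact tables are preferable to the normal approximation; the example only shows the mechanics.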




Wilcoxon signed-rank test (Wilks, p. 142) 2-tail
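The signed-rank test is the paired counterpart: it ranks the absolute differences between paired scores and sums the ranks of the positive ones. This is a minimal sketch under the usual conventions (zero differences dropped, tied absolute differences given average ranks, normal approximation for the p-value). The function name and example values are hypothetical.

```python
import math


def signed_rank_test(a, b):
    """Return (T+, z, two-tailed p); needs at least one nonzero difference."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank(value):
        # Average of the 1-based positions that hold this absolute difference.
        first = abs_sorted.index(value) + 1
        count = abs_sorted.count(value)
        return first + (count - 1) / 2

    t_plus = sum(rank(abs(d)) for d in diffs if d > 0)
    mean_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t_plus - mean_t) / sd_t
    p = math.erfc(abs(z) / math.sqrt(2))              # 2 * (1 - Phi(|z|))
    return t_plus, z, p


# Hypothetical paired values where the first member is always larger.
t_plus, z, p = signed_rank_test([3, 4, 5, 6, 7, 8], [2, 2, 2, 2, 2, 2])
```

With floating-point scores, differences that should tie can differ in the last digit, so rounding the differences before ranking may be advisable.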





Date post:	13-Dec-2015
Category:	Documents
Upload:	beatrice-hood