Page 1: Evaluating Hypotheses

EVALUATING HYPOTHESES
How good is my classifier?

Page 2: Evaluating Hypotheses


How good is my classifier?

We have seen the accuracy metric: the classifier's performance on a held-out test set.
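As a minimal sketch (the label vectors here are hypothetical), accuracy is just the fraction of test examples the classifier labels correctly. In R:

  # hypothetical predicted and true labels for five test examples
  predicted <- c("pos", "pos", "neg", "pos", "neg")
  actual    <- c("pos", "neg", "neg", "pos", "neg")
  mean(predicted == actual)   # accuracy = fraction correct (0.8 here)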

Page 3: Evaluating Hypotheses


If we are to trust a classifier’s results

Must keep the classifier blindfolded

Make sure that the classifier never sees the test data

When things seem too good to be true…

First and Foremost…

Page 4: Evaluating Hypotheses


Confusion Matrix

Could collect more information

                   Predicted
                   pos          neg
  Actual   pos     true pos     false neg
           neg     false pos    true neg

Page 5: Evaluating Hypotheses


Sensitivity vs. Specificity

Sensitivity: out of the things that are actually positive, how many did we correctly identify?

Specificity: out of the things that are actually negative, how many did we correctly identify?

Sensitivity = tp / (tp + fn)

Specificity = tn / (tn + fp)

                   Predicted
                   pos          neg
  Actual   pos     true pos     false neg
           neg     false pos    true neg

• The classifier becomes less sensitive as it begins missing the things it is trying to detect.
• If it labels more and more things as the target class, it becomes less specific.
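A minimal R sketch of the two formulas, using hypothetical counts read off a confusion matrix:

  tp <- 40; fn <- 10   # actual positives: detected vs. missed
  fp <- 5;  tn <- 45   # actual negatives: falsely flagged vs. correctly rejected
  sensitivity <- tp / (tp + fn)   # 0.8: fraction of actual positives detected
  specificity <- tn / (tn + fp)   # 0.9: fraction of actual negatives correctly rejected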

Page 6: Evaluating Hypotheses


Can we quantify our uncertainty?
Will the accuracy hold with brand new, never-before-seen data?

Once we're sure no cheating is going on…

Page 7: Evaluating Hypotheses


Binomial Distribution
The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments.

Successes or failures: just what we're looking for!

Page 8: Evaluating Hypotheses


Binomial Distribution

Pr(R = r) = [n! / (r! (n − r)!)] · p^r · (1 − p)^(n − r)

Probability that the random variable R will take on a specific value r

Might be probability of an error or of a positive

Since we have been working with accuracy, let's go with positives.
(The book works with errors.)
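In R this probability is available directly; a minimal sketch with hypothetical values for n, r, and p:

  n <- 100; p <- 0.95             # 100 test examples, assumed true accuracy 0.95
  r <- 93                         # probability of exactly 93 correct?
  dbinom(r, size = n, prob = p)   # Pr(R = r)
  # the same value from the formula written out:
  factorial(n) / (factorial(r) * factorial(n - r)) * p^r * (1 - p)^(n - r)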

Page 9: Evaluating Hypotheses


Binomial Distribution
Very simple calculations

[Figure: binomial PDF with p = .5; x-axis: quantile, y-axis: Density]
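A plot like this can be produced with a couple of lines of R (a sketch; n = 100 is an assumption based on the axis range shown):

  n <- 100; p <- 0.5
  r <- 0:n
  plot(r, dbinom(r, size = n, prob = p), type = "h",
       xlab = "quantile", ylab = "Density", main = "PDF with p = .5")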

Page 10: Evaluating Hypotheses


What Does This Mean?
We can use the observed fraction of successes, r/n, as an estimator of p.
We now have an estimate of p and the distribution given p.
We have the tools to figure out how confident we should be in our estimator.

Page 11: Evaluating Hypotheses


The question: how confident should I be in the accuracy measure?
If we can live with statements like:
"95% of the accuracy measures will fall between 94% and 97%"
then life is good. That statement is a confidence interval.

[Figure: binomial PDF with p = .5; x-axis: quantile, y-axis: Density]

Page 12: Evaluating Hypotheses


How do we calculate it?
We want the quantiles where the area outside is 5%.
We can estimate p.
There are tools available in most programming languages.

[Figure: binomial PDF with p = .5; x-axis: quantile, y-axis: Density]

Page 13: Evaluating Hypotheses


Example

[Figures: binomial PDFs with p = .95 and p = .5; x-axis: quantile, y-axis: Density]

In R:
  lb = qbinom(.025, n, p)
  ub = qbinom(.975, n, p)
The lower and upper bounds constitute the confidence interval.
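A minimal worked sketch, assuming n = 100 test examples and an estimated p of 0.95:

  n <- 100; p <- 0.95
  lb <- qbinom(.025, n, p)   # 2.5% quantile of the number correct
  ub <- qbinom(.975, n, p)   # 97.5% quantile of the number correct
  c(lb, ub) / n              # expressed as an accuracy range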

Page 14: Evaluating Hypotheses


Still, Are We Really This Confident?

[Figure: scatter plot of the data (x-axis: Xs, y-axis: Ys) showing a small, isolated cluster of Blue points]

What if none of the small cluster of Blues were in the training set?

All of them would be in the test set

How well would it do?

Sample error vs. true error

Might have been an accident—a pathological case

Page 15: Evaluating Hypotheses


Cross-Validation
What if we could test the classifier several times with different test sets?
If it performed well each time, wouldn't we be more confident in the results?

Reproducibility. Consistency.

Page 16: Evaluating Hypotheses


K-fold Cross-Validation

Usually we have a big chunk of training data

If we bust it up into randomly drawn chunks

We can train on the remainder

And test with each chunk

[Figure: training data segregated into ten equally sized random sets, numbered 1 through 10]

Page 17: Evaluating Hypotheses


K-fold Cross-Validation

With 10 chunks, we train 10 times.
We now have performance data on ten completely different test datasets.

[Figure: training data segregated into ten equally sized random sets, numbered 1 through 10]
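A minimal sketch of the procedure in R; the data frame trainData, its label column, and the functions fitClassifier and predictClassifier are hypothetical placeholders:

  k <- 10
  set.seed(1)
  folds <- sample(rep(1:k, length.out = nrow(trainData)))   # random fold assignment

  accuracies <- numeric(k)
  for (i in 1:k) {
    test  <- trainData[folds == i, ]          # hold out one chunk
    train <- trainData[folds != i, ]          # train on the remaining chunks
    model <- fitClassifier(train)             # hypothetical training step
    pred  <- predictClassifier(model, test)   # hypothetical prediction step
    accuracies[i] <- mean(pred == test$label)
  }
  mean(accuracies)   # report the average accuracy across the k runs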

Page 18: Evaluating Hypotheses


The classifier must stay blindfolded while training.
We must discard everything it learned after each fold.

Remember: no cheating.

Page 19: Evaluating Hypotheses


10-fold Appears to be Most Common Default

Weka and DataMiner both default to 10-fold

It could just as easily be 20-fold or 25-fold; with 20-fold it would be a 95-5 split.

Performance is reported as the average accuracy across the K runs

Page 20: Evaluating Hypotheses


What is the best K?
This is related to the question of how large the training set should be: it should be large enough to support a test set of size n for which the binomial estimate is reliable.

Rule of thumb: at least 30 test examples, with an accuracy not too close to 0 or 1.

For ten-fold: if 1/10th of the data must be 30 examples, the training set must contain at least 300.
If 10-fold satisfies this, we should be in good shape.

Page 21: Evaluating Hypotheses


Can Even Use Folds of Size One
This is called leave-one-out.
Disadvantage: slow.
Largest possible training set, smallest possible test set.

It has been promoted as an unbiased estimator of error.

Recent studies indicate that there is no unbiased estimator.

Page 22: Evaluating Hypotheses


Recap
We can calculate a confidence interval with a single test set.
More runs (K-fold) give us more confidence that we didn't just get lucky in test-set selection.

Do these runs help narrow the confidence interval?

CONFIDENCE INTERVAL

Page 23: Evaluating Hypotheses


When we average the performance…

The central limit theorem applies: as the number of runs grows, the distribution of the average approaches normal.

With a reasonably large number of runs we can derive a more trustworthy confidence interval.

With 30 test runs (30-fold) we can use traditional approaches to calculating the mean and standard deviation, and therefore confidence intervals.

Page 24: Evaluating Hypotheses


Central Limit Theorem
Consider a set of independent, identically distributed random variables Y1 … Yn governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean Ȳn = (1/n) Σ Yi. Then as n → ∞, the distribution governing (Ȳn − μ) / (σ / √n) approaches a Normal distribution with zero mean and standard deviation equal to 1.

Book: "This is a quite surprising fact, because it states that we know the form of the distribution that governs the sample mean even when we do not know the form of the underlying distribution that governs the individual Yi."
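The theorem is easy to see empirically; a minimal R sketch that averages draws from a decidedly non-normal (uniform) distribution:

  set.seed(1)
  sampleMeans <- replicate(10000, mean(runif(30)))   # 10,000 sample means, n = 30 each
  hist(sampleMeans, breaks = 50,
       main = "Sample means look normal even though the Yi are uniform")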

Page 25: Evaluating Hypotheses


Checking accuracy in R

meanAcc = mean(accuracies)
sdAcc = sd(accuracies)
qnorm(.975, meanAcc, sdAcc)   # 0.9980772
qnorm(.025, meanAcc, sdAcc)   # 0.8169336

[Figure: histogram titled "Distribution of Accuracies"; x-axis: Accuracy]

Page 26: Evaluating Hypotheses


Can we say that one classifier is significantly better than another?

T-test. Null hypothesis: the two sets of accuracies come from the same distribution.

My Classifier's Better than Yours

[Figure: plot titled "Two Accuracy distributions"; x-axis: Accuracy]

Page 27: Evaluating Hypotheses


T-test
In R:
  t.test(distOne, distTwo, paired = TRUE)

Paired t-test

data:  distOne and distTwo
t = -55.8756, df = 29, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2052696 -0.1907732
sample estimates:
mean of the differences
             -0.1980214
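A minimal sketch of setting up such a comparison; distOne and distTwo here are simulated per-fold accuracies, purely for illustration:

  set.seed(42)
  distOne <- rnorm(30, mean = 0.92, sd = 0.01)   # classifier A, 30 folds
  distTwo <- rnorm(30, mean = 0.94, sd = 0.01)   # classifier B, same 30 folds
  # paired: each fold contributes one accuracy from each classifier
  t.test(distOne, distTwo, paired = TRUE)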

[Figure: plot titled "Student's t Distribution"; x-axis: Differences]

Page 28: Evaluating Hypotheses


T-test

In Perl:

use Statistics::TTest;

my $ttest = new Statistics::TTest;
$ttest->load_data(\@r1, \@r2);
$ttest->set_significance(95);
$ttest->print_t_test();
print "\n\nt statistic is " . $ttest->t_statistic . "\n";
print "p val " . $ttest->{t_prob} . "\n";

[Figure: plot titled "Student's t Distribution"; x-axis: Differences]

t_prob: 0
significance: 95
…
df1: 29
alpha: 0.025
t_statistic: 12.8137016607408
null_hypothesis: rejected

t statistic is 12.8137016607408
p val 0

Page 29: Evaluating Hypotheses


Example: would you trust this classifier?

The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set.

The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000.

The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%.

The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%.

Page 30: Evaluating Hypotheses


Randomly permute an array (from the Perl Cookbook)

http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm

A Useful Technique

sub fisher_yates_shuffle {
    my $array = shift;
    my $i;
    for ($i = @$array; --$i; ) {
        my $j = int rand($i + 1);
        next if $i == $j;
        @$array[$i, $j] = @$array[$j, $i];
    }
}
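Shuffling like this is how the training data gets split into random chunks. In R (used elsewhere in these slides) an equivalent shuffle is a one-liner; trainData is a hypothetical data frame:

  shuffled <- trainData[sample(nrow(trainData)), ]   # rows in random order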


Page 32: Evaluating Hypotheses


What about chi-squared?

