Page 1: Evaluating Hypotheses

EVALUATING HYPOTHESES
How good is my classifier?

Page 2: Evaluating Hypotheses


How good is my classifier?

We have seen the accuracy metric: the classifier's performance on a held-out test set.
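As a minimal sketch (the label vectors here are hypothetical), accuracy is just the fraction of test examples the classifier labels correctly. In R:

  # hypothetical predicted and true labels for five test examples
  predicted <- c("pos", "pos", "neg", "pos", "neg")
  actual    <- c("pos", "neg", "neg", "pos", "neg")
  mean(predicted == actual)   # accuracy = fraction correct (0.8 here)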

Page 3: Evaluating Hypotheses


If we are to trust a classifier’s results

Must keep the classifier blindfolded

Make sure that the classifier never sees the test data

When things seem too good to be true…

First and Foremost…

Page 4: Evaluating Hypotheses


Confusion Matrix

Could collect more information

                   Predicted
                   pos          neg
  Actual   pos     true pos     false neg
           neg     false pos    true neg

Page 5: Evaluating Hypotheses


Sensitivity vs. Specificity

Sensitivity: out of the things that are actually positive, how many did we correctly identify?

Specificity: out of the things that are actually negative, how many did we correctly identify?

Sensitivity = tp / (tp + fn)

Specificity = tn / (tn + fp)

                   Predicted
                   pos          neg
  Actual   pos     true pos     false neg
           neg     false pos    true neg

• The classifier becomes less sensitive as it begins missing the things it is trying to detect.
• If it labels more and more things as the target class, it becomes less specific.
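A minimal R sketch of the two formulas, using hypothetical counts read off a confusion matrix:

  tp <- 40; fn <- 10   # actual positives: detected vs. missed
  fp <- 5;  tn <- 45   # actual negatives: falsely flagged vs. correctly rejected
  sensitivity <- tp / (tp + fn)   # 0.8: fraction of actual positives detected
  specificity <- tn / (tn + fp)   # 0.9: fraction of actual negatives correctly rejected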

Page 6: Evaluating Hypotheses


Can we quantify our uncertainty?
Will the accuracy hold with brand new, never-before-seen data?

Once we're sure no cheating is going on…

Page 7: Evaluating Hypotheses


Binomial Distribution
The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments.

Successes or failures: just what we're looking for!

Page 8: Evaluating Hypotheses


Binomial Distribution

Pr(R = r) = [n! / (r! (n − r)!)] · p^r · (1 − p)^(n − r)

Probability that the random variable R will take on a specific value r

Might be probability of an error or of a positive

Since we have been working with accuracy, let's go with positives.
(The book works with errors.)
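In R this probability is available directly; a minimal sketch with hypothetical values for n, r, and p:

  n <- 100; p <- 0.95             # 100 test examples, assumed true accuracy 0.95
  r <- 93                         # probability of exactly 93 correct?
  dbinom(r, size = n, prob = p)   # Pr(R = r)
  # the same value from the formula written out:
  factorial(n) / (factorial(r) * factorial(n - r)) * p^r * (1 - p)^(n - r)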

Page 9: Evaluating Hypotheses


Binomial Distribution
Very simple calculations

[Figure: binomial PDF with p = .5; x-axis: quantile, y-axis: Density]
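A plot like this can be produced with a couple of lines of R (a sketch; n = 100 is an assumption based on the axis range shown):

  n <- 100; p <- 0.5
  r <- 0:n
  plot(r, dbinom(r, size = n, prob = p), type = "h",
       xlab = "quantile", ylab = "Density", main = "PDF with p = .5")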

Page 10: Evaluating Hypotheses


What Does This Mean?
We can use the observed fraction of successes, r/n, as an estimator of p.
We now have an estimate of p and the distribution given p.
We have the tools to figure out how confident we should be in our estimator.

Page 11: Evaluating Hypotheses


The question: how confident should I be in the accuracy measure?
If we can live with statements like:
"95% of the accuracy measures will fall between 94% and 97%"
then life is good. That statement is a confidence interval.

[Figure: binomial PDF with p = .5; x-axis: quantile, y-axis: Density]

Page 12: Evaluating Hypotheses


How do we calculate it?
We want the quantiles where the area outside is 5%.
We can estimate p.
There are tools available in most programming languages.

[Figure: binomial PDF with p = .5; x-axis: quantile, y-axis: Density]

Page 13: Evaluating Hypotheses


Example

[Figures: binomial PDFs with p = .95 and p = .5; x-axis: quantile, y-axis: Density]

In R:
  lb = qbinom(.025, n, p)
  ub = qbinom(.975, n, p)
The lower and upper bounds constitute the confidence interval.
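A minimal worked sketch, assuming n = 100 test examples and an estimated p of 0.95:

  n <- 100; p <- 0.95
  lb <- qbinom(.025, n, p)   # 2.5% quantile of the number correct
  ub <- qbinom(.975, n, p)   # 97.5% quantile of the number correct
  c(lb, ub) / n              # expressed as an accuracy range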

Page 14: Evaluating Hypotheses


Still, Are We Really This Confident?

[Figure: scatter plot of the data (x-axis: Xs, y-axis: Ys) showing a small, isolated cluster of Blue points]

What if none of the small cluster of Blues were in the training set?

All of them would be in the test set

How well would it do?

Sample error vs. true error

Might have been an accident—a pathological case

Page 15: Evaluating Hypotheses


Cross-Validation
What if we could test the classifier several times with different test sets?
If it performed well each time, wouldn't we be more confident in the results?

Reproducibility. Consistency.

Page 16: Evaluating Hypotheses


K-fold Cross-Validation

Usually we have a big chunk of training data

If we bust it up into randomly drawn chunks

We can train on the remainder

And test with each chunk

[Figure: training data segregated into ten equally sized random sets, numbered 1 through 10]

Page 17: Evaluating Hypotheses


K-fold Cross-Validation

With 10 chunks, we train 10 times.
We now have performance data on ten completely different test datasets.

[Figure: training data segregated into ten equally sized random sets, numbered 1 through 10]
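A minimal sketch of the procedure in R; the data frame trainData, its label column, and the functions fitClassifier and predictClassifier are hypothetical placeholders:

  k <- 10
  set.seed(1)
  folds <- sample(rep(1:k, length.out = nrow(trainData)))   # random fold assignment

  accuracies <- numeric(k)
  for (i in 1:k) {
    test  <- trainData[folds == i, ]          # hold out one chunk
    train <- trainData[folds != i, ]          # train on the remaining chunks
    model <- fitClassifier(train)             # hypothetical training step
    pred  <- predictClassifier(model, test)   # hypothetical prediction step
    accuracies[i] <- mean(pred == test$label)
  }
  mean(accuracies)   # report the average accuracy across the k runs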

Page 18: Evaluating Hypotheses


The classifier must stay blindfolded while training.
We must discard everything it learned after each fold.

Remember: no cheating.

Page 19: Evaluating Hypotheses


10-fold Appears to be Most Common Default

Weka and DataMiner both default to 10-fold

It could just as easily be 20-fold or 25-fold; with 20-fold it would be a 95-5 split.

Performance is reported as the average accuracy across the K runs

Page 20: Evaluating Hypotheses


What is the best K?
This is related to the question of how large the training set should be: it should be large enough to support a test set of size n for which the binomial estimate is reliable.

Rule of thumb: at least 30 test examples, with an accuracy not too close to 0 or 1.

For ten-fold: if 1/10th of the data must be 30 examples, the training set must contain at least 300.
If 10-fold satisfies this, we should be in good shape.

Page 21: Evaluating Hypotheses


Can Even Use Folds of Size One
This is called leave-one-out.
Disadvantage: slow.
Largest possible training set, smallest possible test set.

It has been promoted as an unbiased estimator of error.

Recent studies indicate that there is no unbiased estimator.

Page 22: Evaluating Hypotheses


Recap
We can calculate a confidence interval with a single test set.
More runs (K-fold) give us more confidence that we didn't just get lucky in test-set selection.

Do these runs help narrow the confidence interval?

CONFIDENCE INTERVAL

Page 23: Evaluating Hypotheses


When we average the performance…

The central limit theorem applies: as the number of runs grows, the distribution of the average approaches normal.

With a reasonably large number of runs we can derive a more trustworthy confidence interval.

With 30 test runs (30-fold) we can use traditional approaches to calculating the mean and standard deviation, and therefore confidence intervals.

Page 24: Evaluating Hypotheses


Central Limit Theorem
Consider a set of independent, identically distributed random variables Y1 … Yn governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean Ȳn = (1/n) Σ Yi. Then as n → ∞, the distribution governing (Ȳn − μ) / (σ / √n) approaches a Normal distribution with zero mean and standard deviation equal to 1.

Book: "This is a quite surprising fact, because it states that we know the form of the distribution that governs the sample mean even when we do not know the form of the underlying distribution that governs the individual Yi."
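The theorem is easy to see empirically; a minimal R sketch that averages draws from a decidedly non-normal (uniform) distribution:

  set.seed(1)
  sampleMeans <- replicate(10000, mean(runif(30)))   # 10,000 sample means, n = 30 each
  hist(sampleMeans, breaks = 50,
       main = "Sample means look normal even though the Yi are uniform")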

Page 25: Evaluating Hypotheses


Checking accuracy in R

meanAcc = mean(accuracies)
sdAcc = sd(accuracies)
qnorm(.975, meanAcc, sdAcc)   # 0.9980772
qnorm(.025, meanAcc, sdAcc)   # 0.8169336

[Figure: histogram titled "Distribution of Accuracies"; x-axis: Accuracy]

Page 26: Evaluating Hypotheses


Can we say that one classifier is significantly better than another?

T-test. Null hypothesis: the two sets of accuracies come from the same distribution.

My Classifier's Better than Yours

[Figure: plot titled "Two Accuracy distributions"; x-axis: Accuracy]

Page 27: Evaluating Hypotheses


T-test
In R:
  t.test(distOne, distTwo, paired = TRUE)

Paired t-test

data:  distOne and distTwo
t = -55.8756, df = 29, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2052696 -0.1907732
sample estimates:
mean of the differences
             -0.1980214
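A minimal sketch of setting up such a comparison; distOne and distTwo here are simulated per-fold accuracies, purely for illustration:

  set.seed(42)
  distOne <- rnorm(30, mean = 0.92, sd = 0.01)   # classifier A, 30 folds
  distTwo <- rnorm(30, mean = 0.94, sd = 0.01)   # classifier B, same 30 folds
  # paired: each fold contributes one accuracy from each classifier
  t.test(distOne, distTwo, paired = TRUE)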

[Figure: plot titled "Student's t Distribution"; x-axis: Differences]

Page 28: Evaluating Hypotheses


T-test

In Perl:

use Statistics::TTest;

my $ttest = new Statistics::TTest;
$ttest->load_data(\@r1, \@r2);
$ttest->set_significance(95);
$ttest->print_t_test();
print "\n\nt statistic is " . $ttest->t_statistic . "\n";
print "p val " . $ttest->{t_prob} . "\n";

[Figure: plot titled "Student's t Distribution"; x-axis: Differences]

t_prob: 0
significance: 95
…
df1: 29
alpha: 0.025
t_statistic: 12.8137016607408
null_hypothesis: rejected

t statistic is 12.8137016607408
p val 0

Page 29: Evaluating Hypotheses


Example: would you trust this classifier?

The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set.

The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000.

The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%.

The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%.

Page 30: Evaluating Hypotheses


Randomly permute an array (from the Perl Cookbook)

http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm

A Useful Technique

sub fisher_yates_shuffle {
    my $array = shift;
    my $i;
    for ($i = @$array; --$i; ) {
        my $j = int rand($i + 1);
        next if $i == $j;
        @$array[$i, $j] = @$array[$j, $i];
    }
}
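Shuffling like this is how the training data gets split into random chunks. In R (used elsewhere in these slides) an equivalent shuffle is a one-liner; trainData is a hypothetical data frame:

  shuffled <- trainData[sample(nrow(trainData)), ]   # rows in random order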


Page 32: Evaluating Hypotheses


What about chi-squared?

