Date post: 19-Dec-2015

Statistical Methods in Computer Science

Hypothesis Testing III: Categorical dependence and χ²

Ido Dagan

Empirical Methods in Computer Science © 2006-now Gal Kaminka/Ido Dagan

Testing Categorical Data

Single-factor, with categorical dependent variable
  Determine the effect of the independent variable's values (nominal/categorical too) on the dependent variable

    treatment1   Ind1 & Ex1 & Ex2 & ... & Exn ==> Dep1
    treatment2   Ind2 & Ex1 & Ex2 & ... & Exn ==> Dep2
    control             Ex1 & Ex2 & ... & Exn ==> Dep3

    (Ind1, Ind2: values of the independent variable; Dep1, Dep2, Dep3: values of the dependent variable)

Compare performance of algorithm A to B to C ...
  Dep1, Dep2, Dep3, ... are categorical data
  Cannot use the numerical ANOVA testing

Contingency tables

We collect categorical data in contingency tables

Different screens and their hue bias:

             Blueish  Reddish  Greenish  Total
Display 1       46       82       72      200
Display 2       42       38       20      100
Display 3       52       40        8      100
Total          140      160      100      400

Coffee and tea drinkers in a sample of 100 people (-X = does not drink X):

         Coffee  -Coffee  Total
Tea        20       5       25
-Tea       70       5       75
Total      90      10      100

Operating system used at home:

Windows  Linux  MacOS X  Other  Total
  130     55      50      21     256

Is there a relation?

We want to know whether there is a dependence between the variables, or a difference between what we observed and some expectation
  e.g., are tea-drinkers more likely to be coffee-drinkers?
  e.g., is the selection of operating systems different from expected?

To find out, we cannot ask about means, medians, ...
Focus instead on proportions/distributions:
  Does the observed distribution differ from the expected one?

Single Sample

Suppose we have an a-priori notion of what is expected
  For instance, selection of OS is expected to be uniform

How do we tell whether the observed data is different?

Observed:

Windows  Linux  MacOS X  Other  Total
  130     55      50      21     256

Expected:

Windows  Linux   MacOS X  Other   Total
25.00%   25.00%  25.00%   25.00%  100.00%

Hypotheses

The hypotheses refer to the number of data points (the frequency) of each value

H0 : Expected distribution = underlying distribution of observed data

H1 : Expected distribution != underlying distribution of observed data

Need a method to measure how likely the observed data are under the null hypothesis

Difference in Expectations

Focus on the difference between expected and observed frequencies

For instance: Expected (out of 256 sampled homes)
  25% for each operating system = 64 homes for each operating system

Windows  Linux   MacOS X  Other   Total
25.00%   25.00%  25.00%   25.00%  100.00%
  64       64      64       64     256

Observed (out of 256 sampled homes)

Windows  Linux  MacOS X  Other  Total
  130     55      50      21     256

Chi-square

The chi-square ($\chi^2$) of given data is:

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

Where:
  $f_e$ is the expected frequency of the value
  $f_o$ is the observed frequency of the value
  The sum is over all values

Recall from Cramer correlation
  Chi-square values of random samples are distributed according to the chi-square distribution
  A family of distributions, according to degrees of freedom
  A large enough chi-square indicates significance: reject the null hypothesis
  Too low values are unlikely: check for error
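
The definition above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides; the function name chi_square is an assumption made for this example.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all values (categories)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Identical observed and expected frequencies give a chi-square of zero
print(chi_square([64, 64, 64, 64], [64, 64, 64, 64]))
```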

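In our example

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(130-64)^2}{64} + \frac{(55-64)^2}{64} + \frac{(50-64)^2}{64} + \frac{(21-64)^2}{64} = 101.28$

This is compared to the chi-square distribution with df = 3 (4 categories - 1)

Result: highly significant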
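
A minimal sketch of how the comparison against the chi-square distribution could be done without a statistics package. The helper chi2_sf is hypothetical (it is not from the slides); it computes the upper-tail probability for integer degrees of freedom using the standard recurrence on the chi-square tail, starting from the closed forms for df = 1 and df = 2.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case)
    if df % 2 == 1:
        q, nu = math.erfc(math.sqrt(half)), 1
    else:
        q, nu = math.exp(-half), 2
    # Recurrence: Q(nu + 2) = Q(nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


observed = [130, 55, 50, 21]   # Windows, Linux, MacOS X, Other
expected = [64, 64, 64, 64]    # uniform expectation over 256 homes

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1
p = chi2_sf(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.3g}")
```

The p value here is far below any usual significance threshold, which matches the "highly significant" result.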

Features of chi-square

Cannot be negative (sum of squares)
  Only tests for difference from the expected distribution
  But here it is a one-tailed test (only tests for a greater value)

Equals 0 when all frequencies are exactly as expected

It's not the size of the discrepancy, but its relative size
  Relative to the expected frequency

Depends on the number of discrepancies
  This is why we have to consider degrees of freedom: the more df, the more discrepancies we might get by chance
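
A small illustration of the "relative size" point, with made-up frequencies (they are assumptions, not data from the slides): the same absolute discrepancy of 10 contributes much more when the expected frequency is small.

```python
def contribution(fo, fe):
    """Contribution of a single value to chi-square: (f_o - f_e)^2 / f_e."""
    return (fo - fe) ** 2 / fe


small_expected = contribution(30, 20)     # discrepancy of 10 against 20 expected
large_expected = contribution(210, 200)   # discrepancy of 10 against 200 expected
print(small_expected, large_expected)
```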

Comparing against arbitrary distribution

Do we always need to assume the population is uniform/known?

No: Instead of expecting:

Windows  Linux   MacOS X  Other   Total
25.00%   25.00%  25.00%   25.00%  100.00%
  64       64      64       64     256

Can have different expectations (from the past), for instance:

Windows  Linux   MacOS X  Other   Total
50.00%   20.00%  20.00%   10.00%  100.00%
 128       51      51       26     256

Comparing against arbitrary distributions

          Windows  Linux  MacOS X  Other  Total
Observed    130     55      50      21     256
Expected    128     51      51      26     256

$\chi^2 = \frac{(130-128)^2}{128} + \frac{(55-51)^2}{51} + \frac{(50-51)^2}{51} + \frac{(21-26)^2}{26} = 1.33$

This is compared to the chi-square distribution with df = 3
Result: not significant

The same procedure may be used for testing against any expected distribution (e.g. Normal, to decide on t-test applicability)
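
A sketch of the same calculation starting from the expected proportions. The variable names are assumptions for this example. The expected counts are rounded to whole homes, as on the slide, and the result is compared to 7.815, the standard 0.05 critical value of the chi-square distribution with df = 3.

```python
observed = [130, 55, 50, 21]              # Windows, Linux, MacOS X, Other
proportions = [0.50, 0.20, 0.20, 0.10]    # expectations from the past

n = sum(observed)
# Rounded to whole homes to match the slide (128, 51, 51, 26)
expected = [round(p * n) for p in proportions]

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

CRITICAL_DF3_ALPHA_05 = 7.815             # chi-square table value, df = 3
significant = chi2 > CRITICAL_DF3_ALPHA_05
print(expected, round(chi2, 2), significant)
```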

Multiple-variable contingency tables

Correlation testing
  Two numerical variables

Single-factor testing (one-way ANOVA)
  Independent variable: categorical
  Dependent variable: numerical

Two-way contingency table: two categorical variables
  As seen for Cramer correlation

Different screens and their hue bias:

             Blueish  Reddish  Greenish  Total
Display 1       46       82       72      200
Display 2       42       38       20      100
Display 3       52       40        8      100
Total          140      160      100      400

Hypotheses about contingency tables

Null hypothesis (H0): values of the variables are independent of each other
  Analogous to correlation and ANOVA tests

Alternative (H1): values of the variables are dependent

How do we use chi-square here?
  Calculate expected frequencies from the marginal probabilities
  Calculate the chi-square value
  Compare to the chi-square distribution

Example

Independent probabilities:
  Any display has probability 140/400 of being blueish, 160/400 of being reddish, 100/400 of being greenish
  Any hue has probability 200/400 of being of display 1, 100/400 of display 2, 100/400 of display 3

             Blueish  Reddish  Greenish  Total
Display 1       46       82       72      200
Display 2       42       38       20      100
Display 3       52       40        8      100
Total          140      160      100      400

Example: Expected Frequencies

Given the above table, what is the likelihood (if the variables are independent) of:

  Being a blueish display 1?
    $p = \frac{140}{400} \times \frac{200}{400} = 0.175$, i.e. $0.175 \times 400 = 70$ cases

  Being a greenish display 2?
    $p = \frac{100}{400} \times \frac{100}{400} = 0.0625$, i.e. $0.0625 \times 400 = 25$ cases

Translating into expected frequencies

Observed:

             Blueish  Reddish  Greenish  Total
Display 1       46       82       72      200
Display 2       42       38       20      100
Display 3       52       40        8      100
Total          140      160      100      400

Expected:

             Blueish  Reddish  Greenish  Total
Display 1       70       80       50      200
Display 2       35       40       25      100
Display 3       35       40       25      100
Total          140      160      100      400
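
The expected table can be reproduced from the margins alone. A minimal sketch (variable names are assumptions for this example): each expected frequency is row total × column total / grand total, which is the product of the two marginal probabilities scaled back to counts.

```python
observed = [
    [46, 82, 72],   # Display 1: blueish, reddish, greenish
    [42, 38, 20],   # Display 2
    [52, 40, 8],    # Display 3
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence: expected = (row total / n) * (column total / n) * n
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```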

Chi-square of contingency table

In this example the chi-square is:

Observed - expected:

             Blueish  Reddish  Greenish
Display 1     46-70    82-80    72-50
Display 2     42-35    38-40    20-25
Display 3     52-35    40-40     8-25

Squared and divided by the expected frequency:

             Blueish  Reddish  Greenish
Display 1      8.23     0.05     9.68
Display 2      1.40     0.10     1.00
Display 3      8.26     0.00    11.56

Summed: chi-square = 40.28
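
A sketch of the per-cell calculation for the display/hue table, using the observed and expected frequencies given earlier. Variable names are assumptions for this example.

```python
observed = [
    [46, 82, 72],
    [42, 38, 20],
    [52, 40, 8],
]
expected = [
    [70, 80, 50],
    [35, 40, 25],
    [35, 40, 25],
]

# Per-cell contribution: (observed - expected)^2 / expected
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(v, 2) for v in row])
print("chi-square =", round(chi2, 2))
```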

Degrees of freedom of contingency table

There are two variables
  One has 2 degrees of freedom (3 possible values)
  The other has 2 degrees of freedom (3 possible values)
  Total: 2 * 2 = 4 degrees of freedom

Knowing the marginal values, setting 4 values in the table allows us to derive the 5 remaining ones

In general: df = (#rows - 1) · (#columns - 1)

Comparison against the chi-square distribution with df = 4:
  Significant: p < 0.01
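
A sketch of why there are 4 degrees of freedom: given the margins, fixing 4 cells (here the top-left 2x2 block, an arbitrary choice for this example) determines the other 5. The p value uses the closed-form upper tail of the chi-square distribution for df = 4, exp(-x/2) · (1 + x/2).

```python
import math

row_totals = [200, 100, 100]
col_totals = [140, 160, 100]

# The 4 freely chosen cells: displays 1-2, blueish and reddish
free_cells = [
    [46, 82],
    [42, 38],
]

# Last column of the first two rows follows from the row totals
table = [row + [row_totals[i] - sum(row)] for i, row in enumerate(free_cells)]
# Last row follows from the column totals
table.append([col_totals[j] - table[0][j] - table[1][j] for j in range(3)])

df = (len(row_totals) - 1) * (len(col_totals) - 1)

chi2 = 40.28
p = math.exp(-chi2 / 2) * (1 + chi2 / 2)   # valid for df = 4 only
print(table, df, p < 0.01)
```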

Chi-square in Spreadsheet

Excel and OpenOffice spreadsheets have chi-square testing

chitest(observed, expected)
  Where observed and expected are data arrays of the same size
  Gives back the p value of the observed data under the null hypothesis

Interpreting dependence

Knowing that variables are dependent does not tell us how
  Analogous to the single-factor ANOVA testing we learned (but unlike Pearson correlation, which is directional)
  Have to consider the interpretation carefully

For instance, consider the following data
  Suppose we are investigating the following relation:
  Are people who drink tea likely to drink coffee? (i.e., is tea-drinking correlated with coffee-drinking?)

Interpreting Dependence

We gather the following results:

         Coffee  -Coffee  Total
Tea        20       5       25
-Tea       70       5       75
Total      90      10      100

The chi-square test is significant at about the 0.05 level (chi-square = 3.70, df = 1, p ≈ 0.05)
But what is the direction of dependence?

Interpreting Dependence

Naively: Out of 25 tea drinkers, 20 also drink coffee (80%!)
But: 90% of people drink coffee (a-priori)
     93% of non-tea drinkers drink coffee
So: Negative dependence between tea and coffee!
    (can be revealed by comparing observed to expected)
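
A sketch of how comparing observed to expected reveals the direction of the dependence in the tea/coffee table. Variable names are assumptions for this example. A negative residual in the (tea, coffee) cell means fewer tea-and-coffee drinkers than independence would predict.

```python
observed = [
    [20, 5],   # Tea:  coffee, no coffee
    [70, 5],   # -Tea: coffee, no coffee
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
residuals = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

coffee_given_tea = observed[0][0] / row_totals[0]       # 20 / 25
coffee_given_no_tea = observed[1][0] / row_totals[1]    # 70 / 75
coffee_overall = col_totals[0] / n                      # 90 / 100

print("expected:", expected)
print("residual (tea, coffee):", residuals[0][0])
print("chi-square:", round(chi2, 2))
print(coffee_given_tea, round(coffee_given_no_tea, 3), coffee_overall)
```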

