Statistical Methods in Computer Science

Hypothesis Testing III: Categorical dependence and χ²

Ido Dagan

Empirical Methods in Computer Science © 2006-now Gal Kaminka/Ido Dagan

Testing Categorical Data

Single-factor, with categorical dependent variable
  Determine the effect of the independent variable's values (nominal/categorical too) on the dependent variable

  treatment1   Ind1 & Ex1 & Ex2 & .... & Exn ==> Dep1
  control             Ex1 & Ex2 & .... & Exn ==> Dep3

  (Ind1, ... are values of the independent variable; Dep1, Dep2, Dep3, .... are values of the dependent variable)

Compare performance of algorithm A to B to C ....
Dep1, Dep2, Dep3, .... are categorical data
Cannot use the usual numerical ANOVA testing

Contingency tables

We collect categorical data in contingency tables

Different screens and their hue bias:

             Blueish   Reddish   Greenish   Total
  Display 1       46        82         72     200
  Display 2       42        38         20     100
  Display 3       52        40          8     100
  Total          140       160        100     400

Coffee and tea drinkers in a sample of 100 people:

           Coffee   -Coffee   Total
  Tea          20         5      25
  -Tea         70         5      75
  Total        90        10     100

Operating system used at home:

  Windows   Linux   MacOS X   Other   Total
      130      55        50      21     256
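A contingency table is simply a tally of (row value, column value) pairs. The sketch below is illustrative only: the raw observations are expanded from the coffee/tea counts above, and the names `observations`, `cells`, `row_totals` and `col_totals` are assumptions made for this example, not part of the slides.

```python
from collections import Counter

# Raw observations as (tea habit, coffee habit) pairs, expanded from the
# coffee/tea counts in the table above (illustrative reconstruction).
observations = (
    [("Tea", "Coffee")] * 20
    + [("Tea", "-Coffee")] * 5
    + [("-Tea", "Coffee")] * 70
    + [("-Tea", "-Coffee")] * 5
)

# Each cell of the contingency table counts one (row, column) combination.
cells = Counter(observations)

# Marginal totals: sum the cells over the other variable.
row_totals = Counter()
col_totals = Counter()
for (row, col), count in cells.items():
    row_totals[row] += count
    col_totals[col] += count

total = sum(cells.values())
print(dict(cells))
print(dict(row_totals), dict(col_totals), total)
```

The marginal totals computed here are the same row and column sums that appear in the "Total" row and column of the tables.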

Is there a relation?

We want to know whether there is a dependence between the variables, or a difference between what we observed and some expectation
  e.g., are tea-drinkers more likely to be coffee-drinkers?
  e.g., is the selection of operating systems different from expected?
To find out, we cannot ask about means, medians, ...
Focus instead on proportions/distributions:
  Does the observed distribution differ from the expected one?

Single Sample

Suppose we have an a-priori notion of what's expected
  For instance, selection of OS is expected to be uniform:

  Windows   Linux    MacOS X   Other    Total
  25.00%    25.00%   25.00%    25.00%   100.00%

How do we tell whether the observed data is different?

Hypotheses

H0 : Expected distribution = underlying distribution of observed data
  "Distribution" refers to the number of data points of each value
H1 : Expected != Observed

Need a method to measure how likely the observed data is under the null hypothesis

Difference in Expectations

Focus on the difference between expected and observed

For instance:
  Expected (out of 256 sampled homes): 25% for each operating system = 64 homes for each operating system

    Windows   Linux    MacOS X   Other    Total
    25.00%    25.00%   25.00%    25.00%   100.00%
        64       64        64       64       256

  Observed (out of 256 sampled homes):

    Windows   Linux   MacOS X   Other   Total
        130      55        50      21     256

Chi-square

The chi-square ($\chi^2$) of given data is:

  $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

Where:
  $f_e$ is the expected frequency of the value
  $f_o$ is the observed frequency of the value
  The sum is over all values

Recall from Cramer correlation
Chi-square values of random samples are distributed according to the chi-square distribution
  A family of distributions, indexed by degrees of freedom
Large enough chi-square indicates significance: reject the null hypothesis
Too low values are unlikely: check for error

In our example

  $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(130-64)^2}{64} + \frac{(55-64)^2}{64} + \frac{(50-64)^2}{64} + \frac{(21-64)^2}{64} = 101.28$

This is compared to the chi-square distribution with df = 3

Result: highly significant
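The operating-system computation can be reproduced in a few lines. This is a minimal sketch, not the course's own code: the function name `chi_square` is an assumption, and the decision step uses the standard tabulated critical value 7.815 (chi-square, df = 3, significance level 0.05) rather than an exact p-value.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all values."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [130, 55, 50, 21]                     # Windows, Linux, MacOS X, Other
n = sum(observed)                                # 256 sampled homes
expected = [n / len(observed)] * len(observed)   # uniform expectation: 64 each

chi2 = chi_square(observed, expected)
df = len(observed) - 1                           # 4 values -> 3 degrees of freedom

# Tabulated 0.05 critical value of the chi-square distribution with df = 3.
CRITICAL_05_DF3 = 7.815
print(round(chi2, 2), df, chi2 > CRITICAL_05_DF3)
```

A statistic far above the critical value, as here, is what "highly significant" refers to.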

Features of chi-square

Cannot be negative (sum of squares)
  Only tests for difference from expected distribution
  But here it is a one-tailed test (only tests for greater value)
Equals 0 when all frequencies are exactly as expected
It's not the size of the discrepancy, but its relative size
  Relative to the expected frequency
Depends on the number of discrepancies
  This is why we have to consider degrees of freedom: the more df, the more discrepancies we might get by chance
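To see the "relative size" point concretely, compare two cells that miss their expectation by the same absolute amount. The frequencies below are hypothetical, chosen only for illustration, and `contribution` is an assumed helper name.

```python
def contribution(fo, fe):
    """Contribution of one value to chi-square: (f_o - f_e)^2 / f_e."""
    return (fo - fe) ** 2 / fe

# Hypothetical frequencies: both cells are off by exactly 10.
small_expected = contribution(30, 20)     # 100 / 20
large_expected = contribution(210, 200)   # 100 / 200

print(small_expected, large_expected)
```

The same discrepancy of 10 weighs ten times more when the expected frequency is 20 than when it is 200.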


Comparing against arbitrary distribution

Do we always need to assume the population is uniform/known?

No: Instead of expecting:

  Windows   Linux    MacOS X   Other    Total
  25.00%    25.00%   25.00%    25.00%   100.00%

Can have different expectations (from past), for instance:

  Windows   Linux    MacOS X   Other    Total
  50.00%    20.00%   20.00%    10.00%   100.00%
     128       51        51       26       256


Comparing against arbitrary distributions

             Windows   Linux   MacOS X   Other   Total
  Observed       130      55        50      21     256
  Expected       128      51        51      26     256

  $\chi^2 = \frac{(130-128)^2}{128} + \frac{(55-51)^2}{51} + \frac{(50-51)^2}{51} + \frac{(21-26)^2}{26} = 1.33$

This is compared to the chi-square distribution with df = 3
Result: not significant

The same procedure may be used for testing against any expected distribution (e.g., Normal, to decide on t-test applicability)
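Turning expected proportions into expected counts is the only new step here. The sketch below assumes the 50/20/20/10 split from the slide and shows both the rounded counts used there and the unrounded ones; the variable names and the helper `chi_square` are assumptions for illustration, and 7.815 is the standard tabulated 0.05 critical value for df = 3.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all values."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [130, 55, 50, 21]              # Windows, Linux, MacOS X, Other
proportions = [0.50, 0.20, 0.20, 0.10]    # expectation taken from the past
n = sum(observed)                         # 256

exact = [p * n for p in proportions]      # unrounded expected counts
rounded = [round(e) for e in exact]       # whole homes, as shown on the slide

chi2_rounded = chi_square(observed, rounded)
chi2_exact = chi_square(observed, exact)

# Tabulated 0.05 critical value of the chi-square distribution with df = 3.
CRITICAL_05_DF3 = 7.815
print(round(chi2_rounded, 2), round(chi2_exact, 2))
print(chi2_rounded > CRITICAL_05_DF3, chi2_exact > CRITICAL_05_DF3)
```

Rounding the expected counts shifts the statistic slightly, but either way it stays well below the critical value, so the conclusion (not significant) does not change.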


Multiple-variable contingency tables

Correlation-testing
  Two numerical variables
Single-factor testing (one-way ANOVA)
  Independent variable: Categorical
  Dependent variable: Numerical
Two-way contingency table: two categorical variables
  As seen for Cramer correlation



Hypotheses about contingency tables

Null hypothesis: values of the variables are independent of each other
  Analogous to correlation and ANOVA tests
Alternative (H1): values of the variables are dependent

How do we use chi-square here?
  Calculate expected frequencies from the marginal probabilities
  Calculate chi-square value
  Compare to chi-square distribution

Example

             Blueish   Reddish   Greenish   Total
  Display 1       46        82         72     200
  Display 2       42        38         20     100
  Display 3       52        40          8     100
  Total          140       160        100     400

Independent probabilities:
  Any display has probability of 140/400 of being blueish, 160/400 of being reddish, 100/400 of being greenish
  Any hue has probability of 200/400 of being of display 1, 100/400 of display 2, 100/400 of display 3

Example: Expected Frequencies

Given the above table, what is the likelihood, under independence, of:
  Being a blueish display 1?

    $p = \frac{140}{400} \times \frac{200}{400} = 0.175$   (0.175 × 400 = 70 cases)

  Being a greenish display 2?

    $p = \frac{100}{400} \times \frac{100}{400} = 0.0625$   (0.0625 × 400 = 25 cases)


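Translating into expected frequencies

Observed

             Blueish   Reddish   Greenish   Total
  Display 1       46        82         72     200
  Display 2       42        38         20     100
  Display 3       52        40          8     100
  Total          140       160        100     400

Expected

             Blueish   Reddish   Greenish   Total
  Display 1       70        80         50     200
  Display 2       35        40         25     100
  Display 3       35        40         25     100
  Total          140       160        100     400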
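Extending the two hand calculations to every cell, the expected count for a cell is its row total times its column total, divided by the grand total. A minimal sketch of that step, using the display/hue counts (variable names are assumptions for illustration):

```python
observed = [
    [46, 82, 72],   # Display 1: blueish, reddish, greenish
    [42, 38, 20],   # Display 2
    [52, 40, 8],    # Display 3
]

row_totals = [sum(row) for row in observed]          # 200, 100, 100
col_totals = [sum(col) for col in zip(*observed)]    # 140, 160, 100
n = sum(row_totals)                                  # 400

# Under independence, p(cell) = p(row) * p(column), so the expected
# count is n * (row_total / n) * (col_total / n) = row_total * col_total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```

By construction, the expected table has the same marginal totals as the observed one.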


Chi-square of contingency table

In this example the chi-square is computed as follows:

Observed - expected:

             Blueish   Reddish   Greenish
  Display 1    46-70     82-80      72-50
  Display 2    42-35     38-40      20-25
  Display 3    52-35     40-40       8-25

Squared and divided by expected:

             Blueish   Reddish   Greenish
  Display 1     8.23      0.05       9.68
  Display 2     1.40      0.10       1.00
  Display 3     8.26      0.00      11.56

Summed, giving the chi-square:

  $\chi^2 = 40.28$


Degrees of freedom of contingency table

There are two variables
  One has 2 degrees of freedom (3 possible values)
  Other has 2 degrees of freedom (3 possible values)
  Total: 2 * 2 = 4 degrees of freedom

Knowing the marginal values, setting 4 values in the table lets us derive the 5 remaining ones

In general: df = (#rows - 1) · (#columns - 1)

Comparison against chi-square distribution with df = 4:
  Significant: p < 0.01
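The whole display/hue test fits in a short script. This is a sketch under stated assumptions: the expected counts are the ones derived from the margins, the variable names are illustrative, and the decision uses the standard tabulated critical value 13.277 (chi-square, df = 4, significance level 0.01) instead of an exact p-value.

```python
observed = [[46, 82, 72], [42, 38, 20], [52, 40, 8]]
expected = [[70, 80, 50], [35, 40, 25], [35, 40, 25]]   # from the margins

# Per-cell contributions (f_o - f_e)^2 / f_e, then their sum.
contributions = [
    [(fo - fe) ** 2 / fe for fo, fe in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in contributions)

# df = (#rows - 1) * (#columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Tabulated 0.01 critical value of the chi-square distribution with df = 4.
CRITICAL_01_DF4 = 13.277
print(round(chi2, 2), df, chi2 > CRITICAL_01_DF4)
```

Exceeding the 0.01 critical value is equivalent to the statement p < 0.01.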

Chi-square in Spreadsheet

Excel and OpenOffice spreadsheets have chi-square testing

chitest(observed, expected)
  Where observed and expected are data arrays of the same size
  Returns the p-value of the test, i.e. the probability of a chi-square at least this large under the null hypothesis
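Outside a spreadsheet, a rough stand-in for chitest(observed, expected) can be sketched with the Python standard library alone. The names `chi2_sf` and `chitest` are illustrative, not a real library API, and the degrees-of-freedom rule used here ((rows - 1) · (columns - 1), or n - 1 for a single row or column) is an assumption about the spreadsheet function's behaviour that should be checked against its documentation. The upper-tail probability uses the closed forms for df = 1 and df = 2 plus the standard recurrence for the regularized incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting points: df = 2 (a = 1) or df = 1 (a = 1/2).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a * e^(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chitest(observed, expected):
    """p-value for two equally sized tables given as lists of rows."""
    rows, cols = len(observed), len(observed[0])
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    if rows > 1 and cols > 1:
        df = (rows - 1) * (cols - 1)
    else:
        df = max(rows, cols) - 1
    return chi2_sf(chi2, df)

# Operating-system sample against a uniform expectation (df = 3).
p_os = chitest([[130, 55, 50, 21]], [[64, 64, 64, 64]])
# Display/hue table against the expectation from the margins (df = 4).
p_hue = chitest([[46, 82, 72], [42, 38, 20], [52, 40, 8]],
                [[70, 80, 50], [35, 40, 25], [35, 40, 25]])
print(p_os, p_hue)
```

Where SciPy is available, `scipy.stats.chisquare` and `scipy.stats.chi2_contingency` cover the same ground and are the more robust choice.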

Interpreting dependence

Knowing that variables are dependent does not tell us how
  Analogous to the single-factor ANOVA testing we learned (but unlike Pearson correlation, which is directional)
  Have to consider interpretation carefully

For instance, consider the following data
Suppose we are investigating the following relation:
  Are people drinking tea likely to drink coffee? (i.e., is tea-drinking correlated with coffee-drinking?)

Interpreting Dependence

We gather the following results:

           Coffee   -Coffee   Total
  Tea          20         5      25
  -Tea         70         5      75
  Total        90        10     100

The chi-square test is significant at about the 0.05 level (p ≈ 0.05)
But what is the direction of dependence?

Naively: Out of 25 tea drinkers, 20 also drink coffee (80%!)
But:
  90% of people drink coffee (a-priori)
  93% of non-tea drinkers drink coffee
So: Negative dependence between tea and coffee!
  (can be revealed by comparing observed to expected)
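The direction of the dependence can be read off by putting the observed tea-and-coffee count next to the count expected under independence. A small sketch using the coffee/tea counts (variable names are assumptions for illustration):

```python
observed = {
    ("Tea", "Coffee"): 20, ("Tea", "-Coffee"): 5,
    ("-Tea", "Coffee"): 70, ("-Tea", "-Coffee"): 5,
}
n = sum(observed.values())                                           # 100

tea_total = observed[("Tea", "Coffee")] + observed[("Tea", "-Coffee")]
coffee_total = observed[("Tea", "Coffee")] + observed[("-Tea", "Coffee")]

# Proportion of coffee drinkers overall and within each tea group.
p_coffee = coffee_total / n
p_coffee_given_tea = observed[("Tea", "Coffee")] / tea_total
p_coffee_given_no_tea = observed[("-Tea", "Coffee")] / (n - tea_total)

# Expected tea-and-coffee count under independence vs. the observed count.
expected_both = tea_total * coffee_total / n
direction = "negative" if observed[("Tea", "Coffee")] < expected_both else "positive"

print(p_coffee, p_coffee_given_tea, round(p_coffee_given_no_tea, 3))
print(expected_both, direction)
```

Fewer tea-and-coffee drinkers are observed than independence would predict, which is what the negative dependence refers to.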
