Date posted: 19-Dec-2015
Statistical Methods in Computer Science
Hypothesis Testing III: Categorical dependence and χ²
Ido Dagan
Empirical Methods in Computer Science © 2006-now Gal Kaminka/Ido Dagan

Testing Categorical Data
Single-factor design, with a categorical dependent variable.
Determine the effect of the independent variable's values (nominal/categorical too) on the dependent variable:
  treatment1   Ind1 & Ex1 & Ex2 & .... & Exn ==> Dep1
  treatment2   Ind2 & Ex1 & Ex2 & .... & Exn ==> Dep2
  control             Ex1 & Ex2 & .... & Exn ==> Dep3
Ind1, Ind2 are values of the independent variable; Dep1, Dep2, Dep3 are values of the dependent variable.
Example: compare the performance of algorithm A to B to C ....
Dep1, Dep2, Dep3, .... are categorical data, so we cannot use the numerical ANOVA testing.

Contingency tables
We collect categorical data in contingency tables.

Different screens and their hue bias:
           Blueish  Reddish  Greenish  Total
Display 1       46       82        72    200
Display 2       42       38        20    100
Display 3       52       40         8    100
Total          140      160       100    400

Coffee and tea drinkers in a sample of 100 people:
        Coffee  -Coffee  Total
Tea         20        5     25
-Tea        70        5     75
Total       90       10    100

Operating system used at home:
Windows  Linux  MacOS X  Other  Total
    130     55       50     21    256
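
As a small illustration of how such a table can be held in code, here is a minimal Python sketch that stores the display/hue counts as a list of rows and derives the marginal totals. The variable names (`observed`, `row_totals`, `col_totals`) are chosen for this example only and do not come from the slides.

```python
# Display/hue contingency table: rows are displays, columns are hues.
hues = ["Blueish", "Reddish", "Greenish"]
displays = ["Display 1", "Display 2", "Display 3"]
observed = [
    [46, 82, 72],
    [42, 38, 20],
    [52, 40, 8],
]

# Marginal totals: one per row, one per column, plus the grand total.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

for name, row, total in zip(displays, observed, row_totals):
    print(name, row, total)
print("Total", col_totals, grand_total)
```

The margins computed here (200/100/100 by display, 140/160/100 by hue) are the same ones printed in the table's Total row and column.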

Is there a relation?
We want to know whether there is a dependence between the variables, or between what we observed and some expectation.
  e.g., are tea-drinkers more likely to be coffee-drinkers?
  e.g., is the selection of operating systems different from what is expected?
To find out, we cannot ask about means, medians, ...
Focus instead on proportions/distributions:
  Does the observed distribution differ from the expected one?

Single Sample
Suppose we have an a-priori notion of what is expected.
For instance, the selection of OS is expected to be uniform.
How do we tell whether the observed data is different?

Observed:
Windows  Linux   MacOS X  Other   Total
    130     55        50     21     256

Expected:
Windows  Linux   MacOS X  Other   Total
 25.00%  25.00%   25.00%  25.00%  100.00%
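
Hypotheses
The hypotheses refer to the number of data points (the frequency) of each value.
H0: Expected distribution = underlying distribution of the observed data
H1: Expected distribution != underlying distribution of the observed data
We need a method to measure how likely the observed data are under the null hypothesis.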

Difference in Expectations
Focus on the difference between expected and observed.
For instance:

Expected (out of 256 sampled homes): 25% for each operating system = 64 homes for each operating system
Windows  Linux   MacOS X  Other   Total
 25.00%  25.00%   25.00%  25.00%  100.00%
     64      64       64      64      256

Observed (out of 256 sampled homes):
Windows  Linux   MacOS X  Other   Total
    130      55       50      21      256

Chi-square
The chi-square (χ²) of given data is:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
Where:
  f_e is the expected frequency of the value
  f_o is the observed frequency of the value
  The sum is over all values
Recall from Cramer correlation.
Chi-square values of random samples are distributed according to the chi-square distribution.
  A family of distributions, according to degrees of freedom
  A large enough chi-square indicates significance: reject the null hypothesis
  Values that are too low are unlikely: check for error

In our example
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(130-64)^2}{64} + \frac{(55-64)^2}{64} + \frac{(50-64)^2}{64} + \frac{(21-64)^2}{64} = 101.28$$
This is compared to the chi-square distribution with df = 3.
Result: highly significant.
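
The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `chi_square` is hypothetical, and the constant 7.815 is the standard tabulated 0.05 critical value of the chi-square distribution with 3 degrees of freedom.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


observed = [130, 55, 50, 21]                     # Windows, Linux, MacOS X, Other
n = sum(observed)                                # 256 sampled homes
expected = [n / len(observed)] * len(observed)   # uniform expectation: 64 each

stat = chi_square(observed, expected)
df = len(observed) - 1                           # 4 categories -> df = 3

# Tabulated 0.05 critical value for df = 3 (assumption: 0.05 significance level).
CRITICAL_0_05_DF3 = 7.815
print(round(stat, 2), df, stat > CRITICAL_0_05_DF3)   # 101.28 3 True
```

The statistic is far above the critical value, which agrees with the "highly significant" conclusion.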

Features of chi-square
Cannot be negative (sum of squares).
Only tests for a difference from the expected distribution.
  But here it is a one-tailed test (only tests for a greater value)
Equals 0 when all frequencies are exactly as expected.
It's not the size of the discrepancy, but its relative size.
  Relative to the expected frequency
Depends on the number of discrepancies.
  This is why we have to consider degrees of freedom: the more df, the more discrepancies we might get by chance
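
To make the "relative size" point concrete, the sketch below compares the same absolute discrepancy against a small and a large expected frequency. The counts are made up for illustration only and the helper name `contribution` is hypothetical.

```python
def contribution(fo, fe):
    """Contribution of one category to chi-square: (f_o - f_e)^2 / f_e."""
    return (fo - fe) ** 2 / fe


# Same absolute discrepancy (10), different expected frequencies (illustrative numbers).
small_expected = contribution(30, 20)     # 100 / 20
large_expected = contribution(210, 200)   # 100 / 200
print(small_expected, large_expected)     # 5.0 0.5
```

A discrepancy of 10 weighs ten times more when only 20 cases were expected than when 200 were.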

Comparing against arbitrary distribution
Do we always need to assume the population is uniform/known?
No. Instead of expecting:
Windows  Linux   MacOS X  Other   Total
 25.00%  25.00%   25.00%  25.00%  100.00%
     64      64       64      64      256

we can have different expectations (from the past), for instance:
Windows  Linux   MacOS X  Other   Total
 50.00%  20.00%   20.00%  10.00%  100.00%
    128      51       51      26      256

Comparing against arbitrary distributions
          Windows  Linux  MacOS X  Other  Total
Observed      130     55       50     21    256
Expected      128     51       51     26    256

$$\chi^2 = \frac{(130-128)^2}{128} + \frac{(55-51)^2}{51} + \frac{(50-51)^2}{51} + \frac{(21-26)^2}{26} = 1.33$$
This is compared to the chi-square distribution with df = 3.
Result: not significant.
The same procedure may be used for testing against any expected distribution (e.g., Normal, to decide on t-test applicability).
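
The expected counts above are the proportions 50/20/20/10% of 256, rounded to whole homes (51.2 becomes 51, 25.6 becomes 26). The following minimal Python sketch, with hypothetical variable names, recomputes the statistic with both the rounded and the unrounded expected counts to show that the rounding does not change the conclusion.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


observed = [130, 55, 50, 21]               # Windows, Linux, MacOS X, Other
proportions = [0.50, 0.20, 0.20, 0.10]     # expectations taken from the slide
n = sum(observed)

exact = [p * n for p in proportions]       # 128, 51.2, 51.2, 25.6
rounded = [round(e) for e in exact]        # 128, 51, 51, 26 (as on the slide)

chi2_rounded = chi_square(observed, rounded)
chi2_exact = chi_square(observed, exact)
print(round(chi2_rounded, 2))   # 1.33, the value on the slide
print(round(chi2_exact, 2))     # 1.17 with unrounded expected counts
```

Both values are well below 7.815, the tabulated 0.05 critical value for df = 3, so the result is not significant either way.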

Multiple-variable contingency tables
Correlation testing: two numerical variables
Single-factor testing (one-way ANOVA):
  Independent variable: categorical
  Dependent variable: numerical
Two-way contingency table: two categorical variables
  As seen for Cramer correlation

Different screens and their hue bias:
           Blueish  Reddish  Greenish  Total
Display 1       46       82        72    200
Display 2       42       38        20    100
Display 3       52       40         8    100
Total          140      160       100    400

Hypotheses about contingency tables
Null hypothesis (H0): the values of the variables are independent of each other
  Analogous to correlation and ANOVA tests
Alternative (H1): the values of the variables are dependent
How do we use chi-square here?
  Calculate the expected frequencies from the margin probabilities
  Calculate the chi-square value
  Compare to the chi-square distribution

Example
Independent probabilities:
  Any display has probability 140/400 of being blueish, 160/400 of being reddish, 100/400 of being greenish
  Any hue has probability 200/400 of being on display 1, 100/400 on display 2, 100/400 on display 3

           Blueish  Reddish  Greenish  Total
Display 1       46       82        72    200
Display 2       42       38        20    100
Display 3       52       40         8    100
Total          140      160       100    400

Example: Expected Frequencies
Given the above table, what is the probability (assuming independence) of:
Being a blueish display 1?
$$p = \frac{140}{400} \times \frac{200}{400} = 0.175$$
  which corresponds to 0.175 × 400 = 70 cases
Being a greenish display 2?
$$p = \frac{100}{400} \times \frac{100}{400} = 0.0625$$
  which corresponds to 0.0625 × 400 = 25 cases

Translating into expected frequencies
Observed:
           Blueish  Reddish  Greenish  Total
Display 1       46       82        72    200
Display 2       42       38        20    100
Display 3       52       40         8    100
Total          140      160       100    400

Expected:
           Blueish  Reddish  Greenish  Total
Display 1       70       80        50    200
Display 2       35       40        25    100
Display 3       35       40        25    100
Total          140      160       100    400
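
The expected table can be derived mechanically from the margins of the observed one. Below is a minimal Python sketch under the independence assumption; the function name `expected_frequencies` is hypothetical and not taken from the slides.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [
    [46, 82, 72],   # Display 1: Blueish, Reddish, Greenish
    [42, 38, 20],   # Display 2
    [52, 40, 8],    # Display 3
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
# [70.0, 80.0, 50.0]
# [35.0, 40.0, 25.0]
# [35.0, 40.0, 25.0]
```

Multiplying the two marginal probabilities and then by the grand total is the same as `row_total * col_total / grand_total`, which is the form used here.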

Chi-square of contingency table
In this example the chi-square is computed as follows.

Observed - expected:
           Blueish  Reddish  Greenish
Display 1    46-70    82-80     72-50
Display 2    42-35    38-40     20-25
Display 3    52-35    40-40      8-25

Squared and divided by the expected frequency, (f_o - f_e)^2 / f_e:
           Blueish  Reddish  Greenish
Display 1     8.23     0.05      9.68
Display 2     1.40     0.10      1.00
Display 3     8.26     0.00     11.56

Summed: chi-square = 40.28
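
The per-cell values and their sum can be reproduced with the short Python sketch below. It assumes the observed and expected tables of the display/hue example; the variable names are illustrative.

```python
observed = [
    [46, 82, 72],
    [42, 38, 20],
    [52, 40, 8],
]
expected = [
    [70, 80, 50],
    [35, 40, 25],
    [35, 40, 25],
]

# Per-cell contribution (f_o - f_e)^2 / f_e, then the sum over all cells.
contributions = [
    [(fo - fe) ** 2 / fe for fo, fe in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(c, 2) for c in row])
print(round(chi2, 2))   # 40.28
```

The two largest contributions come from Display 3/Greenish and Display 1/Greenish, which is where observed and expected counts differ most relative to the expectation.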

Degrees of freedom of contingency table
There are two variables:
  One has 2 degrees of freedom (3 possible values)
  The other has 2 degrees of freedom (3 possible values)
  Total: 2 * 2 = 4 degrees of freedom
Knowing the marginal values, setting 4 values in the table lets us derive the 5 remaining ones.
In general: df = (#rows - 1) · (#columns - 1)
Comparison against the chi-square distribution with df = 4:
Significant: p < 0.01
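
For readers who want the p-value rather than a table lookup, here is a sketch in plain Python. It is an assumption of this example, not something the slides do: `chi_square_sf` uses the standard closed-form series for the upper tail of the chi-square distribution with integer degrees of freedom, and both function names are hypothetical.

```python
import math


def contingency_df(n_rows, n_cols):
    """Degrees of freedom of a two-way contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-square with integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_r chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2)) + math.sqrt(2 / math.pi) * math.exp(-half) * total


df = contingency_df(3, 3)        # 3 displays x 3 hues
p = chi_square_sf(40.28, df)     # chi-square value from the display/hue table
print(df, p < 0.01)              # 4 True
```

As a sanity check, the function returns about 0.05 at the tabulated 0.05 critical values (3.841 for df = 1, 7.815 for df = 3, 9.488 for df = 4).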

Chi-square in Spreadsheet
Excel and OpenOffice spreadsheets have chi-square testing:
  chitest(observed,expected)
  where observed and expected are data arrays of the same size
  It gives back the p-value of the test: the probability of a chi-square at least this large if the null hypothesis holds

Interpreting dependence
Knowing that variables are dependent does not tell us how.
  Analogous to the single-factor ANOVA testing we learned (but unlike Pearson correlation, which is directional)
  We have to consider the interpretation carefully
For instance, consider the following data.
Suppose we are investigating the following relation:
  Are people who drink tea likely to drink coffee? (i.e., is tea-drinking correlated with coffee-drinking?)

Interpreting Dependence
We gather the following results:
        Coffee  -Coffee  Total
Tea         20        5     25
-Tea        70        5     75
Total       90       10    100

The chi-square test is significant: p ≈ 0.05
But what is the direction of dependence?
Naively: out of 25 tea drinkers, 20 also drink coffee (80%!)
But:
  90% of people drink coffee (a-priori)
  93% of non-tea drinkers drink coffee
So: negative dependence between tea and coffee!
  (can be revealed by comparing observed to expected)
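
The comparison of observed to expected can be spelled out in a few lines. This is a minimal Python sketch over the tea/coffee table; the variable names are chosen for the example.

```python
# Rows: Tea, -Tea; columns: Coffee, -Coffee.
observed = [
    [20, 5],
    [70, 5],
]
row_totals = [sum(row) for row in observed]          # 25 tea drinkers, 75 non-tea drinkers
col_totals = [sum(col) for col in zip(*observed)]    # 90 coffee drinkers, 10 non-coffee drinkers
n = sum(row_totals)                                  # 100 people

p_coffee = col_totals[0] / n                              # a-priori: 0.90
p_coffee_given_tea = observed[0][0] / row_totals[0]       # 20 / 25 = 0.80
p_coffee_given_no_tea = observed[1][0] / row_totals[1]    # 70 / 75, about 0.93

# Expected number of tea-and-coffee drinkers if the two were independent.
expected_tea_and_coffee = row_totals[0] * col_totals[0] / n   # 22.5

print(p_coffee, p_coffee_given_tea, round(p_coffee_given_no_tea, 2))
print(observed[0][0], expected_tea_and_coffee)   # 20 observed vs 22.5 expected
```

Fewer tea drinkers drink coffee than independence would predict (20 observed against 22.5 expected), which is the negative dependence described above.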