Post on 18-Jan-2016
Contingency Tables
• Structure of contingency tables.
• Symmetry.
• Rules.
• Populations and sub-populations.
• Grand, Column, and Row percentiles.
• Simpson’s Paradox.
• Using tables for inference testing – Chi2.
Contingency Tables as Descriptive Statistical Tools
One of the most common types of statistical tools, so called because the two variables in the table are contingent (or dependent) upon one another.
Also called crosstabs because the two variables cross-tabulate (or refer to) a single subject.
They can be descriptive, or inferential using the Chi2 statistic (pronounced “k-eye” square).
A deceptively simple tool: tables look easy to interpret but are not!
They show a wealth of information about a population and subsets of that population.
Structure of Contingency Tables
The variables are in an rXc (rows by columns) structure, not just columns.
Groups of rows and columns are defined by one variable each, such as eye and hair colour.
Individual rows and columns represent the subcategories within each variable, e.g. blue, brown, black etc.
The number of subcategories defines the size of the table, e.g. a 4X4 table would have two variables each with four subcategories, as follows…
Hair and Eye Colour Among Non-Indigenous North American Population
HAIR
         BLACK   BROWN   RED   BLOND   TOTAL
EYES
Brown    68      119     26    7       220
Blue     20      84      17    94      215
Hazel    15      54      14    10      93
Green    5       29      14    16      64
TOTAL    108     286     71    127     592
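Anatomy of a 4X4 table
• Variable #1 and Variable #2, each divided into subcategories (Subcategory #1, Subcategory #2, …).
• Data values for subjects are in the cells.
• Row totals are for VAR #2.
• Column totals are for VAR #1.
• Row and column totals are the Marginal Totals.
• Grand Total.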
Differences Between Contingency Tables and Spreadsheets
Columns
  Spreadsheets: columns are variables.
  Contingency tables: columns are a subset of a variable.
Rows
  Spreadsheets: rows are cases.
  Contingency tables: rows are a subset of a variable.
Cells
  Spreadsheets: cells are individual case values (e.g. a person) within a single variable.
  Contingency tables: cells are total case values (e.g. # of people) with both variables.
Column totals
  Spreadsheets: the sum of all cases within that variable.
  Contingency tables: the sum of all cases within that subset of a variable.
Row totals
  Spreadsheets: the sum of a case across all variables.
  Contingency tables: the sum of all cases within that subset of a variable.
Grand totals
  Both: grand totals (sum of row sums and column sums) are the total of all cases in the dataset across all variables.
Contingency Table Symmetry
Tables are symmetrical when there are the same number of sub-category rows and columns, e.g. a 4X4 table.
Tables are asymmetric when the numbers of sub-category rows and columns differ, e.g. a 4X6 table.
Asymmetric tables are difficult to do inferential testing on, so avoid them.
Higher-level tables that have more than two variables also exist but are complex to analyse.
[Figure: skeleton layouts of a 4X4 table and a 4X6 table. In each, one VARIABLE heads the columns (SUB-CAT 1 to 4, or 1 to 6) and the other VARIABLE heads the rows (SUB-CAT 1 to 4).]
Table rules for mix of variables
• If variables have a potential relationship then the dependent variable goes on the ‘Y’ axis.
• If one variable is a population then that variable goes on the ‘Y’ axis.
• If both variables are categorical then it does not matter where each goes.
Example layouts shown (row variable by column variable):
• Hair Colour (Brown, Blue) by Eye Colour (Brown, Blue)
• Gender (Male, Female) by Eye Colour (Brown, Blue)
• Cholesterol (High, Low) by Fat Intake (High, Low)
Table Rules Cont…
Independence: every subject must have the same chance of being selected.
Exclusivity: subjects can fall only into one cell; e.g. cannot use data drawn from multiple
responses to a single question (no one eye blue, one eye brown or no eye colour!).
Exhaustive: subcategories should include all responses received (i.e. the sum of rows must equal the sum of columns, or no-one can have hair colour without eye colour or vice versa).
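The exclusivity and exhaustiveness rules imply that the row totals and the column totals must add up to the same grand total. A minimal sketch of that check on the hair and eye colour counts from the slides; the function name `marginals` and the list layout are illustrative assumptions, not part of the original material:

```python
# Hair/eye counts from the slides: rows are eye colours, columns are hair colours.
hair = ["Black", "Brown", "Red", "Blond"]
eyes = ["Brown", "Blue", "Hazel", "Green"]
counts = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def marginals(table):
    """Return (row_totals, column_totals, grand_total) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


row_totals, col_totals, grand_total = marginals(counts)

# Every subject sits in exactly one cell, so both sets of marginal totals
# must add up to the same grand total.
assert sum(row_totals) == sum(col_totals) == grand_total
print(row_totals, col_totals, grand_total)
```

If a subject were counted twice, or had one colour recorded without the other, this check would fail.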
Raw Data
HAIR
BLACK BROWN RED BLOND TOTAL
EYES
Brown 68 119 26 7 220
Blue 20 84 17 94 215
Hazel 15 54 14 10 93
Green 5 29 14 16 64
TOTAL 108 286 71 127 592
Grand Total Percentiles
Black Brown Red Blond TOTAL
EYES
Brown 11.49% 20.10% 4.39% 1.18% 37.16%
Blue 3.38% 14.19% 2.87% 15.88% 36.32%
Hazel 2.53% 9.12% 2.36% 1.69% 15.71%
Green 0.84% 4.90% 2.36% 2.70% 10.81%
TOTAL 18.24% 48.31% 11.99% 21.45% 100.00%
There are fewer black haired than blonde haired people.
Rarest combinations:
1. Black hair/green eyes = <1 person in 100
2. Blonde hair/brown eyes = about 1.2 people in 100
Commonest combinations:
1. Brown hair/brown eyes = 20% of population
2. Blonde hair/blue eyes = about 16% of population
And for the curious…
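The grand-total percentages can be reproduced by dividing every cell by the grand total of 592. The following is a small sketch, assuming the raw counts from the hair and eye colour table; the variable names are illustrative only:

```python
# Raw counts from the slides: one list per eye colour, ordered by hair colour.
hair = ["Black", "Brown", "Red", "Blond"]
rows = {
    "Brown": [68, 119, 26, 7],
    "Blue": [20, 84, 17, 94],
    "Hazel": [15, 54, 14, 10],
    "Green": [5, 29, 14, 16],
}

grand_total = sum(sum(r) for r in rows.values())

# Each cell as a percentage of the whole population, keyed by (eye, hair).
pct = {
    (eye, h): 100 * n / grand_total
    for eye, r in rows.items()
    for h, n in zip(hair, r)
}

rarest = min(pct, key=pct.get)
commonest = max(pct, key=pct.get)
print(rarest, round(pct[rarest], 2))
print(commonest, round(pct[commonest], 2))
```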
Deriving Useful Information
Tables depend on proportional calculations using marginal totals (row and column) to be useful.
In reality you are dealing with sub groups of the population.
Each row total and column total represents a sub-population.
Three proportional (percentile) tables are derived and analysed:
• Population as a whole.
• Variable #1 sub-population – Hair across all eye categories.
• Variable #2 sub-population – Eyes across all hair categories.
Raw Data
HAIR
         BLACK   BROWN   RED   BLOND   TOTAL
EYES
Brown    68      119     26    7       220
Blue     20      84      17    94      215
Hazel    15      54      14    10      93
Green    5       29      14    16      64
TOTAL    108     286     71    127     592
Populations and Sub-Populations
BROWN/Black, GREEN/Brown, HAZEL/Brown, BLUE/Blonde, BROWN/Brown, BROWN/Brown,
BROWN/Brown, BROWN/Brown, BROWN/Brown, BROWN/Black, BROWN/Black, BROWN/Brown,
HAZEL/Black, HAZEL/Red, HAZEL/Black, HAZEL/Black, HAZEL/Red, HAZEL/Red,
HAZEL/Brown, HAZEL/Blonde, HAZEL/Red, GREEN/Brown, GREEN/Brown, GREEN/Black,
GREEN/Black, BLUE/Blonde, BLUE/Blonde, BLUE/Brown, BLUE/Black, BLUE/Brown,
BLUE/Blonde, BLUE/Blonde, BLUE/Blonde, BLUE/Black
This is a population of 34 people. Each person has an EYE colour and a Hair colour.
This is the same population of 34. They have been told to group themselves into four sub-populations (Subpop #1 to #4) according to EYE colour:
• GREEN (n=5)
• BROWN (n=9)
• HAZEL (n=10)
• BLUE (n=10)
This is the same population of 34. Now they have been told to group themselves into four sub-populations (Subpop #1 to #4) according to Hair colour:
• Red (n=4)
• Blonde (n=7)
• Brown (n=13)
• Black (n=10)
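Grouping the same 34 people first by eye colour and then by hair colour is what a contingency table does in one step. A minimal sketch using Python's standard `collections.Counter`; the `people` list is rebuilt from the 34 eye/hair pairs shown on the slides, and the variable names are assumptions made for illustration:

```python
from collections import Counter

# The 34 people from the slides, one (eye, hair) pair per person.
people = (
    [("Brown", "Black")] * 3 + [("Brown", "Brown")] * 6
    + [("Green", "Brown")] * 3 + [("Green", "Black")] * 2
    + [("Hazel", "Brown")] * 2 + [("Hazel", "Black")] * 3
    + [("Hazel", "Red")] * 4 + [("Hazel", "Blonde")] * 1
    + [("Blue", "Blonde")] * 6 + [("Blue", "Brown")] * 2
    + [("Blue", "Black")] * 2
)

cells = Counter(people)                            # cell counts: both variables at once
eye_totals = Counter(eye for eye, _ in people)     # sub-populations by eye colour
hair_totals = Counter(hair for _, hair in people)  # sub-populations by hair colour

print(len(people), dict(eye_totals), dict(hair_totals))
```

The two sets of sub-population sizes are the marginal totals of the table whose cells are held in `cells`.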
TABLE 1: RAW DATA (ALL SUBJECTS’ RAW DATA)
200 subjects divided among four categories: yes smoke, no smoke, yes disease, no disease.
SMOKE
         Yes     No      Total
DISEASE
Yes      13      6       19
No       37      144     181
Total    50      150     200

TABLE 2: NO SUBSETS, GRAND TOTAL %ages (ALL SUBJECTS’ OVERALL STATUS)
200 total subjects proportionalised to grand total.
SMOKE
         Yes     No      Total
DISEASE
Yes      6.5%    3%      9.5%
No       18.5%   72%     90.5%
Total    25%     75%     100%

TABLE 3: SUBSET DISEASED, ROW %ages (DISEASED’ SMOKING STATUS)
200 total subjects proportionalised to row (diseased) total.
SMOKE
         Yes     No      Total
DISEASE
Yes      68.4%   31.6%   100%
No       20.4%   79.6%   100%
Total    25.0%   75.0%   100%

TABLE 4: SUBSET SMOKERS, COLUMN %ages (SMOKERS’ DISEASE STATUS)
200 total subjects proportionalised to column (smoke) total.
SMOKE
         Yes     No      Total
DISEASE
Yes      26.0%   4.0%    9.5%
No       74.0%   96.0%   90.5%
Total    100%    100%    100%
Interpreting Raw Data
TABLE 1: RAW DATA
SMOKE
         Yes     No      Total
DISEASE
Yes      13      6       19
No       37      144     181
Total    50      150     200
All 200 subjects are divided up among the four categories:
• Smoker with disease (n=13)
• Smoker with no disease (n=37)
• Non-smoker with disease (n=6)
• Non-smoker with no disease (n=144)
And there are four sub-totals:
• Diseased (n=19)
• Not diseased (n=181)
• Smokers (n=50)
• Non-smokers (n=150)
How do we draw conclusions about the risks of smoking from these data?
Interpreting Grand Total Percentiles
All 200 subjects are proportionalised to the grand total. Now:
• Population who smoked and had heart disease (6.5%)
• Population who smoked and had no heart disease (18.5%)
• Population who didn’t smoke and had heart disease (3.0%)
• Population who didn’t smoke and had no heart disease (72%)
Are smoking and disease related?
Only 26% of smokers were diseased (6.5%/25%*100).
Yet 68% of diseased people were smokers (6.5%/9.5%*100).
TABLE 1: RAW DATA
SMOKE
         Yes     No      Total
DISEASE
Yes      13      6       19
No       37      144     181
Total    50      150     200
TABLE 2: NO SUBSETS, GRAND TOTAL PERCENTAGES
SMOKE
         Yes     No      Total
DISEASE
Yes      6.5%    3%      9.5%
No       18.5%   72%     90.5%
Total    25%     75%     100%
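The two divisions quoted on this slide take the same joint cell (6.5%) and divide it by two different marginal totals. A short sketch of that arithmetic, using the Table 2 percentages; the variable names are illustrative:

```python
# Grand-total percentages taken from Table 2.
smoked_and_diseased = 6.5   # % of all subjects who smoked and had disease
all_smokers = 25.0          # % of all subjects who smoked (column total)
all_diseased = 9.5          # % of all subjects with disease (row total)

# Same numerator, different denominators, very different answers.
pct_of_smokers_diseased = smoked_and_diseased / all_smokers * 100
pct_of_diseased_smokers = smoked_and_diseased / all_diseased * 100

print(round(pct_of_smokers_diseased, 1))  # share of smokers who were diseased
print(round(pct_of_diseased_smokers, 1))  # share of diseased who were smokers
```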
Interpreting Column Percentiles
All 200 subjects are proportionalised to the column total. Now we are interpreting the data from the perspective of a subset of the sample – a person’s smoking status. Now:
• Smoker with disease (26%)
• Smoker with no disease (74%)
• Non-smoker with disease (4%)
• Non-smoker with no disease (96%)
Now what do we say?
About three quarters of smokers don’t get sick!
That’s where you would stop the analysis if you worked for the tobacco companies.
TABLE 4: SUBSET SMOKERS, COLUMN PERCENTAGES
SMOKE
         Yes     No      Total
DISEASE
Yes      26.0%   4.0%    9.5%
No       74.0%   96.0%   90.5%
Total    100%    100%    100%
Interpreting Row Percentiles
All 200 subjects are proportionalised to the row total. Now we are interpreting the data from the perspective of the other subset of the sample – a person’s disease status. Now:
• Diseased and smoker (68.4%)
• Diseased and non-smoker (31.6%)
• Not diseased and smoker (20.4%)
• Not diseased and non-smoker (79.6%)
Now what do we say?
Sixty-eight percent of people with heart disease also smoke, while only about 20% of the sample who were free of heart disease were smokers.
TABLE 3: SUBSET DISEASED, ROW PERCENTAGES
SMOKE
         Yes     No      Total
DISEASE
Yes      68.4%   31.6%   100%
No       20.4%   79.6%   100%
Total    25.0%   75.0%   100%
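Row and column percentages differ only in which marginal total each cell is divided by. A minimal sketch that derives both from the raw smoking counts; the helper names `row_percentages` and `column_percentages` are assumptions made for illustration:

```python
def row_percentages(table):
    """Each cell as a percentage of its own row total."""
    return [[100 * n / sum(row) for n in row] for row in table]


def column_percentages(table):
    """Each cell as a percentage of its own column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100 * n / t for n, t in zip(row, col_totals)] for row in table]


# Raw data: rows are disease (yes, no); columns are smoke (yes, no).
smoking = [[13, 6], [37, 144]]

by_disease = row_percentages(smoking)     # the diseased subset's smoking status
by_smoking = column_percentages(smoking)  # the smokers subset's disease status

print([[round(p, 1) for p in row] for row in by_disease])
print([[round(p, 1) for p in row] for row in by_smoking])
```

The top-left cell is the same 13 people in both results; only the sub-population it is compared against changes.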
Summary
Two main points:
1. The different tables give different perspectives, so be careful to use the correct subset interpretation. For example, the row-based percentiles in our analysis were about disease status and not smoking status: 68% of people with heart disease smoke, and not 68% of smokers have heart disease!
2. Watch proportions and size of sample subsets:
• Only 50 of 200 smoked.
• Only 19 of 200 had heart disease.
• Only 13 of 200 had heart disease and smoked.
The effect of so many not being diseased and not smoking can overwhelm the other effects, either
masking them or exaggerating them.
Simpson’s Paradox
Crops up often when using contingency tables in the social sciences.
Refers to the apparent reversal of relationships seen in disaggregated data when it is combined.
Product of disproportionality among subsets and lurking variables (note the previous
smoker/disease data).
Example of Simpson’s Paradox
Results of two surveys done 20 years apart.
Age 55-64
               Dead         Alive        Total
Smokers        51 (44%)     64 (56%)     115 (100%)
Non-smokers    40 (33%)     81 (67%)     121 (100%)
Total          91 (39%)     145 (61%)    236 (100%)
Age 65-74
               Dead         Alive        Total
Smokers        29 (81%)     7 (19%)      36 (100%)
Non-smokers    101 (78%)    28 (22%)     129 (100%)
Total          130 (79%)    35 (21%)     165 (100%)
Age 55-74 Combined
               Dead         Alive        Total
Smokers        80 (53%)     71 (47%)     151 (100%)
Non-smokers    141 (56%)    109 (44%)    250 (100%)
Total          221 (55%)    180 (45%)    401 (100%)
In both tables, smokers’ die-off rates are higher than non-smokers’.
But when the tables are combined, smokers’ die-off rates for the whole period are lower. Why?
Because dead smokers tell no tales! Smokers die off considerably faster in the earlier period and there are fewer of them around to be counted in the later one. As well, older people’s mortality is obviously higher.
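The reversal can be checked directly from the counts in the three tables. The sketch below assumes the dead and total counts exactly as tabulated; the dictionary layout and the `death_rate` helper are illustrative choices:

```python
# (dead, total) counts for each group in each age band, from the tables.
strata = {
    "55-64": {"Smokers": (51, 115), "Non-smokers": (40, 121)},
    "65-74": {"Smokers": (29, 36), "Non-smokers": (101, 129)},
}


def death_rate(dead, total):
    """Percentage of the group that died."""
    return 100 * dead / total


# Within each age band, compare smokers with non-smokers.
for age, groups in strata.items():
    rates = {g: round(death_rate(*counts), 1) for g, counts in groups.items()}
    print(age, rates)

# Combine the two age bands by adding the counts, not the percentages.
combined = {}
for groups in strata.values():
    for group, (dead, total) in groups.items():
        d, t = combined.get(group, (0, 0))
        combined[group] = (d + dead, t + total)

combined_rates = {g: round(death_rate(*c), 1) for g, c in combined.items()}
print("55-74", combined_rates)
```

Smokers make up roughly half of the younger band (115 of 236) but only about a fifth of the older, higher-mortality band (36 of 165), which is the disproportion that drives the reversal.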
Using Tables for Inference Testing
Test for significant differences or relationships rather than just describing the data.
Based on comparing the observed cell values to those that could be expected using probability theory and assuming there are no significant differences or relationships.
Stated: the probability of falling into a particular cell is the product of the probability of being in a particular row and the probability of being in a particular column.
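Multiplying the row probability by the column probability and then by the number of subjects gives the count expected in each cell if the two variables were unrelated. A small sketch applied to the earlier smoking and disease counts; the function name `expected_counts` is a hypothetical helper, not something from the slides:

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # n * P(row) * P(column) simplifies to (row total * column total) / n.
    return [[r * c / n for c in col_totals] for r in row_totals]


# Raw data: rows are disease (yes, no); columns are smoke (yes, no).
smoking = [[13, 6], [37, 144]]
expected = expected_counts(smoking)
print(expected)
```

With no relationship, fewer than five diseased smokers would be expected, against the 13 actually observed.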
Calculating Chi Square
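The statistic most frequently used for inference with contingency tables is called the Chi Square statistic, written as Chi2 or χ² (χ is the Greek letter chi).
It is based on an expected versus actual values methodology and its formula is:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$$
where $x_{ij}$ is the observed count and $e_{ij}$ the expected count in row $i$, column $j$ of an r X c table.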
Translated this says:
$$\chi^2 = \text{the sum of } \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
where the expected cell counts are given by:
$$\text{expected} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$
An Example
Are E. coli counts different between two lakes in Muskoka, one with cottages and one without?
1. Collect 200 samples of water from each.
2. Measure E. coli concentrations.
3. Is the sample above or below the acceptable background limit?
How to test this?
Collect Observed Values
Four hundred samples, 200 from each lake.
Lake     No Cottages   Cottages   Total
Above    43            81         124
Below    157           119        276
Total    200           200        400
Calculate Expected Values
expected = (row total)*(column total) / grand total
E.g. 124*200/400 = 62
Lake                No Cottages   Cottages   Total
Above (observed)    43            81         124
Above (expected)    62            62
Below (observed)    157           119        276
Below (expected)    138           138
Total               200           200        400
Calculate Squared Deviation (O-E)^2 Values for Cells
Chi2 = the sum of (observed-expected)^2 / expected
E.g. (43-62)^2 = 361
Lake                No Cottages   Cottages   Total
Above (observed)    43            81         124
Above (expected)    62            62
(O-E)^2             361           361
Below (observed)    157           119        276
Below (expected)    138           138
(O-E)^2             361           361
Total               200           200        400
Divide (O-E)^2 Values by Expected Values
Chi2 = the sum of (observed-expected)^2 / expected
E.g. 361/62 = 5.82
Lake                No Cottages   Cottages   Total
Above (observed)    43            81         124
Above (expected)    62            62
(O-E)^2             361           361
(O-E)^2/E           5.82          5.82
Below (observed)    157           119        276
Below (expected)    138           138
(O-E)^2             361           361
(O-E)^2/E           2.61          2.61
Total               200           200        400
Sum the (O-E)^2/Expected Values
Chi2 = the sum of (observed-expected)^2 / expected
Lake                No Cottages   Cottages   Total
Above (observed)    43            81         124
Above (expected)    62            62
(O-E)^2             361           361
(O-E)^2/E           5.82          5.82
Below (observed)    157           119        276
Below (expected)    138           138
(O-E)^2             361           361
(O-E)^2/E           2.61          2.61
Total               200           200        400
Chi2 = 5.82 + 5.82 + 2.61 + 2.61 = 16.86
Compare 16.86 to the ‘book’ value.
If it is greater than the book value, there are significant differences in the table.
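The four steps above can be sketched in a few lines of code. Two assumptions are made that the slides do not state: the ‘book’ value used here is 3.841, the standard critical value for 1 degree of freedom at the 0.05 significance level, and the function name `chi_square` is illustrative. Computed without intermediate rounding the statistic is about 16.88; the 16.86 on the slide comes from cell values cut to two decimals.

```python
def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected cell count
            total += (o - e) ** 2 / e
    return total


# Rows: above / below background; columns: no-cottage lake / cottage lake.
lakes = [[43, 81], [157, 119]]

chi2 = chi_square(lakes)
df = (len(lakes) - 1) * (len(lakes[0]) - 1)  # (rows - 1) * (columns - 1)
CRITICAL_05_DF1 = 3.841  # assumed 'book' value: alpha = 0.05, 1 degree of freedom

print(round(chi2, 2), df, chi2 > CRITICAL_05_DF1)
```

Under those assumptions the statistic is well above the book value, so the difference between the two lakes would be judged significant.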
Interpreting the Example
We observed 43 samples from the no-cottage lake above background but expected 62.
We observed 81 samples from the cottage lake above background but expected 62.
We observed 157 samples from the no-cottage lake below background but expected 138.
We observed 119 samples from the cottage lake below background but expected 138.
Lake                No Cottages   Cottages   Total
Above (observed)    43            81         124
Above (expected)    62            62
(O-E)^2             361           361
(O-E)^2/E           5.82          5.82
Below (observed)    157           119        276
Below (expected)    138           138
(O-E)^2             361           361
(O-E)^2/E           2.61          2.61
Total               200           200        400
Remember: watch your table manners.