Structure of contingency tables. Symmetry. Rules. Populations and sub-populations. Grand, Column,...


Contingency Tables

• Structure of contingency tables.
• Symmetry.
• Rules.
• Populations and sub-populations.
• Grand, Column, and Row percentages.
• Simpson’s Paradox.
• Using tables for inference testing – Chi².

Contingency Tables as Descriptive Statistical Tools

One of the most common types of statistical tools, so-called because the two variables in the table are contingent (or dependent) upon one another.

Also called crosstabs because the two variables cross tabulate (or refer to) a single subject.

They can be descriptive, or inferential when used with the Chi² statistic (pronounced “k-eye” square).

A deceptively simple tool: deceptive because they look easy to interpret but are not!

They show a wealth of information about a population and subsets of that population.


Structure of Contingency Tables

The variables are arranged in an r X c (rows by columns) structure, not just in columns.

Groups of rows and columns are defined by one variable each, such as eye and hair colour.

Individual rows and columns represent the sub categories within each variable, e.g. blue, brown, black etc.

The number of subcategories defines the size of the table, e.g. a 4X4 table would have two variables each with four sub categories as follows…

Hair and Eye Colour Among Non-Indigenous North American Population

(columns: hair colour; rows: eye colour)

          BLACK   BROWN   RED   BLOND   TOTAL
Brown        68     119    26       7     220
Blue         20      84    17      94     215
Hazel        15      54    14      10      93
Green         5      29    14      16      64
TOTAL       108     286    71     127     592

Anatomy of a 4X4 table

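[Figure: the hair and eye colour table with its parts labelled.]

Variable #1 runs across the columns; Variable #2 runs down the rows.
Each individual column or row is one subcategory (Subcategory #1, Subcategory #2, …) of its variable.
Cells hold the data values for subjects.
Row totals are for VAR #2; column totals are for VAR #1. Together they are the marginal totals.
The grand total is the count of all subjects.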

Differences Between Contingency Tables and Spreadsheets

Columns
  Spreadsheets: columns are variables.
  Contingency tables: columns are a subset of a variable.

Rows
  Spreadsheets: rows are cases.
  Contingency tables: rows are a subset of a variable.

Cells
  Spreadsheets: cells are individual case values (e.g. a person) within a single variable.
  Contingency tables: cells are total case values (e.g. # of people) with both variables.

Column totals
  Spreadsheets: column totals are the sum of all cases within that variable.
  Contingency tables: column totals are the sum of all cases within that subset of a variable.

Row totals
  Spreadsheets: row totals are the sum of a case across all variables.
  Contingency tables: row totals are the sum of all cases within that subset of a variable.

Grand totals (sum of row sums and column sums) are the total of all cases in the dataset across all variables.

Contingency Table Symmetry

Tables are symmetrical when there are the same number of sub-category rows and columns, e.g. a 4X4 table.

Tables are asymmetric when there are different numbers of sub-category rows and columns, e.g. a 4X6 table.

Asymmetric tables are difficult to do inferential testing on, so avoid them.

Higher-order tables that have more than two variables also exist but are complex to analyse.

[Figure: two blank table layouts. A 4X4 table has one variable with four sub-categories across the columns and one with four sub-categories down the rows. A 4X6 table has six sub-category columns and four sub-category rows.]

Table rules for mix of variables

If variables have a potential relationship then the dependent variable goes on the ‘Y’ axis.

If one variable is a population then that variable goes on the ‘Y’ axis.

If both variables are categorical then it does not matter where each goes.

[Figure: three example table layouts: hair colour (rows) by eye colour (columns); gender (rows) by eye colour (columns); cholesterol (rows) by fat intake (columns).]

Table Rules Cont…

Independence: every subject must have the same chance of being selected.

Exclusivity: each subject can fall into only one cell; e.g. you cannot use data drawn from multiple responses to a single question (no one eye blue, one eye brown or no eye colour!).

Exhaustive: subcategories should include all responses received (i.e. the sum of rows must equal the sum of columns, or no-one can have hair colour without eye colour or vice versa).

Raw Data

BLACK BROWN RED BLOND TOTAL

Brown 68 119 26 7 220

Blue 20 84 17 94 215

Hazel 15 54 14 10 93

Green 5 29 14 16 64

TOTAL 108 286 71 127 592

Grand Total Percentages

Black Brown Red Blond TOTAL

Brown 11.49% 20.10% 4.39% 1.18% 37.16%

Blue 3.38% 14.19% 2.87% 15.88% 36.32%

Hazel 2.53% 9.12% 2.36% 1.69% 15.71%

Green 0.84% 4.90% 2.36% 2.70% 10.81%

TOTAL 18.24% 48.31% 11.99% 21.45% 100.00%

And for the curious…

There are fewer black haired than blonde haired people.

Rarest combinations:
1. Black hair/green eyes = <1 person in 100
2. Blonde hair/brown eyes = about 1.2 people in 100

Commonest combinations:
1. Brown hair/brown eyes = 20% of population
2. Blonde hair/blue eyes = about 16% of population
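The grand total percentages can be reproduced from the raw counts. The following is a minimal sketch, not part of the original slides; the variable names (`counts`, `grand_pct` and so on) are illustrative assumptions.

```python
# Counts from the raw-data table: rows are eye colour, columns are hair colour.
hair = ["Black", "Brown", "Red", "Blond"]
counts = {
    "Brown": [68, 119, 26, 7],
    "Blue": [20, 84, 17, 94],
    "Hazel": [15, 54, 14, 10],
    "Green": [5, 29, 14, 16],
}

# Grand total: every subject counted once.
grand_total = sum(sum(row) for row in counts.values())

# Each cell expressed as a percentage of the grand total.
grand_pct = {
    (eye, h): 100 * n / grand_total
    for eye, row in counts.items()
    for h, n in zip(hair, row)
}

# Rarest and commonest (eye, hair) combinations.
rarest = min(grand_pct, key=grand_pct.get)
commonest = max(grand_pct, key=grand_pct.get)

print(f"grand total: {grand_total}")
print(f"rarest: {rarest} at {grand_pct[rarest]:.2f}%")
print(f"commonest: {commonest} at {grand_pct[commonest]:.2f}%")
```

Keys are (eye colour, hair colour) pairs, so the rarest combination comes out as green eyes with black hair and the commonest as brown eyes with brown hair.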

Deriving Useful Information

Tables depend on proportional calculations using marginal totals (row and column) to be useful.

In reality you are dealing with sub groups of the population.

Each row total and column total represents a sub-population.

Three proportional (percentage) tables are derived and analysed, one for each of:

1. The population as a whole (grand total).
2. The Variable #1 sub-populations: hair across all eye categories (column totals).
3. The Variable #2 sub-populations: eyes across all hair categories (row totals).

Raw Data

          HAIR
          BLACK   BROWN   RED   BLOND   TOTAL
Brown        68     119    26       7     220
Blue         20      84    17      94     215
Hazel        15      54    14      10      93
Green         5      29    14      16      64
TOTAL       108     286    71     127     592

Populations and Sub-Populations

[Figure: a crowd of people, each labelled with an eye colour (BROWN, BLUE, HAZEL or GREEN) and a hair colour (Black, Brown, Red or Blonde).]

This is a population of 34 people. Each person has an EYE colour and a Hair colour.

[Figure: the same people arranged into four groups, Subpop #1 to Subpop #4, by eye colour. Group sizes shown include n=5 and n=9.]

This is the same population of 34. They have been told to group themselves into four sub-populations according to EYE colour.

[Figure: the same people arranged into four groups, Subpop #1 to Subpop #4, by hair colour. Group sizes shown are n=4, n=7, n=13 and n=10.]

This is the same population of 34. Now they have been told to group themselves into four sub-populations according to Hair colour.

TABLE 1: RAW DATA (ALL SUBJECTS’ RAW DATA)
200 subjects divided among four categories: yes smoke, no smoke, yes disease, no disease.

                 Smoke Yes   Smoke No   Total
Disease Yes             13          6      19
Disease No              37        144     181
Total                   50        150     200

TABLE 2: NO SUBSETS, GRAND TOTAL %ages (ALL SUBJECTS’ OVERALL STATUS)
200 total subjects proportionalised to grand total.

                 Smoke Yes   Smoke No   Total
Disease Yes           6.5%         3%    9.5%
Disease No           18.5%        72%   90.5%
Total                  25%        75%    100%

TABLE 3: SUBSET DISEASED, ROW %ages (DISEASED’ SMOKING STATUS)
200 total subjects proportionalised to row (diseased) total.

                 Smoke Yes   Smoke No   Total
Disease Yes          68.4%      31.6%    100%
Disease No           20.4%      79.6%    100%
Total                25.0%      75.0%    100%

TABLE 4: SUBSET SMOKERS, COLUMN %ages (SMOKERS’ DISEASE STATUS)
200 total subjects proportionalised to column (smoke) total.

                 Smoke Yes   Smoke No   Total
Disease Yes          26.0%       4.0%    9.5%
Disease No           74.0%      96.0%   90.5%
Total                 100%       100%    100%
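The three percentage tables all come from the same raw counts; only the denominator changes. A minimal sketch, assuming the smoking and disease counts from Table 1, with illustrative names (`raw`, `pct`, `grand_pct`, `row_pct`, `col_pct`) that do not appear in the slides:

```python
# Raw counts from Table 1: rows are disease status, columns are smoking status.
raw = {
    "disease_yes": {"smoke_yes": 13, "smoke_no": 6},
    "disease_no": {"smoke_yes": 37, "smoke_no": 144},
}

# Marginal totals and grand total.
row_totals = {r: sum(cells.values()) for r, cells in raw.items()}
col_totals = {c: sum(raw[r][c] for r in raw) for c in ("smoke_yes", "smoke_no")}
grand_total = sum(row_totals.values())


def pct(n, d):
    """Percentage of n in d, rounded to one decimal place."""
    return round(100 * n / d, 1)


# Same cells, three different denominators.
grand_pct = {r: {c: pct(n, grand_total) for c, n in cells.items()} for r, cells in raw.items()}
row_pct = {r: {c: pct(n, row_totals[r]) for c, n in cells.items()} for r, cells in raw.items()}
col_pct = {r: {c: pct(n, col_totals[c]) for c, n in cells.items()} for r, cells in raw.items()}

for name, table in (("grand", grand_pct), ("row", row_pct), ("column", col_pct)):
    print(name, table)
```

The diseased smokers cell (n=13) becomes 6.5% of everyone, 68.4% of the diseased, or 26.0% of the smokers, depending on which total it is divided by.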


Interpreting Raw Data

TABLE 1: RAW DATA

                 Smoke Yes   Smoke No   Total
Disease Yes             13          6      19
Disease No              37        144     181
Total                   50        150     200

All 200 subjects are divided up among the four categories:

Smoker with disease (n=13)
Smoker with no disease (n=37)
Non-smoker with disease (n=6)
Non-smoker with no disease (n=144)

And there are four sub-totals:

Diseased (n=19)
Not diseased (n=181)
Smokers (n=50)
Non-smokers (n=150)

How do we draw conclusions about the risks of smoking from these data?

Interpreting Grand Total Percentages

All 200 subjects are proportionalised to the grand total. Now:

Population who smoked and had heart disease (6.5%)
Population who smoked and had no heart disease (18.5%)
Population who didn’t smoke and had heart disease (3.0%)
Population who didn’t smoke and had no heart disease (72%)

Are smoking and disease related?

Only 26% of smokers were diseased (6.5%/25%*100).
Yet 68% of diseased people were smokers (6.5%/9.5%*100).
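The two divisions above are conditional proportions: a cell's share of the grand total divided by the share of the sub-population it is being judged against. Written out, with the conditional notation P(A | B) as an assumption (the slides do not use it):

```latex
\begin{align*}
P(\text{diseased} \mid \text{smoker})
  &= \frac{P(\text{smoker and diseased})}{P(\text{smoker})}
   = \frac{6.5\%}{25\%} = 26\% \\[4pt]
P(\text{smoker} \mid \text{diseased})
  &= \frac{P(\text{smoker and diseased})}{P(\text{diseased})}
   = \frac{6.5\%}{9.5\%} \approx 68.4\%
\end{align*}
```

The numerator is the same cell in both lines; only the marginal total in the denominator changes, which is why the two figures differ so much.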

TABLE 1: RAW DATA

                 Smoke Yes   Smoke No   Total
Disease Yes             13          6      19
Disease No              37        144     181
Total                   50        150     200

TABLE 2: NO SUBSETS, GRAND TOTAL PERCENTAGES

                 Smoke Yes   Smoke No   Total
Disease Yes           6.5%         3%    9.5%
Disease No           18.5%        72%   90.5%
Total                  25%        75%    100%

Interpreting Column Percentages

All 200 subjects are proportionalised to the column total. Now we are interpreting the data from the perspective of a subset of the sample: a person’s smoking status. Now:

Smoker with disease (26%)
Smoker with no disease (74%)
Non-smoker with disease (4%)
Non-smoker with no disease (96%)

Now what do we say?
About three quarters of smokers don’t get sick!

TABLE 4: SUBSET SMOKERS, COLUMN PERCENTAGES

                 Smoke Yes   Smoke No   Total
Disease Yes          26.0%       4.0%    9.5%
Disease No           74.0%      96.0%   90.5%
Total                 100%       100%    100%

That’s where you would stop the analysis if you worked for the tobacco companies.

Interpreting Row Percentages

All 200 subjects are proportionalised to the row total. Now we are interpreting the data from the perspective of the other subset of the sample: a person’s disease status. Now:

Diseased and smoker (68.4%)
Diseased and non-smoker (31.6%)
Not diseased and smoker (20.4%)
Not diseased and non-smoker (79.6%)

Now what do we say?

Sixty-eight percent of people with heart disease also smoke, while only about 20% of the sample who were free of heart disease were smokers.

TABLE 3: SUBSET DISEASED, ROW PERCENTAGES

                 Smoke Yes   Smoke No   Total
Disease Yes          68.4%      31.6%    100%
Disease No           20.4%      79.6%    100%
Total                25.0%      75.0%    100%

Summary

Two main points:

1. The different tables give different perspectives, so you have to be careful to use the correct subset interpretation. For example, the row-based percentages in our analysis were about disease status and not smoking status: 68% of people with heart disease smoke, and not 68% of smokers have heart disease!

2. Watch proportions and size of sample subsets:

Only 50 of 200 smoked, and…
only 19 of 200 had heart disease, and…
only 13 of 200 had heart disease and smoked.

The effect of so many not being diseased and not smoking can overwhelm the other effects, either masking them or exaggerating them.

Simpson’s Paradox

Crops up often when using contingency tables in the social sciences.

Refers to the apparent reversal of relationships seen in disaggregated data when it is combined.

A product of disproportionality among subsets and lurking variables (note the previous smoker/disease data).

An example: the results of two surveys done 20 years apart.

Age 55-64

               Dead         Alive        Total
Smokers        51 = 44%     64 = 56%     115 = 100%
Non-smokers    40 = 33%     81 = 67%     121 = 100%
Total          91 = 39%     145 = 61%    236 = 100%

Age 65-74

               Dead         Alive        Total
Smokers        29 = 80%     7 = 20%      36 = 100%
Non-smokers    101 = 78%    28 = 22%     129 = 100%
Total          130 = 79%    35 = 21%     165 = 100%

In both surveys smokers’ die-off rates are higher than non-smokers’.

Age 55-74 Combined

               Dead         Alive        Total
Smokers        80 = 53%     71 = 47%     151 = 100%
Non-smokers    141 = 56%    109 = 44%    250 = 100%
Total          221 = 55%    180 = 45%    401 = 100%

But when the tables are combined, smokers’ die-off rates for the whole period are lower. Why?

Because dead smokers tell no tales! Smokers die off considerably faster in the earlier period and there are fewer of them around to be counted in the later one. As well, older people’s mortality is obviously higher.
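The reversal can be checked directly from the counts in the three tables. This is a minimal sketch with illustrative names (`strata`, `death_rate`, `older_share`) that are assumptions, not part of the slides; it also shows how unevenly the two groups are spread across the age bands.

```python
# (dead, alive) counts taken from the two age-group tables.
strata = {
    "55-64": {"smokers": (51, 64), "non_smokers": (40, 81)},
    "65-74": {"smokers": (29, 7), "non_smokers": (101, 28)},
}
groups = ("smokers", "non_smokers")


def death_rate(dead, alive):
    """Proportion of a group that is dead."""
    return dead / (dead + alive)


# Death rate within each age band.
within = {
    age: {g: death_rate(*cells[g]) for g in groups}
    for age, cells in strata.items()
}

# Death rate after pooling the two age bands.
combined = {}
for g in groups:
    dead = sum(strata[age][g][0] for age in strata)
    alive = sum(strata[age][g][1] for age in strata)
    combined[g] = death_rate(dead, alive)

# Share of each group that sits in the older, higher-mortality band.
older_share = {
    g: sum(strata["65-74"][g]) / sum(sum(strata[age][g]) for age in strata)
    for g in groups
}

for age, rates in within.items():
    print(age, {g: f"{r:.0%}" for g, r in rates.items()})
print("combined", {g: f"{r:.0%}" for g, r in combined.items()})
print("share aged 65-74", {g: f"{s:.0%}" for g, s in older_share.items()})
```

Only about a quarter of the smokers are in the older band, against about half of the non-smokers, so the pooled non-smoker rate is pulled up by the higher mortality of the older group.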

Using Tables for Inference Testing

Test for significant differences or relationships rather than just describing the data.

Based on comparing the observed cell values to those that could be expected using probability theory, assuming there are no significant differences or relationships.

Stated: the probability of falling into a particular cell is the product of the probability of being in a particular row and the probability of being in a particular column.
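This product rule is what turns marginal totals into expected cell counts. A short derivation, using symbols that are assumptions for illustration: R_i for the total of row i, C_j for the total of column j, N for the grand total and E_ij for the expected count in cell (i, j):

```latex
\begin{align*}
P(\text{row } i) &= \frac{R_i}{N}, \qquad P(\text{column } j) = \frac{C_j}{N} \\[4pt]
P(\text{cell } ij) &= P(\text{row } i)\, P(\text{column } j) = \frac{R_i\, C_j}{N^2} \\[4pt]
E_{ij} &= N \cdot P(\text{cell } ij) = \frac{R_i\, C_j}{N}
\end{align*}
```

The last line is the (row total)*(column total)/(grand total) rule for expected counts.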

Calculating Chi Square

The statistic most frequently used for inference with contingency tables is called the Chi Square statistic, written as χ² (the Greek letter chi, squared).

It is based on an expected versus actual (observed) values methodology and its formula is:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Translated this says:

Chi² = the sum of (observed - expected)² / expected

where the expected cell counts are given by:

expected = (row total) * (column total) / grand total

An Example

Are E. coli counts different between two lakes in Muskoka, one with cottages and one without?

1. Collect 200 samples of water from each.
2. Measure E. coli concentrations.
3. Is the sample above or below the acceptable background limit?

How to test this?

Collect Observed Values
Four hundred samples, 200 from each lake.

          No Cottage Lake   Cottage Lake   Total
Above                  43             81     124
Below                 157            119     276
Total                 200            200     400

Calculate Expected Values
expected = (row total) * (column total) / grand total
E.G. 124*200/400 = 62

                     No Cottages   Cottages   Total
Above (observed)              43         81     124
Above (expected)              62         62
Below (observed)             157        119     276
Below (expected)             138        138
Total                        200        200     400
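The expected values for all four cells follow from the marginal totals in the same way. A minimal sketch, assuming the observed lake counts are held as a list of rows; the names are illustrative:

```python
# Observed counts: rows are Above/Below background,
# columns are No Cottages/Cottages.
observed = [
    [43, 81],
    [157, 119],
]

row_totals = [sum(row) for row in observed]        # Above, Below
col_totals = [sum(col) for col in zip(*observed)]  # No Cottages, Cottages
grand_total = sum(row_totals)

# expected = (row total) * (column total) / grand total, for every cell.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

print(row_totals, col_totals, grand_total)
print(expected)
```

Because both lakes contribute 200 samples, the expected counts are the same in each column: 62 above and 138 below.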

Calculate Squared Deviation (O-E)² Values for Cells
E.G. (43-62)² = 361

                     No Cottages   Cottages   Total
Above (observed)              43         81     124
Above (expected)              62         62
(O-E)²                       361        361
Below (observed)             157        119     276
Below (expected)             138        138
(O-E)²                       361        361
Total                        200        200     400

Divide (O-E)² Values by Expected Values
E.G. 361/62 = 5.82

                     No Cottages   Cottages   Total
Above (observed)              43         81     124
Above (expected)              62         62
(O-E)²                       361        361
(O-E)²/E                    5.82       5.82
Below (observed)             157        119     276
Below (expected)             138        138
(O-E)²                       361        361
(O-E)²/E                    2.61       2.61
Total                        200        200     400

Sum the (O-E)² / Expected Values

                     No Cottages   Cottages   Total
Above (observed)              43         81     124
Above (expected)              62         62
(O-E)²                       361        361
(O-E)²/E                    5.82       5.82
Below (observed)             157        119     276
Below (expected)             138        138
(O-E)²                       361        361
(O-E)²/E                    2.61       2.61
Total                        200        200     400

Chi² = 5.82 + 5.82 + 2.61 + 2.61 = 16.86 (about 16.88 if the four terms are not truncated before summing).

Compare 16.86 to the ‘book’ value, i.e. the critical value from a Chi² table. If it is greater than the book value, there are significant differences in the table.
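The whole calculation can be sketched in a few lines. The function name `chi_square` is hypothetical. Two things here are assumptions the slides do not state: the 0.05 significance level, and the use of (rows - 1) * (columns - 1) degrees of freedom to pick the 'book' value. For 1 degree of freedom the standard table value at 0.05 is 3.841.

```python
import math

# Observed counts: rows are Above/Below background,
# columns are No Cottages/Cottages.
observed = [
    [43, 81],
    [157, 119],
]


def chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            stat += (o - e) ** 2 / e               # (O-E)^2 / E
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


chi2, dof = chi_square(observed)

# Assumed significance level of 0.05; table value for 1 degree of freedom.
CRITICAL_05_DOF1 = 3.841

# Valid for 1 degree of freedom only: upper-tail probability in closed form.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"Chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.6f}")
print("significant at 0.05:", chi2 > CRITICAL_05_DOF1)
```

Without intermediate rounding the statistic is about 16.88, far above 3.841, so under these assumptions the two lakes differ significantly.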

Interpreting the Example

We observed 43 samples from no cottage lakes above background but expected 62.
We observed 81 samples from cottage lakes above background but expected 62.
We observed 157 samples from no cottage lakes below background but expected 138.
We observed 119 samples from cottage lakes below background but expected 138.

Remember: watch your table manners.
