© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 1
Unit 9: Categorical predictors, II—Polychotomies and ANOVA
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 2
The S-030 roadmap: Where’s this unit in the big picture?
Unit 1: Introduction to simple linear regression
Unit 2: Correlation and causality
Unit 3: Inference for the regression model
Unit 4: Regression assumptions: Evaluating their tenability
Unit 5: Transformations to achieve linearity
Unit 6: The basics of multiple regression
Unit 7: Statistical control in depth: Correlation and collinearity
Unit 8: Categorical predictors I: Dichotomies
Unit 9: Categorical predictors II: Polychotomies
Unit 10: Interaction and quadratic effects
Unit 11: Regression modeling in practice
Course phases: Building a solid foundation; Mastering the subtleties; Adding additional predictors; Generalizing to other types of predictors and effects; Pulling it all together
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 3
In this unit, we’re going to learn about…
• Distinguishing between nominal and ordinal predictors
• How a series of 0/1 dummy variables can represent a nominal predictor
  – Why does regressing Y on all but one dummy variable yield the desired model?
  – Consequences of changing the reference category for parameter estimates and hypothesis tests
• The problem of multiple comparisons: How many contrasts have we examined?
  – The Bonferroni multiple comparison procedure: Splitting the p-value
• An alternative way of getting the identical results: The analysis of variance (ANOVA)
• What else might we do if we have an ordinal predictor?
• Presenting adjusted means when the question predictor is polychotomous
• Untangling the nomenclature: Regression, analysis of variance and analysis of covariance
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 4
Distinguishing between nominal and ordinal predictors
Nominal predictors: Variables whose values offer no meaningful quantitative information but simply distinguish between categories
  Race/Ethnicity: 1 = White, 2 = African American, 3 = Latino, 4 = Asian American, 5 = Native American
  Religion: 1 = Catholic, 2 = Protestant, 3 = Muslim, 4 = Jewish
Never directly include a nominal predictor in a regression model. Never!

Ordinal predictors: Variables whose values do reflect an underlying ordering of categories, but not necessarily the “distance” between categories
  Education: 1 = HS dropout, 2 = HS graduate, 3 = some college, 4 = college graduate
  Political views: 1 = liberal, 2 = moderate, 3 = conservative
You can directly include an ordinal predictor in a regression model, but be sure that’s what you want. It’s often not!
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 5
Regional differences in the price of fine French wine
Source: Thrane, C (2004). In defence of the price hedonic model in wine research, Journal of Wine Research, 15(2), 123-134

 ID  Region  Area        Price     Lprice    Year  Vintage
  3    2     Bordeaux    13.2286   2.58238    3    2001
109    4     Languedoc   13.2571   2.58454    2    2000
110    4     Languedoc   13.4286   2.59738    3    2001
131    3     Rhone       13.4429   2.59845    3    2001
133    3     Rhone       13.5000   2.60269    1    1999
111    4     Languedoc   13.5571   2.60691    3    2001
 61    1     Burgundy    14.1286   2.64820    3    2001
. . .
 57    2     Bordeaux    47.0714   3.85167    0    1998(-)
 58    2     Bordeaux    50.1429   3.91488    0    1998(-)
178    1     Burgundy    52.6429   3.96353    2    2000
160    3     Rhone       52.9000   3.96840    2    2000
183    1     Burgundy    62.9571   4.14245    3    2001
 60    2     Bordeaux    66.4000   4.19570    2    2000
RQ 1: Do wine prices vary (significantly) by region and vintage?
RQ 2: If so, which regions and vintages
are (significantly) different from which
other regions and vintages?
Outcome: Price
Region/Area (nominal): 1 = Burgundy, 2 = Bordeaux, 3 = Rhone, 4 = Languedoc
Year/Vintage (ordinal): 0 = '98 or older, 1 = '99, 2 = '00, 3 = '01
n = 113
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 6
How do wine prices vary by REGION?
          Burgundy   Bordeaux   Rhone    Languedoc
mean      3.39       3.25       3.06     2.65
(sd)      (0.48)     (0.40)     (0.36)   (0.19)

Much variability between regions: Burgundy is most expensive, on average
Is there heteroscedasticity? SD’s vary: the highest (.48) is about 2.5 times the lowest (.19)
You can buy a “cheap” (for Norway…) bottle from anywhere: e.g., each region’s range includes Lprice ≈ 2.75 (~$15)
That said, there’s great variability within regions
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 7
How do wine prices vary by VINTAGE?
          <= ‘98     1999       2000     2001
mean      3.46       3.20       3.13     2.85
(sd)      (0.34)     (0.41)     (0.43)   (0.37)

Much variability between vintages: On average, older wines are more expensive than younger ones
Less heteroscedasticity? SD’s still vary but appear more stable (with sd’s ≈ .40)
You can buy a “cheap” (for Norway…) bottle from any vintage: each vintage’s range includes Lprice ≈ 2.75 (~$15)
That said, there’s great variability within vintages
Two Qs we want to ask about group differences:
(1) How much credence should we give to observed differences between group means?
(2) In what context can we place these observed differences to evaluate their magnitude?
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 8
Why within-group variance is key to evaluating between-group differences
Let’s imagine 3 different data sets for a 4-level categorical predictor where the set of 4 means is identical in each
How much attention would you give to these observed group-to-group differences in means…
• …if there were equally moderate variability within groups?
• …if there were equally little variability within groups?
• …if there were equally great variability within groups?
Important message: Within-group variation provides a key context for evaluating the magnitude of between-group variation
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 9
Towards postulating a statistical model for group differences
[Figure: Lprice by group, two panels: Regional variation (Burgundy, Bordeaux, Rhone, Languedoc) and Vintage variation (<= ‘98, 1999, 2000, 2001)]
Region mean (sd): Burgundy 3.39 (0.48), Bordeaux 3.25 (0.40), Rhone 3.06 (0.36), Languedoc 2.65 (0.19)
Vintage mean (sd): <= ‘98 3.46 (0.34), 1999 3.20 (0.41), 2000 3.13 (0.43), 2001 2.85 (0.37)
We seek a statistical model that includes the effects of categorical predictors in a way that is similar to regression (in that its parameters represent population means) but that doesn’t force us to hypothesize the existence of a linear relationship
What happens if we incorrectly include REGION as a continuous predictor?
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 10
How do we include a polychotomy in a regression model? Creating a series of 0/1 dummy variables
Step One: Create a series of 0/1 dummy variables, one for every value of the categorical predictor
Step Two: Include all but one of the dummy variables in the multiple regression model (for K groups, you need only K-1 dummies)

$$Y = \beta_0 + \beta_1 DUMMY_1 + \beta_2 DUMMY_2 + \cdots + \beta_{k-1} DUMMY_{k-1} + \varepsilon$$

Region: 1 = Burgundy, 2 = Bordeaux, 3 = Rhone, 4 = Languedoc
Burgundy  = 1 if Burgundy,  0 if not Burgundy
Bordeaux  = 1 if Bordeaux,  0 if not Bordeaux
Rhone     = 1 if Rhone,     0 if not Rhone
Languedoc = 1 if Languedoc, 0 if not Languedoc

 ID   LPrice   Region  Area        Burgundy  Bordeaux  Rhone  Languedoc
 61   2.64820    1     Burgundy       1         0        0        0
 66   3.00285    1     Burgundy       1         0        0        0
 67   3.00498    1     Burgundy       1         0        0        0
 72   3.67449    1     Burgundy       1         0        0        0
178   3.96353    1     Burgundy       1         0        0        0
  7   2.65826    2     Bordeaux       0         1        0        0
 48   3.70658    2     Bordeaux       0         1        0        0
 53   3.75754    2     Bordeaux       0         1        0        0
 55   3.79324    2     Bordeaux       0         1        0        0
 56   3.81991    2     Bordeaux       0         1        0        0
146   2.99502    3     Rhone          0         0        1        0
145   2.99502    3     Rhone          0         0        1        0
151   3.20100    3     Rhone          0         0        1        0
152   3.22457    3     Rhone          0         0        1        0
154   3.40120    3     Rhone          0         0        1        0
119   2.71753    4     Languedoc      0         0        0        1
120   2.75366    4     Languedoc      0         0        0        1
122   2.84075    4     Languedoc      0         0        0        1
127   2.96674    4     Languedoc      0         0        0        1
180   2.96821    4     Languedoc      0         0        0        1

Collectively, the 4 dummies identify every wine’s specific region
But 3 dummies would also be sufficient to identify every wine’s specific region!

$$Y = \beta_0 + \beta_1 \text{Burgundy} + \beta_2 \text{Bordeaux} + \beta_3 \text{Rhone} + \varepsilon$$

Because the Y-intercept is the value of Y when all predictors = 0, it represents the mean outcome for the reference category
The dummies are mutually exclusive and, once the omitted reference category is counted, exhaustive
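The two steps above can be sketched in a short SAS data step. This is a minimal illustration rather than the course's own code: the data set name `dummies_demo` is hypothetical, and the four rows are simply copied from the listing on this slide. It relies on the fact that a logical comparison in SAS evaluates to 1 (true) or 0 (false), which gives a more compact alternative to if-then-else.

```sas
/* Sketch: build 0/1 dummies from a numeric REGION code.            */
/* dummies_demo is a hypothetical name; rows come from the slide.   */
data dummies_demo;
  input ID Lprice Region Area :$9.;
  Burgundy  = (Region = 1);   /* 1 if Burgundy, 0 otherwise  */
  Bordeaux  = (Region = 2);   /* 1 if Bordeaux, 0 otherwise  */
  Rhone     = (Region = 3);   /* 1 if Rhone, 0 otherwise     */
  Languedoc = (Region = 4);   /* 1 if Languedoc, 0 otherwise */
  datalines;
61 2.64820 1 Burgundy
7 2.65826 2 Bordeaux
146 2.99502 3 Rhone
119 2.71753 4 Languedoc
;
run;

/* Each row should show exactly one dummy equal to 1 */
proc print data=dummies_demo noobs;
run;
```

One caveat with this shortcut: a missing Region code would produce 0 on every dummy, so missing values are worth checking before fitting a model.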
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 11
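Why does regressing Y on all but one dummy yield our postulated model?

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \text{Burgundy} + \hat{\beta}_2 \text{Bordeaux} + \hat{\beta}_3 \text{Rhone}$$

Substituting each region’s values of (Burgundy, Bordeaux, Rhone):

Languedoc (0, 0, 0):
$$\hat{Y}_{Languedoc} = \hat{\beta}_0 + \hat{\beta}_1(0) + \hat{\beta}_2(0) + \hat{\beta}_3(0) = \hat{\beta}_0$$
Burgundy (1, 0, 0):
$$\hat{Y}_{Burgundy} = \hat{\beta}_0 + \hat{\beta}_1(1) + \hat{\beta}_2(0) + \hat{\beta}_3(0) = \hat{\beta}_0 + \hat{\beta}_1$$
Bordeaux (0, 1, 0):
$$\hat{Y}_{Bordeaux} = \hat{\beta}_0 + \hat{\beta}_1(0) + \hat{\beta}_2(1) + \hat{\beta}_3(0) = \hat{\beta}_0 + \hat{\beta}_2$$
Rhone (0, 0, 1):
$$\hat{Y}_{Rhone} = \hat{\beta}_0 + \hat{\beta}_1(0) + \hat{\beta}_2(0) + \hat{\beta}_3(1) = \hat{\beta}_0 + \hat{\beta}_3$$

$\hat{\beta}_0$ estimates the average price of a Languedoc
$\hat{\beta}_1$ estimates the difference between the average Burgundy and the average Languedoc price
[Figure: the four fitted group means, labeled $\hat{\beta}_0$, $\hat{\beta}_0 + \hat{\beta}_1$, $\hat{\beta}_0 + \hat{\beta}_2$ and $\hat{\beta}_0 + \hat{\beta}_3$]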
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 12
Results of regressing LPrice on 3 regional dummies (Burgundy, Bordeaux and Rhone—Languedoc is the reference
category)
The REG Procedure
Dependent Variable: Lprice

Analysis of Variance
                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               3       7.88009     2.62670     20.62   <.0001
Error             109      13.88838     0.12742
Corrected Total   112      21.76847

Root MSE           0.35695    R-Square   0.3620
Dependent Mean     3.06102    Adj R-Sq   0.3444
Coeff Var         11.66128

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      2.64606     0.06517     40.60     <.0001
Burgundy      1      0.74061     0.13566      5.46     <.0001
Bordeaux      1      0.60072     0.08213      7.31     <.0001
Rhone         1      0.41690     0.09893      4.21     <.0001
Each regression coefficient estimates the differential between the mean of that group and the mean of the reference group (NOT the overall mean): The estimated mean log(price) of each region’s wine is significantly higher than that of the Languedoc (p<.0001)…
Region “explains” just over 1/3 of the variation in price (R2=36.2%)
The intercept provides the estimated mean for the reference category: The estimated mean log(price) for Languedoc wines is 2.65
RMSE estimates the average within-group standard deviation
Wine prices vary significantly by region. We reject $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ at the p<.0001 level. (Note that this is now a very interesting test.)
BUT we don’t yet know if there are significant price differences between Burgundy and Rhone, Rhone and Bordeaux, etc.

$$\hat{Y} = 2.65 + 0.74\,\text{Burgundy} + 0.60\,\text{Bordeaux} + 0.42\,\text{Rhone}$$
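A PROC REG call along the following lines would produce output of the kind shown on this slide. It is a sketch under assumptions: it presumes the data set `one` and the dummy variables created in the appendix data step, and the labeled TEST statement is an illustrative addition (not shown in the slides) that asks about one specific contrast the default output leaves unanswered.

```sas
/* Sketch: regress Lprice on 3 regional dummies.                   */
/* Assumes data set ONE with Burgundy, Bordeaux, Rhone (appendix). */
proc reg data=one;
  /* Languedoc is omitted, so it acts as the reference category */
  model Lprice = Burgundy Bordeaux Rhone;
  /* Illustrative extra: H0 that Burgundy and Rhone share the   */
  /* same mean log(price), i.e. beta1 - beta3 = 0               */
  BurgVsRhone: test Burgundy - Rhone = 0;
run;
quit;
```

The TEST statement reports an F statistic with 1 numerator degree of freedom, which is the square of the t statistic for the same contrast.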
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 13
Relating the fitted model to the sample data

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \text{Burgundy} + \hat{\beta}_2 \text{Bordeaux} + \hat{\beta}_3 \text{Rhone}$$
$$\hat{Y} = 2.65 + 0.74\,\text{Burgundy} + 0.60\,\text{Bordeaux} + 0.42\,\text{Rhone}$$

Estimated means: Burgundy 3.39, Bordeaux 3.25, Rhone 3.06, Languedoc 2.65
Estimates: $\hat{\beta}_0 = 2.65$, $\hat{\beta}_1 = 0.74$, $\hat{\beta}_2 = 0.60$, $\hat{\beta}_3 = 0.42$

Parameter estimate for the intercept ($\hat{\beta}_0$) = estimated mean for the reference category (e.g., the estimated mean for the Languedoc is 2.65)
Parameter estimate for each dummy variable = estimated difference in means between this category and the reference category (e.g., we estimate that the mean difference in Lprice between Burgundy and the Languedoc is 0.74, which is the difference between 3.39 and 2.65)
Interpretation of estimates for categorical predictors depends, then, on choice of the reference category... So choose your reference category wisely
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 14
What happens if we change the model’s “reference category”?

Reference Category: Languedoc
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      2.64606     0.06517     40.60     <.0001
Burgundy      1      0.74061     0.13566      5.46     <.0001
Bordeaux      1      0.60072     0.08213      7.31     <.0001
Rhone         1      0.41690     0.09893      4.21     <.0001

Reference Category: Rhone
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      3.06296     0.07443     41.15     <.0001
Burgundy      1      0.32372     0.14035      2.31     0.0230
Bordeaux      1      0.18382     0.08966      2.05     0.0427
Languedoc     1     -0.41690     0.09893     -4.21     <.0001

Reference Category: Bordeaux
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      3.24678     0.04998     64.96     <.0001
Burgundy      1      0.13990     0.12906      1.08     0.2808
Rhone         1     -0.18382     0.08966     -2.05     0.0427
Languedoc     1     -0.60072     0.08213     -7.31     <.0001

Reference Category: Burgundy
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      3.38667     0.11898     28.46     <.0001
Bordeaux      1     -0.13990     0.12906     -1.08     0.2808
Rhone         1     -0.32372     0.14035     -2.31     0.0230
Languedoc     1     -0.74061     0.13566     -5.46     <.0001

Summary of the four models:
                   Model A      Model B      Model C      Model D
Reference Group    Languedoc    Rhone        Bordeaux     Burgundy
Intercept          2.65         3.06         3.25         3.39
                   t=40.60      t=41.15      t=64.96      t=28.46
                   p<0.0001     p<0.0001     p<0.0001     p<0.0001
Burgundy           0.74         0.32         0.14         (ref)
                   t=5.46       t=2.31       t=1.08
                   p<0.0001     p=0.0230     p=0.2808
Bordeaux           0.60         0.18         (ref)        -0.14
                   t=7.31       t=2.05                    t=-1.08
                   p<0.0001     p=0.0427                  p=0.2808
Rhone              0.42         (ref)        -0.18        -0.32
                   t=4.21                    t=-2.05      t=-2.31
                   p<0.0001                  p=0.0427     p=0.0230
Languedoc          (ref)        -0.42        -0.60        -0.74
                                t=-4.21      t=-7.31      t=-5.46
                                p<0.0001     p<0.0001     p<0.0001
R2                 36.20        36.20        36.20        36.20
F                  20.62        20.62        20.62        20.62
df                 (3,109)      (3,109)      (3,109)      (3,109)
P                  <0.0001      <0.0001      <0.0001      <0.0001
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 15
Understanding the consequences of changing the reference category on estimated regression coefficients and hypothesis
tests
The intercept is always the estimated mean of Y in the
reference category
[Table repeated from the previous slide: Models A to D, with Languedoc, Rhone, Bordeaux and Burgundy in turn as the reference group]
Parameter estimates (and associated tests) for each dummy variable will change because they always refer to the estimated difference between the mean for that group and the mean for that model’s reference category.
The estimate and associated test for all specific contrasts remain the same (although the sign will change to reflect the reversal of the contrast’s direction)
Even though there’s significant variation between regions, not all regions are significantly different from each other
RQ 1: Do wine prices vary (significantly) by region? YES: F(3,109)=20.62, p<0.0001
RQ 2: If so, which regions are (significantly) different from which other regions? Are we sure that we know?
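The four models compared on this slide can be fit in a single PROC REG step by labeling each MODEL statement. This is a sketch that assumes the data set `one` and the four regional dummies from the appendix; the labels `ModelA` to `ModelD` are chosen here only to match the slide's column headings.

```sas
/* Sketch: same predictor, four choices of omitted (reference) dummy. */
/* Assumes data set ONE with all four regional dummies (appendix).    */
proc reg data=one;
  ModelA: model Lprice = Burgundy Bordeaux Rhone;      /* ref: Languedoc */
  ModelB: model Lprice = Burgundy Bordeaux Languedoc;  /* ref: Rhone     */
  ModelC: model Lprice = Burgundy Rhone Languedoc;     /* ref: Bordeaux  */
  ModelD: model Lprice = Bordeaux Rhone Languedoc;     /* ref: Burgundy  */
run;
quit;
```

Because all four models carry the same information about region, the ANOVA table, R-square and overall F test should be identical across them; only the intercept and the individual contrasts change.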
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 16
The problem of multiple comparisons: How many contrasts have we examined & taken together, should we ‘believe’ all these tests?
[Table repeated from Slide 14: Models A to D, with Languedoc, Rhone, Bordeaux and Burgundy in turn as the reference group]
Two types of errors we can make every time we conduct a hypothesis test:
Type I error: Rejecting H0 when it’s true—saying there’s a difference in means when there isn’t
Type II error: Failing to reject H0 when it’s false—saying we can’t find a difference in means when there really is one
We focus on minimizing Type I error when we set p=0.05 for our tests, but as we conduct multiple tests, the Type I error for the “family of tests” grows

Effect of making multiple comparisons on the Type I error for the entire family of tests (p=0.05):
# tests    # wrong (expected)
  1        0.05
  2        0.10
  5        0.25
 10        0.50
 20        1.00
 50        2.50
100        5.00

Idea: Instead of using p=0.05 for each individual test, why not use p=0.05 for the entire family of tests when we examine multiple contrasts to test a single hypothesis?
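The table's "# wrong" column is the number of tests multiplied by 0.05. The short data step below reproduces that arithmetic and, as an addition that is not on the slide, also computes the chance of at least one false rejection under the extra assumption that the tests are independent and every null hypothesis is true. The data set and variable names are hypothetical.

```sas
/* Sketch: how Type I error accumulates across a family of tests. */
data family_error;
  alpha = 0.05;
  do n_tests = 1, 2, 5, 10, 20, 50, 100;
    /* the slide's "# wrong": expected false rejections */
    expected_wrong = n_tests * alpha;
    /* chance of one or more false rejections, ASSUMING independence */
    p_any_error = 1 - (1 - alpha)**n_tests;
    output;
  end;
run;

proc print data=family_error noobs;
  var n_tests expected_wrong p_any_error;
run;
```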
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 17
Multiple comparison procedures: What they are and how they’re used
Some multiple comparison procedures:
•Duncan’s Multiple Range Test
•Tukey’s Honest Significant Difference
•Scheffe’s Multiple Comparison Test
•Newman Keuls Multiple Comparison Test
•Benjamini & Hochberg
•…. many more, including …
•Bonferroni’s method

Issues involved in selecting an approach:
•A priori or post-hoc comparison?
•Simple or complex comparison?
•Is there a clearly identified control group?
•Equal or unequal n’s within groups?

The Bonferroni approach:
• Take a chosen Type I error rate (usually 0.05) and “split it” across the entire family of tests you’re conducting
• For 2 tests, conduct each at the 0.025 level
• For 5 tests, conduct each at the 0.01 level
• Use this new p-value to identify the new t-statistic for testing each individual hypothesis in the family, for a given number of degrees of freedom

$$p_{new} = \frac{p_{old}}{n_{tests}}$$

New p-value and associated t-statistic to use the Bonferroni method to keep the “family error rate” at 0.05 (two-tailed tests)
                      New t-statistic
# tests   New p     df=50   df=100   df=∞
  1       0.0500    2.01    1.98     1.96
  2       0.0250    2.31    2.28     2.24
  3       0.0167    2.48    2.43     2.39
  4       0.0125    2.59    2.54     2.50
  5       0.0100    2.68    2.63     2.58
  6       0.0083    2.75    2.69     2.64
 10       0.0050    2.94    2.87     2.81
 20       0.0025    3.18    3.10     3.02
 50       0.0010    3.50    3.39     3.29
100       0.0005    3.72    3.60     3.48

As # of tests increases, critical t-values increase. As DF increase, critical t-values decrease.
SurfStat t-distribution calculator
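Critical values like those in the Bonferroni table can also be computed directly in SAS rather than looked up. The sketch below uses the TINV function for the t quantile and PROBIT for the normal quantile (standing in for infinite degrees of freedom); the data set and variable names are hypothetical, and the printed values should come close to the slide's table up to rounding.

```sas
/* Sketch: Bonferroni-adjusted p-values and two-tailed critical t's. */
data bonferroni_t;
  p_old = 0.05;
  do n_tests = 1, 2, 3, 4, 5, 6, 10, 20, 50, 100;
    p_new   = p_old / n_tests;            /* split the family error rate */
    t_df50  = tinv(1 - p_new/2, 50);      /* two-tailed, df = 50         */
    t_df100 = tinv(1 - p_new/2, 100);     /* two-tailed, df = 100        */
    t_inf   = probit(1 - p_new/2);        /* normal limit as df grows    */
    output;
  end;
  format p_new 6.4 t_df50 t_df100 t_inf 5.2;
run;

proc print data=bonferroni_t noobs;
  var n_tests p_new t_df50 t_df100 t_inf;
run;
```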
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 18
Applying Bonferroni multiple comparisons to regional variation in wine prices
[Table repeated from Slide 14: Models A to D, with Languedoc, Rhone, Bordeaux and Burgundy in turn as the reference group]
Critical t = 2.69, p < 0.0083 (from the previous slide’s table: 6 tests, df=100)
The mean Languedoc price is still significantly different from that of Burgundy, Bordeaux and the Rhone
The mean Bordeaux price is still indistinguishable from that of Burgundy (note: a test that didn’t reject on its own will never reject after using a multiple comparison procedure)
What changes: The mean Rhone price is now indistinguishable from that of Burgundy and Bordeaux
RQ 2: If so, which regions are (significantly) different from which other regions? Only the Languedoc is significantly different from all 3 others
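The decision rule applied on this slide can be written out as a small data step. This is an illustrative sketch: the data set name and the contrast labels are hypothetical, the t statistics are copied from the four-model table, and the critical value is computed for the model's actual error df (109) rather than the df=100 row of the lookup table, so it will differ slightly from 2.69.

```sas
/* Sketch: flag which of the 6 pairwise contrasts survive Bonferroni. */
data bonferroni_check;
  input contrast :$24. t_value;
  n_tests  = 6;                                    /* 4 regions -> 6 pairs */
  df_error = 109;                                  /* error df from output */
  t_crit   = tinv(1 - (0.05 / n_tests) / 2, df_error);
  reject   = (abs(t_value) > t_crit);              /* 1 = still significant */
  datalines;
Burgundy_vs_Languedoc 5.46
Bordeaux_vs_Languedoc 7.31
Rhone_vs_Languedoc 4.21
Burgundy_vs_Rhone 2.31
Bordeaux_vs_Rhone 2.05
Burgundy_vs_Bordeaux 1.08
;
run;

proc print data=bonferroni_check noobs;
  var contrast t_value t_crit reject;
run;
```

Only the three contrasts involving the Languedoc have |t| above the adjusted critical value, which matches the slide's conclusion.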
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 19
An alternative way of getting the identical multiple regression results:
The analysis of variance (ANOVA) obtained in SAS using PROC GLM
The GLM Procedure
Dependent Variable: Lprice

                               Sum of
Source             DF         Squares    Mean Square   F Value   Pr > F
Model               3      7.88008697     2.62669566     20.62   <.0001
Error             109     13.88837973     0.12741633
Corrected Total   112     21.76846670

R-Square    Coeff Var    Root MSE    Lprice Mean
0.361995    11.66128     0.356954    3.061022

                                          Standard
Parameter                 Estimate           Error   t Value   Pr > |t|
Intercept              2.646060467 B    0.06517063     40.60     <.0001
Region 1=Burgundy      0.740613638 B    0.13566348      5.46     <.0001
Region 2=Bordeaux      0.600717281 B    0.08213142      7.31     <.0001
Region 3=Rhone         0.416895264 B    0.09892953      4.21     <.0001
Region 4=Languedoc     0.000000000 B             .         .          .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Bonferroni

              Lprice       LSMEAN
Region        LSMEAN       Number
1         3.38667410         1
2         3.24677775         2
3         3.06295573         3
4         2.64606047         4

Least Squares Means for Effect Region
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: Lprice

i/j            1            2            3            4
1                    1.083988     2.306561     5.459197
                       1.0000       0.1378       <.0001
2       -1.08399                  2.050303     7.314098
          1.0000                    0.2564       <.0001
3       -2.30656      -2.0503                  4.214063
          0.1378       0.2564                    0.0003
4        -5.4592      -7.3141     -4.21406
          <.0001       <.0001       0.0003
The GLM in PROC GLM stands for General Linear Model. It is SAS’ procedure for conducting an analysis of variance (ANOVA), the results of which we already obtained using PROC REG
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 20
Multiple comparisons in practice, I: Astrological signs and health
Austin et al (2006) Journal of Clinical Epidemiology, 59, 964-969
Studied all 10,674,945 residents of Ontario, between 18 and 100 in 2000
•Question predictor: Astrological sign (which has 12 categories)

Hypothesis generating sample (n=5,333,472)
•Studied the 223 diagnoses (e.g., neck fracture, heart failure etc) that accounted for over 90% of all hospitalizations in the region
•Of these 223 diagnoses, there were 72 (32.3%) for which residents from one astrological sign had a significantly higher probability of hospitalization (p’s ranging from 0.0003 to 0.0488); these focused on 24 diagnoses
•Lowest p value (.0006) for Taurus being 27% more likely to have diverticula of intestine. FYI, Capricorns were 28% more likely to have abortions
If they had adjusted for multiple comparisons, none of these 72 tests would have rejected

Hypothesis validation sample (n=5,333,473)
•Studied the 24 diagnoses in this second sample; only 2 associations remained statistically significant
•Leos were 15% more likely to be hospitalized for gastrointestinal hemorrhage (p=0.0483); Sagittarians were 38% more likely to have fractures of the humerus (p=0.0125)
If they had adjusted for multiple comparisons, none of these tests would have rejected either

How many multiple comparisons?
$$\frac{n_{categories}\,(n_{categories} - 1)}{2}$$
With 12 astrological signs: (12*11)/2 = 66 contrasts per diagnosis, which adds up to 14,718 comparisons across the 223 diagnoses
p = 0.000003485, t = 4.64
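As a quick check of the counting rule, the number of pairwise contrasts among k categories is the number of ways to choose 2 of them. The worked cases below use only numbers already in this unit (4 wine regions, 12 astrological signs, 223 diagnoses); the symbol k is introduced here for convenience.

```latex
\[
\binom{k}{2} = \frac{k(k-1)}{2},
\qquad
k = 4:\ \frac{4 \cdot 3}{2} = 6,
\qquad
k = 12:\ \frac{12 \cdot 11}{2} = 66,
\qquad
66 \times 223 = 14{,}718
\]
```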
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 21
Multiple comparisons in practice, II: PISA international comparisons
Source: OECD (2005) Education at a Glance
How many multiple comparisons?
$$\frac{n_{categories}\,(n_{categories} - 1)}{2}$$
With 29 countries: (29*28)/2 = 406 contrasts
p = 0.000124, t = 3.84
“What's wrong with Bonferroni adjustments,” British Medical Journal 1998;316:1236-1238
In controlling your overall Type I error, you’re inevitably increasing your Type II error (that is, decreasing your statistical power)
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 22
How do we handle an ordinal categorical predictor like Vintage?
Step One: Create a series of 0/1 dummy variables, one for every value of the categorical predictor
Step Two: Include all but one of the dummy variables in the multiple regression model

Year: 0 = '98 or older, 1 = '99, 2 = '00, 3 = '01
Yr98 = 1 if '98 or older, 0 if not '98 or older
Yr99 = 1 if '99, 0 if not '99
Yr00 = 1 if '00, 0 if not '00
Yr01 = 1 if '01, 0 if not '01

 ID   Lprice    Year  VINTAGE   Yr98  Yr99  Yr00  Yr01
  4   2.65525    0    <= '98     1     0     0     0
 19   3.06606    0    <= '98     1     0     0     0
 22   3.11605    0    <= '98     1     0     0     0
 50   3.70658    0    <= '98     1     0     0     0
 53   3.75754    0    <= '98     1     0     0     0
119   2.71753    1    '99        0     1     0     0
124   2.85400    1    '99        0     1     0     0
157   3.55126    1    '99        0     1     0     0
 46   3.63646    1    '99        0     1     0     0
 56   3.81991    1    '99        0     1     0     0
 13   2.88080    2    '00        0     0     1     0
141   2.88240    2    '00        0     0     1     0
155   3.48781    2    '00        0     0     1     0
182   3.49391    2    '00        0     0     1     0
158   3.69102    2    '00        0     0     1     0
122   2.84075    3    '01        0     0     0     1
 20   3.08257    3    '01        0     0     0     1
150   3.11668    3    '01        0     0     0     1
 42   3.55494    3    '01        0     0     0     1
 45   3.59103    3    '01        0     0     0     1

Collectively, the 4 dummies identify every wine’s specific vintage
But 3 dummies would still be sufficient to identify every wine’s specific vintage!

$$Y = \beta_0 + \beta_1 \text{Yr98} + \beta_2 \text{Yr99} + \beta_3 \text{Yr00} + \varepsilon$$
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 23
Results of regressing LPrice on 3 vintage dummies (Yr98, Yr99 and Yr00; Yr01 is the reference category)

The REG Procedure
Dependent Variable: Lprice

Analysis of Variance
                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               3       5.21329     1.73776     11.44   <.0001
Error             109      16.55518     0.15188
Corrected Total   112      21.76847

Root MSE           0.38972    R-Square   0.2395
Dependent Mean     3.06102    Adj R-Sq   0.2186
Coeff Var         12.73172

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      2.84676     0.05567     51.13     <.0001
yr98          1      0.61424     0.11222      5.47     <.0001
yr99          1      0.35472     0.12158      2.92     0.0043
yr00          1      0.27921     0.08625      3.24     0.0016

The regression coefficient for each dummy variable estimates the mean differential between that group and the reference category: The estimated mean log(price) of wines from all other vintages is significantly higher than that of wines from 2001
Vintage “explains” ¼ of the variation in price (R2=24.0%)
The intercept provides the estimated mean for the reference category: The estimated mean log(price) for wines from 2001 is 2.85
RMSE estimates the average within-group standard deviation
Wine prices vary significantly by vintage. We can reject $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ at the p<.0001 level

$$\hat{Y} = 2.85 + 0.61\,\text{Yr98} + 0.35\,\text{Yr99} + 0.28\,\text{Yr00}$$
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 24
What about multiple comparisons for the vintage means?
[Figure: estimated mean Lprice (axis 2.50 to 3.75) plotted against vintage year (axis 1997 to 2002)]

$$\hat{Y} = 2.85 + 0.61\,\text{Yr98} + 0.35\,\text{Yr99} + 0.28\,\text{Yr00}$$

The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Bonferroni

              Lprice       LSMEAN
Vintage       LSMEAN       Number
1998(-)   3.46100054         1
1999      3.20148133         2
2000      3.12596879         3
2001      2.84676233         4

Least Squares Means for Effect Vintage
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: Lprice

i/j            1            2            3            4
1                    1.783399     2.848664     5.473744
                       0.4638       0.0315       <.0001
2        -1.7834                  0.596555     2.917458
          0.4638                    1.0000       0.0257
3       -2.84866     -0.59655                   3.23716
          0.0315       1.0000                    0.0096
4       -5.47374     -2.91746     -3.23716
          <.0001       0.0257       0.0096

The mean price for 2001 is significantly different from all earlier vintages
2000 is distinguishable from 1998
Aside from the ’00/’01 contrast, adjacent vintages are indistinguishable
Do the estimated means for the ordinal predictor seem to follow a pattern?

$$Y = \beta_0 + \beta_1 \text{YEAR} + \varepsilon$$
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 25
What happens when we use continuous YEAR instead of Vintage dummies?

Treating YEAR as a categorical predictor
Root MSE           0.38972    R-Square   0.2395
Dependent Mean     3.06102    Adj R-Sq   0.2186
Coeff Var         12.73172

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      3.46100     0.09743     35.52     <.0001
yr99          1     -0.25952     0.14552     -1.78     0.0773
yr00          1     -0.33503     0.11761     -2.85     0.0052
yr01          1     -0.61424     0.11222     -5.47     <.0001

Treating YEAR as a continuous predictor
Root MSE           0.38850    R-Square   0.2304
Dependent Mean     3.06102    Adj R-Sq   0.2235
Coeff Var         12.69172

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      3.46733     0.07940     43.67     <.0001
Year          1     -0.19962     0.03463     -5.76     <.0001

Region effects controlling for continuous Year
Root MSE           0.33243    R-Square   0.4517
Dependent Mean     3.06102    Adj R-Sq   0.4314
Coeff Var         10.86008

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      3.00062     0.10390     28.88     <.0001
Year          1     -0.13814     0.03286     -4.20     <.0001
Burgundy      1      0.67768     0.12723      5.33     <.0001
Bordeaux      1      0.46014     0.08348      5.51     <.0001
Rhone         1      0.39267     0.09231      4.25     <.0001

Bonferroni multiple comparisons for REGION controlling for continuous Year

              Lprice       LSMEAN
Region        LSMEAN       Number
1         3.39713319         1
2         3.17958869         2
3         3.11212112         3
4         2.71945067         4

Least Squares Means for Effect Region
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: Lprice

i/j            1            2            3            4
1                    1.789005       2.1752     5.326594
                       0.4585       0.1908       <.0001
2         -1.789                  0.766998     5.512103
          0.4585                    1.0000       <.0001
3        -2.1752       -0.767                  4.253707
          0.1908       1.0000                    0.0003
4       -5.32659      -5.5121     -4.25371
          <.0001       <.0001       0.0003
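Output like the region comparisons controlling for continuous Year can plausibly be obtained by adding Year as a covariate in PROC GLM. The call below is a sketch patterned on the appendix's PROC GLM step, not the author's confirmed code; it assumes the data set `one` from the appendix, where Year is the numeric 0 to 3 code.

```sas
/* Sketch: region effects with linear Year as a covariate (ANCOVA). */
/* Assumes data set ONE from the appendix data step.                */
proc glm data=one;
  class region;                            /* Region treated as categorical */
  model lprice = year region / solution;   /* Year enters as continuous     */
  /* Adjusted (least squares) means for Region, evaluated at the     */
  /* mean of Year, with Bonferroni-adjusted pairwise comparisons     */
  lsmeans region / adjust=bon tdiff pdiff;
run;
quit;
```

Because Year is not listed on the CLASS statement, it is treated as a continuous covariate, and the LSMEANS for Region are computed at its sample mean.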
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 26
How might we present the results of this analysis?
Regression results predicting the loge(price) of French wine by vintage and region (Languedoc is the omitted category)

                        Model A      Model B      Model C
Intercept               3.46***      2.65***      3.00***
                        (0.08)       (0.07)       (0.10)
Vintage (linear year)   -0.20***                  -0.14***
                        (0.03)                    (0.03)
Burgundy                             0.74***      0.68***
                                     (0.14)       (0.13)
Bordeaux                             0.60***      0.46***
                                     (0.08)       (0.08)
Rhone                                0.42***      0.39***
                                     (0.10)       (0.09)
R2                      23.0         36.2         45.2
F                       33.23        20.62        22.25
df                      (1,111)      (3,109)      (4,108)
P                       <0.0001      <0.0001      <0.0001

Cell entries are estimated regression coefficients and standard errors. ***p<0.0001

The only statistically significant difference in regional means, after linearly controlling for vintage, is between the Languedoc and all others
[Figure: fitted Loge(Price) (axis 2.50 to 3.75) by Vintage (axis 1997 to 2002), one line per region: Burgundy, Bordeaux, Rhone, Languedoc]
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 27
Supplemental presentation of adjusted means
[Figure: fitted Loge(Price) (axis 2.50 to 3.75) by vintage year (axis 1997 to 2002), one line per region: Burgundy, Bordeaux, Rhone, Languedoc; adjusted means 3.40, 3.18, 3.11 and 2.72 marked]

Year     0      1      2      3
(n)     (16)   (13)   (35)   (49)

$$\hat{Y} = 3.00 - 0.14\,\text{Year} + 0.68\,\text{Burgundy} + 0.46\,\text{Bordeaux} + 0.39\,\text{Rhone}$$
When Year = mean Year = 2.04:
$$\hat{Y} = 3.00 - 0.14(2.04) + 0.68\,\text{Burgundy} + 0.46\,\text{Bordeaux} + 0.39\,\text{Rhone}$$
$$\hat{Y} = 2.72 + 0.68\,\text{Burgundy} + 0.46\,\text{Bordeaux} + 0.39\,\text{Rhone}$$

             Unadjusted mean   Adjusted mean (controlling for vintage)
Burgundy     3.39              3.40
Bordeaux     3.25              3.18
Rhone        3.06              3.11
Languedoc    2.65              2.72
A word about nomenclature: Regression, GLM, ANOVA and ANCOVA
General Linear Model
Regression Model

Regression: Statistical model relating continuous and categorical predictors to a continuous outcome
  Initially developed for observational studies & sample surveys and can be applied to designed experiments
  Early adopters: Sociologists and economists

Analysis of Variance: Statistical model relating categorical predictors to a continuous outcome
  Initially developed to measure treatment effects in designed experiments (ideally using a balanced, equal n, design)
  Early adopters: Psychologists and agricultural researchers

Analysis of Covariance: Statistical model relating categorical predictors to a continuous outcome, controlling for one or more covariates
  Initially developed to measure treatment effects in a quasi-experiment with a covariate
  Early adopters: Educational and medical researchers
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 29
A ‘standard’ psychology dept presentation of these methods
David Howell, Statistical Methods for Psychology
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 30
A ‘standard’ economics dept presentation of these methods
Peter Kennedy, A Guide to Econometrics
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 31
What’s the big takeaway from this unit?
• Regression models can easily include polychotomies
  – Once you know how to include dichotomous predictors, it’s easy to extend this strategy to polychotomous predictors
  – Can be used for either nominal or ordinal predictors
  – Make a wise decision about the omitted (reference) category: results are most easily interpreted if it provides an interesting/important comparison
• Understand the issues associated with conducting multiple hypothesis tests
  – The more predictors you have, the more models you fit, and the more hypothesis tests you conduct
  – Don’t fall into the trap of strictly interpreting p-values and consider correcting for the multiplicity of tests
• Analysis of variance is just a special case of multiple regression
  – There’s nothing mysterious about ANOVA; it’s just regression on dummy variables
  – By learning regression, you’re learning the more general approach, of which classical ANOVA is just a special case
© Judith D. Singer, Harvard Graduate School of Education Unit 9/Slide 32
Appendix: Annotated PC-SAS Code for Using Polychotomies
*-------------------------------------------------------------------*
 Input Wine data and name variables in dataset
 Create transformation of outcome variable PRICE
 Create dummy coding system for REGION/AREA and YEAR/VINTAGE
*------------------------------------------------------------------*;
data one;
  infile "m:\datasets\wine.txt";
  input ID 1-3 Price 5-16 Region 19 Area $ 21-31
        Year 34 Vintage $ 38-44 Rating 48-51;

  /* Outcome is analyzed on the natural-log scale */
  Lprice = log(price);

  /* 0/1 dummies for the four regions */
  if Region = 1 then Burgundy=1;  else Burgundy=0;
  if Region = 2 then Bordeaux=1;  else Bordeaux=0;
  if Region = 3 then Rhone=1;     else Rhone=0;
  if Region = 4 then Languedoc=1; else Languedoc=0;

  /* 0/1 dummies for the four vintages */
  if year=0 then yr98=1; else yr98=0;
  if year=1 then yr99=1; else yr99=0;
  if year=2 then yr00=1; else yr00=0;
  if year=3 then yr01=1; else yr01=0;
run;
The data step includes code that takes the two polychotomies (Region and Year) and creates a series of 0/1 indicator variables (dummy variables) for each. Each if-then-else statement specifies which category of the polychotomy is coded 1 on the new indicator variable; all other categories are coded 0.
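The dummy-coding logic of the data step can be sketched outside SAS. This is a minimal Python illustration, assuming the same coding as the if-then-else statements (1 = Burgundy, 2 = Bordeaux, 3 = Rhone, 4 = Languedoc). The function name `make_dummies` and the five region codes are made up for illustration and are not the wine data.

```python
# Region coding taken from the SAS if-then-else statements.
REGION_NAMES = {1: "Burgundy", 2: "Bordeaux", 3: "Rhone", 4: "Languedoc"}


def make_dummies(region_code):
    """Return 0/1 indicators, one per region, for a single wine (hypothetical helper)."""
    return {name: int(region_code == code) for code, name in REGION_NAMES.items()}


# Made-up region codes for five wines (NOT the actual wine data).
regions = [1, 3, 4, 2, 3]
rows = [make_dummies(r) for r in regions]

for code, row in zip(regions, rows):
    print(code, row)
```

Each wine is coded 1 on exactly one indicator and 0 on the rest, which is why one indicator must later be left out of the regression as the reference category.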
*-------------------------------------------------------------------*
  Fitting a general linear model LPRICE by REGION (ANOVA approach
  using PROC GLM) with Bonferroni multiple comparisons
*------------------------------------------------------------------*;

proc glm data=one;
  title2 "Demonstrating the equivalence of ANOVA and regression";
  class region;
  model lprice = region/solution;
  lsmeans region/adjust=bon tdiff pdiff;
proc glm is SAS's general linear model procedure and an easy way to fit an analysis of variance (ANOVA) model. For our purposes here, its greatest value is the simplicity with which it performs multiple comparison tests. The lsmeans region/adjust=bon statement tells SAS to output Bonferroni multiple comparisons (here, by region), with t-statistics and Bonferroni-adjusted p-values for each pairwise difference.
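The arithmetic behind a Bonferroni adjustment is simple enough to sketch by hand. The Python below is an illustration, not PROC GLM's output. It assumes all pairwise comparisons among the four regions are examined, and it uses the two unadjusted p-values for Burgundy and Bordeaux versus Rhone (0.0230 and 0.0427) that appear in the regression output later in this appendix. The function name `bonferroni` is hypothetical.

```python
from itertools import combinations


def bonferroni(p_values, n_comparisons):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]


regions = ["Burgundy", "Bordeaux", "Rhone", "Languedoc"]
pairs = list(combinations(regions, 2))   # every pairwise contrast among 4 regions
n_comparisons = len(pairs)               # 4 * 3 / 2 = 6

# Unadjusted p-values for two of the six contrasts, copied from the
# regression output with Rhone as the reference category.
raw_p = {("Burgundy", "Rhone"): 0.0230, ("Bordeaux", "Rhone"): 0.0427}
adj_p = dict(zip(raw_p, bonferroni(list(raw_p.values()), n_comparisons)))

for pair, p in adj_p.items():
    print(pair, f"unadjusted {raw_p[pair]:.4f} -> adjusted {p:.4f}")
```

Under these assumptions both contrasts, significant at the .05 level before adjustment, have adjusted p-values of about 0.138 and 0.256, which illustrates how the correction guards against over-interpreting any single test.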
Glossary terms included in Unit 9
• Categorical predictor
• Dummy variables
• Multiple comparisons
• Polychotomous predictor
• Type I and Type II error
Appendix: Incorrectly including REGION as a continuous predictor
Regression results with REGION as a continuous predictor: INCORRECT, OF COURSE

The REG Procedure
Model: MODEL1
Dependent Variable: Lprice

Number of Observations Read    113
Number of Observations Used    113

Analysis of Variance

                          Sum of        Mean
Source             DF    Squares      Square    F Value    Pr > F
Model               1    7.45565     7.45565      57.82    <.0001
Error             111   14.31281     0.12894
Corrected Total   112   21.76847

Root MSE           0.35909    R-Square    0.3425
Dependent Mean     3.06102    Adj R-Sq    0.3366
Coeff Var         11.73099

Parameter Estimates

                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1     3.77344     0.09959      37.89      <.0001
Region        1    -0.26834     0.03529      -7.60      <.0001
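One way to see why this fit is incorrect is to compare its predictions with those of the dummy-variable model. The Python sketch below copies the estimates from this output and from the four-dummy output in the next appendix, where Rhone's coefficient is set to 0 so the intercept is Rhone's fitted mean. The variable names are illustrative only.

```python
# Estimates copied from the two appendix outputs.
b0_lin, b1_lin = 3.77344, -0.26834      # REGION entered as a continuous predictor
b0_dum = 3.06296                        # intercept of the dummy model (Rhone set to 0)
dummy_coefs = {"Burgundy": 0.32372, "Bordeaux": 0.18382,
               "Rhone": 0.0, "Languedoc": -0.41690}
codes = {"Burgundy": 1, "Bordeaux": 2, "Rhone": 3, "Languedoc": 4}

# Fitted mean log price for each region under each model.
linear_fit = {r: b0_lin + b1_lin * c for r, c in codes.items()}
dummy_fit = {r: b0_dum + b for r, b in dummy_coefs.items()}

order = ["Burgundy", "Bordeaux", "Rhone", "Languedoc"]
for r in order:
    print(f"{r:10s} linear {linear_fit[r]:.3f}   dummies {dummy_fit[r]:.3f}")

# Step between successive region codes under each model.
linear_steps = [linear_fit[b] - linear_fit[a] for a, b in zip(order, order[1:])]
dummy_steps = [dummy_fit[b] - dummy_fit[a] for a, b in zip(order, order[1:])]
print("linear steps:", [round(s, 3) for s in linear_steps])
print("dummy steps: ", [round(s, 3) for s in dummy_steps])
```

The continuous coding forces the same drop of 0.268 between each pair of adjacent codes, giving fitted values of roughly 3.505, 3.237, 2.968 and 2.700. The dummy model allows unequal gaps and gives roughly 3.387, 3.247, 3.063 and 2.646. Because the region codes are arbitrary labels, the equal-spacing constraint has no substantive meaning.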
Appendix: What happens if you include all 4 REGIONAL dummies?
Regression results when including all 4 REGIONAL dummies
The REG Procedure
Model: MODEL1
Dependent Variable: Lprice

Analysis of Variance

                          Sum of        Mean
Source             DF    Squares      Square    F Value    Pr > F
Model               3    7.88009     2.62670      20.62    <.0001
Error             109   13.88838     0.12742
Corrected Total   112   21.76847

Root MSE           0.35695    R-Square    0.3620
Dependent Mean     3.06102    Adj R-Sq    0.3444
Coeff Var         11.66128

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

    Rhone = Intercept - Languedoc - Burgundy - Bordeaux

Parameter Estimates

                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     B     3.06296     0.07443      41.15      <.0001
Languedoc     B    -0.41690     0.09893      -4.21      <.0001
Burgundy      B     0.32372     0.14035       2.31      0.0230
Bordeaux      B     0.18382     0.08966       2.05      0.0427
Rhone         0           0           .          .           .
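The "not full rank" note follows directly from how the dummies are built. Here is a minimal Python sketch, using made-up region codes for six wines rather than the wine data and a hypothetical `rank` helper, showing that the four dummies always sum to the intercept column, so a design matrix containing all of them has one redundant column.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix by exact Gauss-Jordan elimination (hypothetical helper)."""
    m = [[Fraction(v) for v in row] for row in matrix]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(len(m)):
            if r != rk and m[r][col] != 0:
                factor = m[r][col] / m[rk][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


# Made-up region codes (1 = Burgundy, 2 = Bordeaux, 3 = Rhone, 4 = Languedoc).
regions = [1, 2, 2, 3, 4, 4]

# Columns: Intercept, Burgundy, Bordeaux, Rhone, Languedoc.
design = [[1] + [int(r == code) for code in (1, 2, 3, 4)] for r in regions]

# The identity SAS prints: Rhone = Intercept - Languedoc - Burgundy - Bordeaux.
rhone_is_redundant = all(row[3] == row[0] - row[4] - row[1] - row[2] for row in design)

# Dropping the Rhone column leaves a full-rank design.
reduced = [row[:3] + row[4:] for row in design]

print("Rhone redundant:", rhone_is_redundant)
print("rank with all 4 dummies:", rank(design), "of", len(design[0]), "columns")
print("rank without Rhone:     ", rank(reduced), "of", len(reduced[0]), "columns")
```

With all four dummies the five columns have rank 4, so the least-squares solution is not unique. Omitting one dummy, as SAS does here by setting Rhone to 0, restores a full-rank design and makes Rhone the reference category.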