© Andrew Ho, Harvard Graduate School of Education
Unit 8: Categorical predictors I: Dichotomies Class 19…
Cartoons: http://xkcd.com/74/ and http://xkcd.com/210/
Unit 8 / Page 1
Where is Unit 8 in our 11-Unit Sequence?

Building a solid foundation
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model

Mastering the subtleties
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression in practice. Common extensions.
Unit 8 / Page 2
In this unit, we’re going to cover…
• The dichotomous (dummy) variable as a predictor like any other
• Naming conventions: the variable name vs. the reference category
• Equivalence of the two-sample t-test
• The dummy variable in a multiple regression
• Graphic displays of regression findings: How do you decide which effects to highlight?
• Adjusting means: A simple way of presenting findings for categorical question predictors
• Displaying and interpreting conditional regression lines
Unit 8 / Page 3
Categorical variables
Unit 8 / Page 4
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
Assumptions focus on Y or, more precisely, on ε: i.i.d. normal with mean 0.
No assumptions about the distributions of the Xs, except that they are free of measurement error…
Nominal variables (unordered values): Sex, Religion, Political party
Ordinal variables (ordered values): Educational level, English learner status (?), Test scores (?)
Another important distinction: Dichotomies (only 2 categories) vs. polychotomies (>2 categories)
Dummy (or indicator) variables: 0/1 variables whose sole purpose is to identify yes/no membership in a particular category
FEMALE = 1 if female, 0 if male
TREAT = 1 if treated, 0 if control
By convention, the variable name corresponds to the category given the value 1. By convention, the category given the value 0 is called the reference category.
Categorical variables are those whose values denote categories
In multiple regression, nominal variables can only enter as dummy variables (or many dummies, as we’ll see next unit).
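The 0/1 coding above can be sketched in a line of code. This is a minimal illustration, not from the course materials; the course uses Stata, but the logic is language-agnostic:

```python
# Build the FEMALE dummy from a nominal variable. The dummy is named for
# the 1-category ("female"); the 0-category ("male") is the reference.
sexes = ["male", "female", "female", "male", "female"]  # hypothetical data
female = [1 if s == "female" else 0 for s in sexes]
print(female)  # [0, 1, 1, 0, 1]
```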
Do mandatory seat belt laws reduce fatalities?
Unit 8 / Page 5
Source: Calkins, LN & Zlatoper, TJ (2001). The effects of mandatory seat belt laws on motor vehicle fatalities in the United States, Social Science Quarterly, 82(4), 716-732
State-level data from all 50 states in 1997:
occfatal – number of occupant fatalities in 1997
beltlaw – whether the state has a mandatory seatbelt law
miles – total vehicle miles driven in the state (in millions)
(Note: currently, only NH has no mandatory seatbelt law.)
. list state occfatal beltlaw miles, clean    (first ten states shown)

       state   occfatal   beltlaw    miles
  1.      AL        777         0    53458
  2.      AK         47         0     4387
  3.      AZ        458         0    43491
  4.      AR        427         0    28144
  5.      CA       1817         1   285612
  6.      CO        375         0    37746
  7.      CT        199         1    28552
  8.      DE         84         0     8007
  9.      FL       1478         0   134007
 10.      GA        973         1    93317
Outcome: Number of occupant fatalities in 1997.
Question predictor: Mandatory seatbelt law (1 for states with the law, 0 otherwise).
Covariate: Total miles driven.
Hypothesis 1: Seat belt laws save lives because seat belts save lives
Hypothesis 2: The Offset Hypothesis: Seat belts encourage riskier driving behavior that may offset any benefit associated with increased seat belt use
occfatal = β0 + β1·beltlaw + β2·miles + ε
(Y = β0 + β1X1 + β2X2 + ε)
Standard univariate and bivariate exploratory descriptives
Unit 8 / Page 6
[Scatterplot matrices. Before transformation: Number of occupant fatalities in 1997, Mandatory seatbelt law, Total vehicle miles driven in the state. After transformation: Log(Number of occupant fatalities), Mandatory seatbelt law, Log(Total vehicle miles driven in the state).]
This is one of those cases where both outcome and predictor suggest that log transformations will aid model fit substantially.
Coefficients must be interpreted on the log scale for Y (exponentiate the coefficient: exp(b) − 1 is the estimated proportional increase or decrease).
[Histograms: Distribution of occfatal (Number of occupant fatalities in 1997) and Distribution of miles (Total vehicle miles driven in the state), shown before and after transformation.]
Mean differences: The old-fashioned way. Stata’s by option
Unit 8 / Page 7
Why not start simple? Did states with mandatory seat belt laws have fewer fatalities?
. bysort beltlaw: summarize occfatal

-> beltlaw = 0
    Variable |  Obs      Mean   Std. Dev.   Min    Max
    occfatal |   36  416.0833   337.0085     44   1478

-> beltlaw = 1
    Variable |  Obs      Mean   Std. Dev.   Min    Max
    occfatal |   14  688.8571   583.8415     83   2012

. bysort beltlaw: summarize logocc

-> beltlaw = 0
    Variable |  Obs      Mean   Std. Dev.       Min       Max
      logocc |   36  5.642575   .9755272    3.78419  7.298445

-> beltlaw = 1
    Variable |  Obs      Mean   Std. Dev.       Min       Max
      logocc |   14  6.208296   .8718579    4.41884  7.606884
[Bar chart: mean fatalities in the no-law and mandatory-seat-belt groups.]

. graph bar (mean) occfatal, over(beltlaw, relabel(1 "No law" 2 "Mandatory seat belts")) ytitle(Mean fatalities in 2007)
[Bar chart: mean log(fatalities) in the no-law and mandatory-seat-belt groups.]

. graph bar (mean) logocc, over(beltlaw, relabel(1 "No law" 2 "Mandatory seat belts")) ytitle(Mean log(fatalities) in 2007)
Is this surprising?
t-tests for significant mean differences: The old-fashioned way
Unit 8 / Page 8
. ttest occfatal, by(beltlaw)

Two-sample t test with equal variances

   Group |  Obs      Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
       0 |   36  416.0833    56.16808   337.0085  302.0561   530.1106
       1 |   14  688.8571    156.0382   583.8415  351.7571   1025.957
combined |   50    492.46    61.13366   432.2802  369.6073   615.3127
    diff |      -272.7738     131.812             -537.7997  -7.74794

    diff = mean(0) - mean(1)                          t = -2.0694
    Ho: diff = 0                     degrees of freedom =      48

    Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
 Pr(T < t) = 0.0220   Pr(|T| > |t|) = 0.0439   Pr(T > t) = 0.9780
. ttest logocc, by(beltlaw)

Two-sample t test with equal variances

   Group |  Obs      Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
       0 |   36  5.642575    .1625879   .9755272  5.312505   5.972646
       1 |   14  6.208296    .2330138   .8718579  5.704901   6.711692
combined |   50  5.800977    .1376414   .9732718  5.524377   6.077578
    diff |      -.5657208    .2987713             -1.166441  .0349992

    diff = mean(0) - mean(1)                          t = -1.8935
    Ho: diff = 0                     degrees of freedom =      48

    Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
 Pr(T < t) = 0.0322   Pr(|T| > |t|) = 0.0643   Pr(T > t) = 0.9678
[Bar charts repeated from the previous page: mean fatalities and mean log(fatalities) in the no-law and mandatory-seat-belt groups.]
• A review of two-sample t-tests.
  – The standard error of the difference: the estimated standard deviation of the distribution of mean differences under repeated sampling.
  – The test statistic: t = (X̄a − X̄b) / se(X̄a − X̄b)
  – The decision rule: if |t| > tcrit on df = na + nb − 2, reject H0: μa − μb = 0.
• Our conclusion: States with mandatory seat belt laws have higher average numbers of occupant fatalities than states without them (not significant on the log scale).
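These formulas can be checked against the occfatal output above: plugging in the group summaries reproduces Stata’s t statistic. A minimal Python sketch (the course itself uses Stata; the numbers are copied from the output):

```python
from math import sqrt

# Group summaries from `ttest occfatal, by(beltlaw)` above.
n0, m0, s0 = 36, 416.0833, 337.0085   # beltlaw = 0
n1, m1, s1 = 14, 688.8571, 583.8415   # beltlaw = 1

# Pooled (equal-variance) estimate and standard error of the difference.
sp2 = ((n0 - 1) * s0**2 + (n1 - 1) * s1**2) / (n0 + n1 - 2)
se = sqrt(sp2 * (1 / n0 + 1 / n1))
t = (m0 - m1) / se
print(round(se, 1), round(t, 2))  # 131.8 -2.07, matching the Stata output
```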
What is a slope but a difference? t-tests in a regression framework
Unit 8 / Page 9
[Bar charts: mean fatalities in 2007 and mean log(fatalities) in 2007 for the no-law and mandatory-seat-belt groups.]
The mean difference between states without and with mandatory seatbelt laws in a t-test and regression slope framework, respectively.
The regression line passes through the mean values when beltlaw = 0 and beltlaw = 1, as expected of an algorithm that minimizes squared residuals.
[Scatterplots of Number of occupant fatalities in 1997 and log(Number of occupant fatalities in 1997) against the mandatory seatbelt law indicator (0 - No law, 1 - Law), with fitted regression lines.]
occfatal = β0 + β1·beltlaw + ε
(Y = β0 + β1X1 + ε)
A significant slope *is* a significant mean difference
Unit 8 / Page 10
. ttest occfatal, by(beltlaw) and . ttest logocc, by(beltlaw): output repeated from the previous page (t = -2.0694, p = 0.0439 and t = -1.8935, p = 0.0643, respectively, each on 48 degrees of freedom).
. regress occfatal beltlaw

      Source |       SS       df       MS          Number of obs =      50
       Model |  750007.956     1  750007.956      F(  1,    48) =    4.28
    Residual |  8406436.46    48  175134.093      Prob > F      =  0.0439
       Total |  9156444.42    49  186866.213      R-squared     =  0.0819
                                                  Adj R-squared =  0.0628
                                                  Root MSE      =  418.49

    occfatal |     Coef.   Std. Err.     t    P>|t|    [95% Conf. Interval]
     beltlaw |  272.7738     131.812   2.07   0.044     7.74794   537.7997
       _cons |  416.0833    69.74838   5.97   0.000    275.8448   556.3218
. regress logocc beltlaw

      Source |       SS       df       MS          Number of obs =      50
       Model |  3.22600386     1  3.22600386      F(  1,    48) =    3.59
    Residual |  43.1896388    48  .899784141      Prob > F      =  0.0643
       Total |  46.4156426    49  .947258013      R-squared     =  0.0695
                                                  Adj R-squared =  0.0501
                                                  Root MSE      =  .94857

      logocc |     Coef.   Std. Err.     t    P>|t|    [95% Conf. Interval]
     beltlaw |  .5657208    .2987713   1.89   0.064   -.0349992   1.166441
       _cons |  5.642575    .1580949  35.69   0.000    5.324704   5.960447
• The p-values are identical. The degrees of freedom are the same. Recall F = t².
• A unit difference in X is associated with a difference of 272.77 fatalities… oh, right!
  – The estimated slope is the mean difference… (given all else in the model).
  – On the log scale, states with laws have 76% more occfatal.
• And the constant that we usually ignore… the predicted Y value when X = 0.
  – The constant is the mean of the reference category… It’s interpretable!
• What if we switch the 0/1 assignment from no-law/yes-law to yes-law/no-law?
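The two claims above — slope equals the mean difference, constant equals the reference-category mean — can be verified by hand with ordinary least squares on a toy example. A hypothetical five-observation sketch in Python (the course itself uses Stata):

```python
from statistics import mean

# Hypothetical data: x is a 0/1 dummy, y is the outcome.
x = [0, 0, 0, 1, 1]
y = [2.0, 4.0, 6.0, 10.0, 14.0]   # group means: 4.0 (x=0) and 12.0 (x=1)

xbar, ybar = mean(x), mean(y)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

# Slope = 12.0 - 4.0 (the mean difference); intercept = reference-group mean.
print(round(intercept, 6), round(slope, 6))  # 4.0 8.0
```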
Effect of reversing the 0/1 labels
• Reversing the 0/1 labels will change the sign of the slope but not the magnitude.
• The constant will always be the mean of the reference (0) category.
• All significance tests and the full ANOVA table will be unaffected.
• In general, follow the convention of naming the variable by the 1-category, leaving 0 as the reference, to avoid confusion.
Unit 8 / Page 11
. regress logocc nolaw

      Source |       SS       df       MS          Number of obs =      50
       Model |  3.22600386     1  3.22600386      F(  1,    48) =    3.59
    Residual |  43.1896388    48  .899784141      Prob > F      =  0.0643
       Total |  46.4156426    49  .947258013      R-squared     =  0.0695
                                                  Adj R-squared =  0.0501
                                                  Root MSE      =  .94857

      logocc |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
       nolaw | -.5657208    .2987713   -1.89   0.064   -1.166441   .0349992
       _cons |  6.208296    .2535159   24.49   0.000    5.698569   6.718024
. regress logocc beltlaw    (repeated from the previous page; the ANOVA table is identical)

      logocc |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
     beltlaw |  .5657208    .2987713    1.89   0.064   -.0349992   1.166441
       _cons |  5.642575    .1580949   35.69   0.000    5.324704   5.960447
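The two fits are linked by simple arithmetic. A quick check, with the estimates copied from the output above (a Python sketch; the course itself uses Stata):

```python
# From `regress logocc beltlaw`: intercept = no-law mean, slope = difference.
cons_belt, slope_belt = 5.642575, 0.5657208

# Reversing the 0/1 coding flips the slope's sign and moves the intercept
# to the other group's mean, as in `regress logocc nolaw`.
cons_nolaw = cons_belt + slope_belt
slope_nolaw = -slope_belt
print(round(cons_nolaw, 6), slope_nolaw)  # 6.208296 -0.5657208
```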
[Scatterplot of log(Number of occupant fatalities in 1997) against the reversed indicator, No mandatory seatbelt law (0 - Law, 1 - No law), with fitted regression line.]
Predicting the log number of occupant fatalities in 1997
-----------------------------------------
              Law Only        ANCOVA
-----------------------------------------
BeltLaw          0.566       -0.0502
                (1.89)       (-0.57)
LogMiles                      0.955***
                             (23.72)
_cons            5.643***    -4.123***
               (35.69)       (-9.96)
-----------------------------------------
N                   50            50
R-sq             0.070         0.928
adj. R-sq        0.050         0.925
F                3.585         304.2
df_m                 1             2
df_r                48            47
-----------------------------------------
t statistics in parentheses
BeltLaw is an indicator variable for a mandatory state seatbelt law.
LogMiles is the log of total vehicle miles driven in 2007, in millions.
* p<0.05, ** p<0.01, *** p<0.001
ANCOVA: ANalysis of COVAriance
• ANCOVA is a general term that can encompass both Analysis Of VAriance (ANOVA) and regression; however, it tends to refer specifically to investigation of a (usually dichotomous) question predictor with one or more (usually continuous) covariates.
• ANCOVA can be readily implemented within our familiar regression framework.
Unit 8 / Page 12
log(occ) = β0 + β1·beltlaw + β2·log(miles) + ε
(Y = β0 + β1X1 + β2X2 + ε)
. regress logocc beltlaw logmiles

      Source |       SS       df       MS          Number of obs =      50
       Model |  43.0871528     2  21.5435764      F(  2,    47) =  304.21
    Residual |  3.32848985    47  .070818933      Prob > F      =  0.0000
       Total |  46.4156426    49  .947258013      R-squared     =  0.9283
                                                  Adj R-squared =  0.9252
                                                  Root MSE      =  .26612

      logocc |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
     beltlaw | -.0501633    .0877473   -0.57   0.570   -.2266881   .1263615
    logmiles |   .955139    .0402593   23.72   0.000    .8741477    1.03613
       _cons | -4.122666    .4139898   -9.96   0.000   -4.955506  -3.289826
States with mandatory seat belt laws have more occupant fatalities; when accounting for the number of miles driven, however, laws are negatively associated with fatalities (about 5% fewer). The differences are not statistically significant.
[Histograms of Log(Number of occupant fatalities), graphs by mandatory seatbelt law (panels nolaw and beltlaw).]
Regression diagnostics and homoscedasticity
• Are residual plots still relevant?
• Absolutely.
• Homoscedasticity is an assumption both in regression and for t-tests (and for ANOVA/ANCOVA in general).
Unit 8 / Page 13
[Residuals-versus-fitted plot for the simple regression of logocc on beltlaw.]
• Stata’s by option will continue to be useful for diagnostics, plotting, and reporting whenever there are dichotomous predictors.
. histogram logocc, by(beltlaw) freq
[Scatterplot of Log(Number of occupant fatalities) against Log(Total vehicle miles driven in the state), each state labeled by its two-letter abbreviation, with the two conditional regression lines.]
Graphical display of results
Unit 8 / Page 14
beltlaw=1
beltlaw=0
Difference is not statistically significant.
logocc = −4.123 − .050·beltlaw + .955·logmiles
[Scatterplot of Log(Number of occupant fatalities) against Log(Total vehicle miles driven in the state), points labeled "no"/"yes" for seatbelt-law status, with the two conditional regression lines.]
Visualizing the covariate’s reversal of the mean difference
Unit 8 / Page 15
Belt-law states have more occupant fatalities but also more total miles driven. When accounting for the total miles driven, states with belt laws have fewer fatalities, although the differences are not significant.
beltlaw=1 and beltlaw=0 (line labels)
[Four scatterplots of Log(Number of occupant fatalities) against: Log(Total vehicle miles driven in the state); Normal daily mean state temperature; Population density per square mile; and Percentage of urban miles to total miles.]
Incorporating other predictors
Unit 8 / Page 16
One of the nuisances arising from log transformations of outcome variables is that additional predictors frequently need transformation as well.
[Four scatterplots of Log(Number of occupant fatalities) against: Log(Total vehicle miles driven in the state); Log(Daily mean state temperature); Log(Population density); and Log(Percent urban miles).]
Incorporating other predictors
Unit 8 / Page 17
This is rough. Ideally we would use rvfplots, not scatterplots. And ideally we would also use rvfplots from multiple regression models that we’re actually fitting, not just these.
Density seems better, but temperature and urban miles are borderline. Try it both ways?
Predicting the log number of occupant fatalities in 1997
---------------------------------------------------------------------------------
             Model A    Model AB       ABC        ABCDE       ABCFG       ABCEF
---------------------------------------------------------------------------------
Law (A)        0.566     -0.0502    -0.0332      -0.0968      -0.100      -0.101*
              (1.89)     (-0.57)    (-0.43)      (-1.93)     (-1.95)     (-2.05)
LMile (B)                 0.955***   1.034***     1.022***    1.024***    1.017***
                         (23.72)    (25.15)      (37.80)     (36.06)     (38.14)
LDens (C)                           -0.109***    -0.0590**   -0.0746**   -0.0635**
                                    (-3.81)      (-2.71)     (-3.43)     (-2.97)
Temp (D)                                          0.0190***
                                                 (6.80)
%Urb (E)                                         -0.00902***             -0.00879***
                                                 (-5.33)                 (-5.31)
LTemp (F)                                                     1.128***    1.102***
                                                             (6.91)      (7.07)
L%Urb (G)                                                    -0.387***
                                                             (-4.67)
_cons          5.643***  -4.123***  -4.485***    -5.115***   -7.492***   -8.402***
             (35.69)     (-9.96)   (-11.90)     (-20.46)    (-11.78)    (-14.31)
---------------------------------------------------------------------------------
N                 50          50         50           50          50          50
R-sq           0.070       0.928      0.946        0.979       0.978       0.980
adj. R-sq      0.050       0.925      0.942        0.977       0.976       0.978
F              3.585       304.2      266.1        418.7       397.4       437.0
df_m               1           2          3            5           5           5
df_r              48          47         46           44          44          44
---------------------------------------------------------------------------------
t statistics in parentheses
Predictor A is an indicator variable for a mandatory state seatbelt law.
Predictor B is the log of total vehicle miles driven in 2007 in millions.
Predictor C is the log of the population density per square mile.
Predictor D is the mean state daily temperature.
Predictor E is the percentage of urban miles driven in the state.
Predictors F and G are log transformations of Predictors D and E respectively.
* p<0.05, ** p<0.01, *** p<0.001
Building a model for occupant fatalities
Unit 8 / Page 18
Remember: When interpreting coefficients in a model with a log outcome variable, small coefficients (less than 0.3) are approximately equal to a percent increase or decrease, e.g., -.1 predicts an approximate 10% decline in fatalities accounting for all else in the model.
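The quality of this approximation is easy to check with a small sketch (Python rather than Stata; the -0.0502 value is the ANCOVA beltlaw coefficient from page 12):

```python
from math import exp

# exp(b) - 1 is the exact proportional change implied by coefficient b on
# a log outcome; for small |b| it is close to b itself.
for b in (-0.10, -0.0502, 0.25):
    print(f"b = {b:+.4f}   exact change = {exp(b) - 1:+.4f}")
# b = -0.10 gives about -0.0952: close to the "10% decline" shorthand.
```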
What do we think of these findings?
Conditional regression lines, revisited
Unit 8 / Page 19
. regress logocc beltlaw logmiles logden temp pcturban

      Source |       SS       df       MS          Number of obs =      50
       Model |  45.4600978     5  9.09201956      F(  5,    44) =  418.66
    Residual |  .955544825    44  .021716928      Prob > F      =  0.0000
       Total |  46.4156426    49  .947258013      R-squared     =  0.9794
                                                  Adj R-squared =  0.9771
                                                  Root MSE      =  .14737

      logocc |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
     beltlaw | -.0967979    .0500726   -1.93   0.060   -.1977127   .0041168
    logmiles |  1.022367    .0270467   37.80   0.000    .9678582   1.076876
      logden | -.0590361    .0217956   -2.71   0.010   -.1029623  -.0151099
        temp |  .0189964    .0027955    6.80   0.000    .0133624   .0246304
    pcturban | -.0090171    .0016909   -5.33   0.000   -.0124248  -.0056093
       _cons | -5.114937    .2499949  -20.46   0.000   -5.618768  -4.611105
logocc = −5.115 − .0968·beltlaw + 1.022·logmiles − .059·logden + .019·temp − .009·pcturban
. summarize beltlaw logmiles logden temp pcturban

    Variable |  Obs      Mean   Std. Dev.        Min       Max
     beltlaw |   50       .28   .4535574          0         1
    logmiles |   50  10.40444   .9885524   8.386401  12.56239
      logden |   50  4.288666   1.385434  -.0101779  6.887838
        temp |   50    54.256   8.459535       40.6      77.2
    pcturban |   50  52.30357   17.40729   22.33816  85.53246
We have already looked at the relationship with logmiles, which is fairly obvious. Let’s look more closely at pcturban by placing it on the x-axis with beltlaw in the legend. In order to plot conditional regression lines, we fix the other predictor values (logmiles, logden, temp) at their means.
logocc = −5.115 − .0968·beltlaw + 1.022·(10.404) − .059·(4.289) + .019·(54.26) − .009·pcturban
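A quick sketch of this conditioning arithmetic (Python rather than Stata; the coefficient estimates and covariate means are copied from the output above):

```python
# Coefficients from `regress logocc beltlaw logmiles logden temp pcturban`.
b = {"_cons": -5.114937, "beltlaw": -0.0967979, "logmiles": 1.022367,
     "logden": -0.0590361, "temp": 0.0189964, "pcturban": -0.0090171}
# Covariates fixed at their sample means (from `summarize`).
means = {"logmiles": 10.40444, "logden": 4.288666, "temp": 54.256}

def logocc_hat(beltlaw, pcturban):
    """Predicted logocc with logmiles, logden, and temp held at their means."""
    return (b["_cons"] + b["beltlaw"] * beltlaw
            + b["logmiles"] * means["logmiles"]
            + b["logden"] * means["logden"]
            + b["temp"] * means["temp"]
            + b["pcturban"] * pcturban)

# The vertical gap between the two conditional lines is just the beltlaw
# coefficient, at any value of pcturban.
print(round(logocc_hat(1, 50) - logocc_hat(0, 50), 4))  # -0.0968
```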
[Conditional regression lines: Log(Number of occupant fatalities) against Percentage of urban miles to total miles, one line per seatbelt-law group.]
Conditional regression lines with multiple covariates
Unit 8 / Page 20
With multiple covariates, we generally do not want to include the bivariate scatterplot, as the conditional regression lines may not seem to fit and will be distracting. The fit looks poor because the pcturban–logocc relationship is positive unconditionally but negative when accounting for other variables: the scatterplot shows the former, the lines show the latter. Removing the scatterplot also leaves room to display more relationships. Let’s include temperature.
No law
Law
Difference is not statistically significant.
[Dot plot of Normal daily mean state temperature (roughly 40 to 80 degrees), each point labeled with its state abbreviation.]
Select prototypical temperature values
Unit 8 / Page 21
. dotplot temp, mlabel(state)
We were already showing the relationship between logocc, pcturban, and beltlaw, while holding logmiles, logden, and temp constant at their average values. Let’s also visualize the “effect” of temp by picking two prototypical values.
A warmer, southern state (65)
A cooler, northern state (50)
[Conditional regression lines: Log(Number of occupant fatalities) against Percentage of urban miles to total miles, four lines for the law-by-temperature combinations.]
Five predictors: one on the axis, two in the legend, two fixed at averages.
Unit 8 / Page 22
logocc = −5.115 − .0968·beltlaw + 1.022·(10.404) − .059·(4.289) + .019·temp − .009·pcturban
Warm state (65) vs. cool state (50) difference: 33%
No-law vs. law: 10% (not significant)
. di exp(.019*15)
1.329762
Conditional regression lines assuming average log(total miles driven) and average log(population density).
No law, warm stateLaw, warm state
No law, cool stateLaw, cool state
Note the “Main Effects” assumptions here: All effects are the same across every level and predictor.
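The two percentages quoted on this slide follow from the Model ABCDE coefficients. A quick check (Python rather than Stata; values copied from the output on the previous pages):

```python
from math import exp

# Warm (65) vs. cool (50) state: temp coefficient times 15 degrees.
warm_vs_cool = exp(0.019 * (65 - 50)) - 1   # about 33% more fatalities
# Law vs. no-law: the beltlaw coefficient itself.
law_vs_nolaw = exp(-0.0968) - 1             # about 9-10% fewer fatalities
print(round(warm_vs_cool, 3), round(law_vs_nolaw, 3))  # 0.33 -0.092
```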
Adjusted mean differences
Unit 8 / Page 23

[Model table repeated from page 17, highlighting the Law (A) row: 0.566, -0.0502, -0.0332, -0.0968, -0.100, -0.101* across Models A, AB, ABC, ABCDE, ABCFG, and ABCEF.]
The beltlaw coefficient is the mean difference given everything else in the model. This row thus shows the adjusted mean differences. Make sure that your dichotomous variable is scaled to 0/1 or 1/2, or is otherwise treated as an indicator.
[Scatterplot repeated from page 15: Log(Number of occupant fatalities) against Log(Total vehicle miles driven in the state), points labeled "no"/"yes", with the Model AB conditional regression lines.]
Visualizing the adjusted mean difference for Model AB
Unit 8 / Page 24
beltlaw=1 and beltlaw=0 (line labels)
Difference is not statistically significant.
[Conditional regression lines repeated from page 22: Log(Number of occupant fatalities) against Percentage of urban miles to total miles, for Model ABCDE.]
Visualizing the adjusted mean difference for Model ABCDE
Unit 8 / Page 25
No-law vs. law: 10% (not significant)
No law, warm stateLaw, warm state
No law, cool stateLaw, cool state
Conditional regression lines assuming average log(total miles driven) and average log(population density).
Best practices for conditional regression lines
1) Estimate the regression coefficients (regress)
2) Calculate the means of the predictors (summarize)
3) Write out the (model and) prediction equation
4) Decide what to condition on, and pick prototypical values for those covariates (means are a good default choice for a single value, but, whatever you choose, you must make the prototypical value clear and relevant).
5) When plotting and adjusting, consider using _b[_cons] and _b[coefname] to increase precision and minimize copy/paste error.
6) Be explicit about which variables you are conditioning on and adjusting for.
7) Be explicit about whether apparent differences are statistically significant.
Unit 8 / Page 26
log(occ) = β0 + β1·beltlaw + β2·log(miles) + β3·log(den) + β4·temp + β5·pcturban + ε
logocc = −5.115 − .0968·beltlaw + 1.022·logmiles − .059·logden + .019·temp − .009·pcturban
• Regression models can include dichotomous predictors like any others.
  – Variable names refer to the 1 category; 0 is the reference category.
  – Switching the category definitions changes only the sign of the slope and the value of the constant.
  – Coefficients are estimated mean differences, adjusting for all other covariates in the model.
  – The simple linear regression on the dichotomous predictor is equivalent to the t-test.
• Inclusion of other covariates in the regression model (ANCOVA) can change coefficients, just as before.
  – Investigation of sensitivity under plausible model specifications is necessary, just as before.
• Results of complex analyses can be displayed more simply using tables and graphs.
  – As your models become more complex, the need for simple numerical and graphical displays remains.
  – Consider how you will communicate your results to colleagues and broader audiences.
  – Adjusted mean differences and prototypical regression lines are powerful tools.
  – But be clear about which variables and which levels of those variables (prototypical values) you are conditioning on.
  – And be clear about whether apparent differences are statistically significant.
• General tips:
  – Small coefficients (magnitudes less than about 0.3) can be interpreted as a predicted percent change for a log(outcome) variable, because exp(β) ≈ 1 + β.
  – Use the _b[coefname] stored coefficients after running a regression model in your graph twoway function code to estimate conditional regression lines.
What are the takeaways from this unit?
Unit 8 / Page 27
Glossary of terms
Unit 8 / Page 28
• Adjusted mean differences
• Analysis of Covariance (ANCOVA)
• Categorical variable (nominal and ordinal)
• Conditional regression line
• Dichotomous variable
• Dummy variable
• Main effects assumption
• Two-sample t-test