© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 1
Unit 7: Statistical control in depth: Correlation and collinearity
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 2
The S-030 roadmap: Where’s this unit in the big picture?
Building a solid foundation
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model

Mastering the subtleties
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression modeling in practice
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 3
In this unit, we’re going to learn about…
• What is really meant by statistical control?
 – Is statistical control always possible? The problem of collinearity
• Learning how to examine a correlation matrix and what it foreshadows for multiple regression
• Using Venn diagrams to develop your intuition about correlation
 – Measuring the additional explanatory power of additional predictors
• Partial correlation: terminology, interpretation, and relationship to simple correlation
• Multiple correlation: its relationship to R2
• Suppressor effects: When statistical control can help reveal an effect
• The dangers of multicollinearity:
 – what it is
 – how to spot it
 – what to do about it
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 4
When and why is statistical control important?
Randomized experiments: Statistical control is not as crucial
•Researcher actively intervenes in the system, observing how changes in X produce changes in Y
•Random assignment ensures that, on average, treated and control groups are equivalent on all observed and, even more importantly, unobserved variables
•Even so, statistical control still helps as it increases the precision of our estimates
16 March 1992
"Lead, Lies and Data Tape"

Two psychologists, both of whom have testified for the lead industry and one of whom has received tens of thousands of dollars in research grants from the industry, have filed misconduct charges against the scientist who first linked "low" levels of lead to cognitive problems in children. They don't suspect that Herbert Needleman of the University of Pittsburgh stole, faked or fabricated data. Rather, they say, he selected the data and the statistical model -- the equations for analyzing those data -- that show lead in the worst possible light…
The allegations center on a 1979 paper. It describes how Needleman and colleagues measured the lead in baby teeth, looking for a link between lead and intelligence. NIH told Pittsburgh to convene a panel of inquiry. The panel's report, submitted in December and obtained by NEWSWEEK, found that Needleman didn't "fabricate, falsify or plagiarize." It did have problems with how he decided whether or not to include particular children in his analysis, but called this "a result of a lack of scientific rigor rather than the presence of scientific misconduct." The panel found Needleman's statistical model "questionable," though. On that basis, the university launched an investigation.
Scarr, Ernhart and the Pittsburgh panel all condemn Needleman for not using a different model -- one that, say, factored in the age of each child. If he had, they say, lead would not have had an impact on IQ. But last year Environmental Protection Agency scientist (and recipient of a MacArthur Foundation "genius" award) Joel Schwartz reanalyzed Needleman's data. He factored in age explicitly. "I found essentially the identical results," he says.
Observational studies, sample surveys and quasi-experiments: Statistical control is much more important
•With no active external intervention, individuals effectively “choose” their own values of X
•Individuals with particular values of X may differ on observed variables; this is when statistical control can help
•More problematic is when individuals with particular values of X may also differ on unobserved variables; then you need statistical methods more advanced than those we cover in S-030
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 5
How statistical control can help, Example I: Cross-sectional study examining predictors of reading scores in elementary school

Model A: $\hat{READING} = 0.03 + 0.10\,HEIGHT$

[Scatterplot: READING vs. HEIGHT, points labeled by grade (1 through 6)]

Taller children have higher reading scores. Do we really believe this, or is there a 3rd variable for which we should statistically control?

Model B: $\hat{READING} = 0.02 + 0.90\,GRADE + 0.01\,HEIGHT$

Older students read better (duh). There’s no statistically significant relationship between reading scores and height once grade is controlled.

Main effect: we’ve assumed the effect to be the same across all grades = parallel lines.

Controlling for a predictor can stop us from concluding (erroneously) that a spurious correlation is real.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 6
How statistical control can help, Example II: Does the availability of guns save lives (or kill people)?

Model A: $\hat{CRIME} = 0.30 - 0.10\,GunLicenses$

[Scatterplot: violent crime rate vs. # gun licenses, communities ranging from very rural to very urban]

Communities with more gun licenses have lower violent crime rates. Do we really believe this, or is there a 3rd variable for which we should statistically control?

Model B: $\hat{CRIME} = 0.02 + 0.30\,URBAN + 0.30\,GunLicenses$

The more urban the community, the higher the violent crime rate. There’s now a positive relationship between gun licenses and the violent crime rate (the sign of the estimated regression coefficient is reversed!).

Controlling for a predictor can reveal or reverse the direction of an effect.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 7
How statistical control should be able to help (but sometimes can’t!): Sex discrimination in clerical salaries at Yale

Model A: $\hat{Wages} = 25 - 2\,Female$. On average, women have lower wages than men.

Model B: $\hat{JobStatus} = 10 - 2\,Female$. On average, women are in lower-status jobs than men.

Model C: $\hat{Wages} = 25 + 2\,JobStatus - 0.0001\,Female$. Higher-status jobs pay more, and there’s no statistically significant wage differential between men and women controlling for job status.

[Scatterplot: Wages vs. Job Status, separate point clouds for men and women]

Can we really control statistically for the effects of job status and really evaluate the effects of gender? If predictors are “too highly” correlated with each other, we can’t statistically control for the effect of one and evaluate the effects of the other: this is known as (multi)collinearity.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 8
Two new predictors for USNews: Research Funding & Pct Doc Students
Peer Ratings of US Graduate schools of education
ID  School        PeerRat    GRE    L2Doc    ResFund  PctDoc
 1  Harvard         450     6.625  5.90689    17.4     35.8
 2  UCLA            410     5.780  5.72792    36.4     46.8
 3  Stanford        470     6.775  5.24793    15.1     48.0
 4  TC              440     6.045  7.59246    30.1     37.5
 5  Vanderbilt      430     6.605  4.45943    23.0     48.6
 6  Northwestern    390     6.770  3.32193     8.8     47.0
 7  Berkeley        440     6.050  5.42626    12.0     56.3
 8  Penn            380     6.040  5.93074    19.0     41.0
 9  Michigan        430     6.090  5.24793    19.0     62.7
10  Madison         430     5.800  6.72792    25.5     53.8
. . .
RQ: Does research production predict variation in the peer ratings of GSEs? Two new predictors:
•Total Research $
•Pct Doctoral Students
Predictor: ResFund
Mean 11.29540 Std Dev 8.13018
[Stem-and-leaf plot and boxplot of ResFund, with UCLA (high outlier), TC, NYU, and HGSE labeled]
Predictor: PctDoc
Mean 38.1965517 Std Dev 16.3568807
[Stem-and-leaf plot and boxplot of PctDoc, with Stanford (high outlier), Claremont, UC Riverside, USC, Penn State, and HGSE labeled]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 9
Relationship between Peer Ratings and the two new predictors
The REG Procedure
Dependent Variable: PeerRat

                          Sum of        Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              1       42509       42509      27.24    <.0001
Error             85      132664  1560.74781
Corrected Total   86      175172

Root MSE         39.50630    R-Square    0.2427
Dependent Mean  344.82759    Adj R-Sq    0.2338
Coeff Var        11.45683

                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1     313.93941     7.27801      43.14      <.0001
ResFund        1       2.73458     0.52398       5.22      <.0001
$\hat{PeerRat} = 313.94 + 2.73\,ResFund$
The REG Procedure
Dependent Variable: PeerRat

                          Sum of        Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              1       38775       38775      24.16    <.0001
Error             85      136397  1604.67212
Corrected Total   86      175172

Root MSE         40.05836    R-Square    0.2214
Dependent Mean  344.82759    Adj R-Sq    0.2122
Coeff Var        11.61692

                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1     295.24240    10.96333      26.93      <.0001
PctDoc         1       1.29816     0.26408       4.92      <.0001
$\hat{PeerRat} = 295.24 + 1.30\,PctDoc$
[Scatterplots of PeerRat vs. ResFund and PeerRat vs. PctDoc, with HGSE and Stanford labeled in each]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 10
Examining the correlation matrix, Step 1: Get output (using PROC CORR)
Pearson Correlation Coefficients, N = 87
Prob > |r| under H0: Rho=0 (i.e., H0: ρ = 0)
PeerRat L2Doc GRE ResFund PctDoc
              PeerRat     L2Doc       GRE   ResFund    PctDoc
PeerRat       1.00000   0.46393   0.65654   0.49261   0.47048
                         <.0001    <.0001    <.0001    <.0001
L2Doc         0.46393   1.00000   0.14528   0.51096   0.31777
               <.0001              0.1794    <.0001    0.0027
GRE           0.65654   0.14528   1.00000   0.40573   0.17045
               <.0001    0.1794              <.0001    0.1145
ResFund       0.49261   0.51096   0.40573   1.00000   0.05695
               <.0001    <.0001    <.0001              0.6003
PctDoc        0.47048   0.31777   0.17045   0.05695   1.00000
               <.0001    0.0027    0.1145    0.6003
Describes cell entries (r and p-value, all with N = 87). Always list the outcome first so the table is easiest to read. Notice the symmetry.

Like most computer output, it provides “too much detail”: e.g., r = 0.32**, r = 0.17 (ns). Two decimal places and asterisks usually suffice: * p<0.05, ** p<0.01, *** p<0.001.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 11
Examining the correlation matrix, Step 2: Create a summary table
Correlation Matrix for Peer Ratings of Graduate Schools of Education (n=87)

                    Peer Rating   Log2(N doc grads)   Mean GRE   Research Funding   Pct Doc students
Peer Rating             1
Log2(N doc grads)    0.46***              1
Mean GRE             0.66***            0.15              1
Research Funding     0.49***            0.51***         0.41***          1
Pct Doc students     0.47***            0.32**          0.17           0.06                 1

* p<0.05, ** p<0.01, *** p<0.001
The correlation between each predictor and Peer Ratings is statistically significant (p<0.001). We already knew this on the basis of the simple linear regressions but, typically, we’d estimate these correlations before looking at those regression results.

The correlation between our two original predictors, GRE and L2Doc, is not statistically significant. Research funding is significantly correlated (p<0.001) with both program size and mean GRE scores. The percentage of doctoral students is significantly correlated (p<0.01) with the log(# of doctoral students), but not with either mean GRE or Research Funding.

What do these correlations foreshadow for multiple regression? The information in research funding may be redundant with other variables already in the model, but the information in PctDoc may explain additional variation in Peer Ratings.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 12
A visual inspection of correlations: PeerRat vs. each predictor
[Four scatterplots of PeerRat vs. each predictor: L2Doc (r = 0.46), GRE (r = 0.66), ResFund (r = 0.49), PctDoc (r = 0.47)]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 13
The scatterplot matrix: A graphic correlation matrix
[Pairwise correlations in the scatterplot matrix:
PeerRat with L2Doc .46, GRE .66, ResFund .49, PctDoc .47;
L2Doc with GRE .15, ResFund .51, PctDoc .32;
GRE with ResFund .41, PctDoc .17;
ResFund with PctDoc .06]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 14
Questions we can ask about correlations between variables: How Venn diagrams can help us understand complex interrelationships
One outcome (Y) and 2 predictors (X1 and X2)—generate 3 correlations to examine:
• Correlation between each predictor and Y: rY1 and rY2
• Correlation between the two predictors: r12
Our learning goal: To understand the interrelationships among the correlations
• How much variation in Y is explained by X1 and X2 together
• How much variation in Y is explained by X1 after controlling for X2
• How much variation in Y is explained by X2 after controlling for X1
[Venn diagrams showing Y, X1, and X2 as overlapping circles]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 15
Contrasting Venn diagrams with uncorrelated and correlated predictors

Uncorrelated predictors are very rare, arising mostly in designed experiments. We can compute the overall R2 by just summing the separate R2’s:

$R^2_{Y|12} = R^2_{Y|1} + R^2_{Y|2}$

where $R^2_{Y|1}$ is the R2 predicting Y using only X1 and $R^2_{Y|2}$ is the R2 predicting Y using only X2.

[Venn diagram: Y overlaps X1 and X2, but X1 and X2 do not overlap each other]

Correlated predictors are very common, arising in almost all studies. We can’t just sum the separate R2 statistics because of the overlap. Labeling the regions of the Venn diagram a (Y shared with X1 only), b (Y shared with X2 only), and c (Y shared with both):

$R^2_{Y|12} = a + b + c$
$R^2_{Y|1} = a + c$
$R^2_{Y|2} = b + c$

How do correlations between predictors affect their joint utility?
• Highly correlated predictors: the jointly explained portion “c” is large; the additional independent portions “a” and “b” are small
• Fairly uncorrelated predictors: the jointly explained portion “c” is small; the additional independent portions “a” and “b” are large
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 16
Measuring the additional explanatory power of an additional predictor

Assuming that X1 is already in the model, how can we measure X2’s additional contribution, over and above that already explained by X1?

[Venn diagram: regions a (Y shared with X1 only), b (Y shared with X2 only), c (Y shared with both), d (Y unexplained)]

Total Var(Y) = a + b + c + d
Residual Var(Y|X1) = b + d
Proportion of Residual Var(Y|X1) explained by X2 = b/(b + d)

Clarifying terminology and notation
• Simple correlation, $r_{Y2}$, and $R^2_{Y|2}$: proportion of variation in Y associated with X2
• Multiple correlation, $R^2_{Y|12}$: proportion of variation in Y associated with both X1 and X2
• Partial correlation, $r_{Y2|1}$: “Y2” identifies the variables being correlated; “|1” identifies the variable(s) being controlled (or partialled out)

Simple correlation squared: $r^2_{Y2} = \dfrac{b+c}{a+b+c+d}$

Partial correlation squared: $r^2_{Y2|1} = \dfrac{b}{b+d}$

How are partials related to simple correlations? Comparing these 2 equations, we see that b and d are in both denominators. So the relationship between simples and partials depends upon the size of “a” and “c” relative to “b” and “d”.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 17
Understanding the relationship between partial and simple correlations

• Partials can equal simples. When “a” and “c” are small: $\dfrac{b+c}{a+b+c+d} \approx \dfrac{b}{b+d}$, so Simple ≈ Partial. Most common reason: X1 is relatively uncorrelated with Y.

• Partials can be greater than simples. When “a” is large (and “c” is large or small): Partial > Simple. Most common reason: X1 is very highly correlated with Y.

• Partials can be smaller than simples. When “c” is large (and “a” isn’t very large): Partial < Simple. Most common reason: X1 is very highly correlated with X2.

[Three Venn diagrams of Y, X1, and X2, one for each case, with regions a, b, c, d]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 18
Partial correlations for the USNews data, controlling for L2Doc
Pearson Partial Correlation Coefficients, N = 87
Prob > |r| under H0: Rho=0

             PeerRat       GRE   ResFund    PctDoc
PeerRat      1.00000   0.67217   0.33561   0.38461
                        <.0001    0.0016    0.0003
GRE          0.67217   1.00000   0.38978   0.13249
              <.0001              0.0002    0.2240
ResFund      0.33561   0.38978   1.00000  -0.12934
              0.0016    0.0002              0.2353
PctDoc       0.38461   0.13249  -0.12934   1.00000
              0.0003    0.2240    0.2353
Describes cell entries (r and p-value, all with N = 87). Continue to list the outcome first so the table is easiest to read. Again, notice the symmetry. Like most computer output, it provides “too much detail”.

Major decision: Which variable(s), if any, should we partial out?
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 19
Understanding the link between partial correlations and MR
Partial Correlation Coefficients, controlling for L2Doc

             PeerRat       GRE   ResFund
PeerRat      1.00000   0.67217   0.33561
                        <.0001    0.0016
GRE          0.67217   1.00000   0.38978
              <.0001              0.0002
ResFund      0.33561   0.38978   1.00000
              0.0016    0.0002
PctDoc       0.38461   0.13249  -0.12934
              0.0003    0.2240    0.2353
Model: PeerRat on L2Doc and GRE
                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1     -87.29494    43.07364      -2.03      0.0459
L2Doc          1      15.34201     2.94746       5.21      <.0001
GRE            1      63.31660     7.60956       8.32      <.0001

Model: PeerRat on L2Doc and ResFund
                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1     262.96680    20.06665      13.10      <.0001
L2Doc          1      11.70357     4.31624       2.71      0.0081
ResFund        1       1.91994     0.58799       3.27      0.0016

Model: PeerRat on L2Doc and PctDoc
                     Parameter    Standard
Variable      DF      Estimate       Error    t Value    Pr > |t|
Intercept      1     233.68011    19.46286      12.01      <.0001
L2Doc          1      14.25165     3.83448       3.72      0.0004
PctDoc         1       0.99151     0.25964       3.82      0.0003
Partial correlations quantify the association between two variables after controlling statistically for one (or more) predictors. Multiple regression models quantify exactly the same thing. The two are intimately linked: the p-value for the partial correlation between Y and a predictor, say X2, after controlling statistically for other predictors, say just X1, is identical to the p-value for the slope coefficient for X2 in a multiple regression model that includes both X2 and X1.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 20
Comparing simple and partial correlations for the USNews data
Correlation Matrix for Peer Ratings of Graduate Schools of Education (n=87): simple and partial correlations (controlling for Log2(N doctoral graduates))

                    Peer Rating          Log2(N doc grads)   Mean GRE            Research Funding
Peer Rating             1
Log2(N doc grads)    0.46*** / --                1
Mean GRE             0.66*** / 0.67***        0.15 / --              1
Research Funding     0.49*** / 0.34**         0.51*** / --        0.41*** / 0.39***          1
Pct Doc students     0.47*** / 0.38***        0.32** / --         0.17 / 0.13             0.06 / -0.13

Cell entries are simple correlations / partial correlations. * p<0.05, ** p<0.01, *** p<0.001
The partial correlation with mean GRE is virtually unchanged, while the partial correlations with Research Funding and PctDoc students decline (but are still statistically significant). [This makes sense because Log2(N doc grads) was virtually uncorrelated with mean GRE, but was significantly correlated with Research Funding and PctDoc students.]

Research Funding remains correlated with mean GRE after controlling for program size. PctDoc students remains uncorrelated with the other predictors, even after controlling for program size.
[Diagram: Peer linked with L2Doc, GRE, ResFund, and PctDoc]
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 21
Results of fitting additional MR models to USNews data
Comparison of regression models predicting peer ratings of US Graduate Schools of Education (n=87) (US News and World Report, 2005)

Model C: Intercept -87.29* (43.07) t=-2.03; Log2(N doctoral grads) 15.34*** (2.95) t=5.21; Mean GRE scores 63.32*** (7.61) t=8.32. R2=57.0; F(2,84)=55.63, p<0.0001
Model D: Intercept 313.93*** (7.28) t=43.14; Research Funding 2.73*** (0.52) t=5.22. R2=24.3; F(1,85)=27.24, p<0.0001
Model E: Intercept 295.24*** (10.96) t=26.93; PctDoc 1.30*** (0.26) t=4.92. R2=22.1; F(1,85)=24.16, p<0.0001
Model F: Intercept -66.47 (47.95) t=-1.39; Log2(N doctoral grads) 13.65*** (3.40) t=4.01; Mean GRE scores 60.13*** (8.26) t=7.28; Research Funding 0.50 (0.50) t=0.99. R2=57.5; F(3,83)=37.40, p<0.0001
Model G: Intercept -78.34~ (39.73) t=-1.97; Log2(N doctoral grads) 11.91*** (2.85) t=4.19; Mean GRE scores 59.56*** (7.07) t=8.43; PctDoc 0.78*** (0.19) t=4.01. R2=64.0; F(3,83)=49.10, p<0.0001

Cell entries are estimated regression coefficients, (standard errors), and t-statistics. * p<0.05, ** p<0.01, *** p<0.001
Some things to consider when selecting models to present:
•Does the model chosen reflect your underlying theory?
•Does the model allow you to address the effects of your key question predictor(s)?
•Are you unnecessarily including predictors you could reasonably set aside (the parsimony principle)?
•Are you excluding predictors that are statistically significant? [If so, why exclude them?]
•Always realize that NO model is ever “final”

We’ll spend much, much, much more time on this topic in Unit 11.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 22
Is it always possible to statistically control?
Two parent Latino families in the National Child Care Survey: Predicting family income as a function of mother’s and father’s education (n=45)
Our language for MR has used many terms for statistical control:
• Controlling for X1
• Holding X1 constant
• Removing the effects of X1
This language assumes that we can really hold X1 constant and X2 will still vary across its full range. But is this always true? What happens if holding X1 constant dramatically restricts the range of X2: can we really statistically control for one predictor and evaluate the effects of another?

Example: National Child Care Survey
•n = 45 two-parent Latino families
•RQ: What is the relationship between parental education and family income?
•Two parental education predictors: mother’s and father’s education

Model A: Intercept -103.54 (148.31) t=-0.70; Mother’s education 31.58** (11.27) t=2.80. R2=15.4; F(1,43)=7.84, p=0.0076
Model B: Intercept -5.91 (114.27) t=-0.05; Father’s education 25.67** (9.18) t=2.80. R2=15.4; F(1,43)=7.83, p=0.0077
Model C: Intercept -141.19 (149.33) t=-0.95; Mother’s education 19.60 (14.12) t=1.39; Father’s education 15.89 (11.50) t=1.38. R2=19.1; F(2,42)=4.96, p=0.0116
Model D (composite AverEduc): Intercept -133.74 (140.34) t=-0.95; AverEduc 35.02** (11.01) t=3.18. R2=19.1; F(1,43)=10.12, p=0.0027

Cell entries are estimated regression coefficients, (standard errors), and t-statistics. * p<0.05, ** p<0.01, *** p<0.001
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 23
Multicollinearity: What it is, why it happens, how to spot it, and what to do
Correlation Matrix (NCCS data, n=45)

                      Income    Mother’s ed
Income                 1.00
Mother’s education     0.39**
Father’s education     0.39**      0.61***

[Venn diagram: Income, MomEd, and DadEd circles, with regions a, b, c, d]
What is multicollinearity? When two (or more) predictors are so highly correlated that we cannot statistically control for one predictor and evaluate the effect of the other(s). Examples:
•Mother’s & father’s education
•Gender and job status at Yale
•Family background & school resources
How to spot multicollinearity
•Controlled & uncontrolled slopes differ dramatically for two (or more) predictors
•The estimated controlled slopes make no sense (e.g., the signs appear wrong!)
•Standard errors increase with added predictors
•Reject omnibus F test but fail to reject individual t-tests for the constituent predictors
What to do about multicollinearity
•Use better research designs—especially randomized trials—that eliminate confounding
•Collect more data, especially “unusual cases”
•Collapse collinear predictors into a composite
•Include just one of the collinear predictors in your MR model (but be sure to explain what you did and why you did it)
What happens when we create a composite, Average Education, for the NCCS data?
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 24
Correlation Matrix for Reading data

           Reading   Height
Reading       1
Height      0.65        1
Grade       0.75      0.85
Caution: Don’t assume that all strongly correlated predictors are collinear
[Scatterplot: READING vs. HEIGHT, points labeled by grade (1 through 6)]

[Venn diagram: Reading, Height, and Grade circles, with regions a, b, c, d]
Holding Height constant, however, there is still variation in Grade, and that variation is associated with Reading.

$r_{Reading,Height|Grade} = 0.04$: partialling out Grade, there’s virtually no effect of Height.

$r_{Reading,Grade|Height} = 0.49$: partialling out Height, there’s still an effect of Grade.

Conclusion: Height and Grade are strongly correlated, but not collinear. Don’t abuse the phrase…
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 25
Coda: Sometimes the direction of an effect can change upon statistical control!
Suppressor effects in predicting faculty salaries at the University of Kansas
Correlation matrix for salary survey data

            SALARY     RANK     DEPT HD
SALARY       1.00
RANK         0.66***   1.00
DEPT HD      0.69***   0.30**    1.00
YRS SERV     0.13      0.61***  -0.08

* p<.05, ** p<.01, *** p<.001
$\hat{SALARY} = 64{,}217 + 10{,}493\,RANK + 15{,}905\,DEPTHD - 362\,YRSSERV$

•Higher-ranked professors have higher salaries
•Department heads have higher salaries
•Higher-ranked professors are more likely to be department heads
•The more years of service, the higher the rank
But there are two very surprising findings concerning Years of Service:
•No correlation between years of service and salary?
•No correlation between years of service and being a department head?
[Plot: Salary ($60,000 to $100,000) vs. Years of Service at University (0 to 20), with separate fitted lines for Asst, Assoc, and Full Professors, split by Dept Head vs. Not Dept Head; R2 = 65%]
“Salary compression…the failure of the organization to recognize seniority with adequate compensation increase while meeting current market values for lower ranked individuals hired into the institution.” McCulley & Downey (1993) Salary compression in faculty salaries: Identification of a suppressor effect. Educational and Psychological Measurement, 53, 79-86.
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 26
Start looking at the results sections of papers in your substantive fields…
Michal Kurlaender & John Yun (2007) Measuring school racial composition and student outcomes
in a multiracial society, American Journal of Education, 113, 213-242
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 27
Another example of presenting regression results in journals
Barbara Pan, Meredith Rowe, Judith Singer and Catherine Snow (2005) Maternal correlates of growth in toddler vocabulary production in low-income families, Child Development, 76(4), 763-782
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 28
What’s the big takeaway from this unit?
• Statistical control is a very powerful tool
 – The ability to statistically control for the effects of some predictors when evaluating the effects of other predictors greatly expands the utility of statistical models
 – It allows you to acknowledge the effects of some predictors and then put all individuals on a “level playing field” that holds those controlled predictors constant
• The pattern of correlations can help presage multiple regression results
 – Learn how to examine a correlation matrix and foreshadow how the predictors will behave in a multiple regression model
 – If you have one (or more) control predictors, consider examining a partial correlation matrix that removes that effect
• Controlled effects can be similar to or different from uncontrolled effects
 – The effects of some predictors will persist upon statistical control while the effects of others will change
 – Be sure to examine how your predictors’ effects change as you fit more complex statistical models
 – Ask yourself whether the observed changes make sense
• Beware of the dangers of multicollinearity
 – Sometimes it isn’t possible to statistically control
 – When your predictors are highly correlated, you may think you’re statistically controlling for the effects of one while you’re evaluating the effects of the other, but this may not be possible
 – But similarly, just because predictors are highly correlated, don’t assume that you’ll have collinearity problems
© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 29
Appendix: Annotated PC-SAS Code for Estimating Partial Correlations
Note that the handouts include only annotations for the needed additional code. For the complete program, check program “Unit 7—Statistical Control in Depth” on the website. Note also that this annotation builds on the knowledge from “Unit 2 – Correlation and Causality”.
proc corr data=one;
  var PeerRat L2Doc GRE ResFund PctDoc;
run;

proc corr data=one;
  partial l2doc;
  var PeerRat GRE ResFund PctDoc;
run;
proc corr estimates simple correlations between the variables specified. Its var statement syntax is var1 var2 var3 … varn.
proc corr can also estimate partial correlations. Use a partial statement to identify the variable(s) being controlled (partialled out).
Glossary terms included in Unit 7
• Correlation
• Cross-sectional data
• Main effects assumption/model
• Multicollinearity
• Statistical control