Lecture 6: Introduction to Linear Regression
24 April 2007
Linear regression: main idea
Linear regression can be used to study an outcome as a linear function of a predictor.
Example: 60 cities in the US were evaluated for numerous characteristics, including:
the percentage of the population that was “disadvantaged”
median education level
Binary education variable
[Figure: % of population with income < $3000 (y-axis, 10 to 30) for the Low Education and High Education groups]
Linear regression vs. ANOVA
These means could be compared by a t-test or ANOVA:
Mean in low education group: 15.7%
Mean in high education group: 13.2%
Regression provides a unified equation:
$\hat{Y}_i = \beta_0 + \beta_1 X_i$
$\hat{Y}_i = 15.7 - 2.5 X_i$
where $X_i$ = 1 for high education, 0 for low education (X is a “dummy variable” or “indicator variable” that designates group)
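Interpreting the model
$\hat{Y}_i = \beta_0 + \beta_1 X_i$
$\hat{Y}_i = 15.7 - 2.5 X_i$
$\hat{Y}_i$ is the predicted mean of the outcome for $X_i$, that observation’s value for X.
$X_i = 0$ (Low education): $\hat{Y}_i = 15.7 - 2.5(0) = 15.7 = \beta_0$
$X_i = 1$ (High education): $\hat{Y}_i = 15.7 - 2.5(1) = 13.2 = \beta_0 + \beta_1$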
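A minimal Python sketch of the point above, assuming only the two group means quoted on the slides: with a single dummy variable, the intercept is the reference-group mean and the slope is the difference in means. The function name `predicted_mean` is made up for this illustration.

```python
# Group means quoted on the slides (% of population that is disadvantaged).
MEAN_LOW = 15.7    # low education, X = 0 (reference group)
MEAN_HIGH = 13.2   # high education, X = 1

# With one dummy variable, the fitted coefficients reduce to:
b0 = MEAN_LOW               # intercept = mean of the reference group
b1 = MEAN_HIGH - MEAN_LOW   # slope = difference in group means


def predicted_mean(x: int) -> float:
    """Predicted outcome for dummy value x (0 = low, 1 = high education)."""
    return b0 + b1 * x


for x, label in [(0, "Low education"), (1, "High education")]:
    print(f"{label}: {predicted_mean(x):.1f}%")
```

This is why the regression and the two-group comparison of means give the same answer here: the equation simply stores the two means as an intercept and an offset.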
Interpretation
$\beta_0$ is the mean outcome for the reference group, or the group for which $X_i = 0$.
Here, $\beta_0$ is the average percent of the population that is disadvantaged for cities with low education.
Interpretation
$\beta_1$ is the difference in the mean outcome between the two groups (when $X_i = 1$ vs. when $X_i = 0$).
Here, $\beta_1$ is the difference in the average percent of the population that is disadvantaged for cities with high education compared to cities with low education.
Why use linear regression?
Linear regression is very powerful. It can be used for many things:
Binary X
Continuous X
Categorical X
Adjustment for confounding
Interaction
Curved relationships between X and Y
Regression Analysis
A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.
Goal: prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.
Regression Analysis
A simple relationship between the two variables is a linear relationship (straight line relationship)
Other names: linear, simple linear, least squares regression
Galton’s Example
1000 records of heights of family groups
Really tall fathers tend on average to have tall sons, but not quite as tall as the really tall fathers
There is a “regression” of a son’s height toward the average height for sons
Galton’s Example
Regression of Son's Stature on Father's
E(Y) = 33.73 + 0.516*X
[Figure: Son's Height (y-axis, 64 to 74) vs. Father's Height in inches (x-axis, 60 to 74), with the fitted line]
Regression Analysis: Population Model
Probability Model: independent responses $y_1, y_2, \ldots, y_n$ are sampled from $Y_i \sim N(\mu_i, \sigma^2)$
Systematic Model: $\mu_i = E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$
where: $\beta_0$ = intercept, $\beta_1$ = slope
Another way to write the model
Systematic: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Probability: $\varepsilon_i \sim N(0, \sigma^2)$
The response, $Y_i$, is a linear function of $X_i$ plus some random, normally distributed error, $\varepsilon_i$
Data = Signal + noise
Geometric Interpretation
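Model
1) $Y_i \sim N(\mu_i, \sigma^2)$
2) $\mu_i = E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$
OR
1) $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
2) $\varepsilon_i \sim N(0, \sigma^2)$
where: $\beta_0$ = intercept, $\beta_1$ = slope
The response, $Y_i$, is a linear function of $X_i$ plus some random, normally distributed error, $\varepsilon_i$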
Interpretation of Coefficients
Mean Model: $\mu = E(y \mid x) = \beta_0 + \beta_1 x$
$\beta_0$ = expected response when X = 0
Since: $E(y \mid x = 0) = \beta_0 + \beta_1(0) = \beta_0$
$\beta_1$ = change in expected response per 1 unit increase in X
Since: $E(y \mid x+1) = \beta_0 + \beta_1(x+1)$
And: $E(y \mid x) = \beta_0 + \beta_1 x$
$\Delta E(y)$ from x to x+1 = $\beta_1$
From Galton’s Example
$E(Y \mid x) = \beta_0 + \beta_1 x$
$E(Y \mid x) = 33.7 + 0.52x$
where: Y = son’s height (inches), x = father’s height (inches)
Expected son’s height = 33.7 inches when father’s height is 0 inches
Expected difference in heights for sons whose fathers’ heights differ by one inch = 0.52 inches
City/Education Example
[Figure: % of population with income < $3000 (y-axis, 10 to 30) vs. median education (x-axis, 9 to 13)]
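Model
$\hat{Y}_i = \beta_0 + \beta_1 X_i$
$\hat{Y}_i = 36.2 - 2.0 X_i$
where $X_i$ = the median education level in city i
when $X_i = 0$: $\hat{Y}_i = 36.2 - 2.0(0) = 36.2 = \beta_0$
when $X_i = 1$: $\hat{Y}_i = 36.2 - 2.0(1) = 34.2 = \beta_0 + \beta_1$
when $X_i = 2$: $\hat{Y}_i = 36.2 - 2.0(2) = 32.2 = \beta_0 + 2\beta_1$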
Interpretation
$\beta_0$ is the mean outcome for the reference group, or the group for which $X_i = 0$.
Here, $\beta_0$ is the average percent of the population that is disadvantaged for cities with median education level of 0.
Interpretation
$\beta_1$ is the difference in the mean outcome for a one unit change in X.
Here, $\beta_1$ is the difference in the average percent of the population that is disadvantaged between two cities, when the first city has a median education level one unit (one year) higher than the second city.
Finding $\beta$’s from the graph
$\beta_0$ is the Y-intercept of the line, or the average value of Y when X=0.
$\beta_1$ is the slope of the line, or the average change in Y per unit change in X.
y = mx + b, with b = $\beta_0$, m = $\beta_1$
$\hat{\beta}_1 = \frac{y_1 - y_2}{x_1 - x_2}$
Notation:
$\beta_1$ represents the true slope (in the population)
$b_1$ and $\hat{\beta}_1$ are sample estimates of the slope
Where is our intercept?
[Figure: the same plot with extended axes: % of population with income < $3000 (y-axis, 10 to 60) vs. median education (x-axis, 0 to 14)]
Centering
$\beta_0$ makes no sense here: X = 0 lies far outside the range of the data.
We can change X to fix this problem by a process called centering:
1. Pick a value of X (c) within the range of the data
2. For each observation, generate X_centered = $X_i - c$
3. Redo the regression with X_centered
We’ll use c=12, a high school degree
[Figure: % of population with income < $3000 (y-axis, 10 to 30) vs. median education (x-axis, 9 to 13)]
New equation
$\hat{Y}_i = \beta_0 + \beta_1 (X_i - 12)$
$\hat{Y}_i = 12.2 - 2.0 (X_i - 12)$
$\beta_1$ has not changed
$\beta_0$ now corresponds to X=12, not X=0
Note: with X=0, we have $\hat{Y}_i = 12.2 - 2.0(0 - 12) = 12.2 + 24 = 36.2$
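Why the slope is unchanged can be seen by rearranging the uncentered line. The following is a short worked derivation; the starred symbol $\beta_0^{*}$ is notation introduced here for the centered intercept (the slides reuse $\beta_0$ for it).

```latex
\begin{aligned}
\hat{Y}_i &= \beta_0 + \beta_1 X_i \\
          &= (\beta_0 + \beta_1 c) + \beta_1 (X_i - c) \\
          &= \beta_0^{*} + \beta_1 (X_i - c),
          \qquad \beta_0^{*} = \beta_0 + \beta_1 c \\[4pt]
\text{With } c = 12:\quad \beta_0^{*} &= 36.2 + (-2.0)(12) = 12.2
\end{aligned}
```

Only the intercept absorbs the shift; the coefficient multiplying X is the same $\beta_1$ in both forms.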
Interpretation
$\beta_0$ is the mean outcome for the reference group, or the group for which $X_i - 12 = 0$, or when $X_i = 12$.
Here, $\beta_0$ (12.2%) is the average percent of the population that is disadvantaged for cities with a median education level of 12, the equivalent of a high school degree.
The interpretation of $\beta_1$ has not changed.
Centering in Galton Example
Make 6 feet (72 inch) fathers the ‘reference group’
Create a new X variable, X*, by subtracting 72 from our old X variable: X* = X - 72
Then: $E(Y \mid x^{*}) = \beta_0 + \beta_1 x^{*} = \beta_0 + \beta_1 (x - 72)$
So, $\beta_0$ = expected response when X = 72, since $E(Y \mid x = 72) = \beta_0 + \beta_1 (72 - 72) = \beta_0$
Center X’s whenever interpretations call for it!
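A small Python sketch of this centering step, assuming the rounded Galton estimates quoted earlier (33.7 and 0.52). The centered intercept is not given on the slides; it is computed here from those rounded values, and the function names are made up for the illustration.

```python
B0, B1 = 33.7, 0.52   # rounded uncentered estimates from the slides
C = 72                # centering value: a 6-foot father, in inches


def son_height(father_inches: float) -> float:
    """Expected son's height from the uncentered equation."""
    return B0 + B1 * father_inches


# Centering moves the intercept to the expected response at X = C.
b0_centered = B0 + B1 * C


def son_height_centered(father_inches: float) -> float:
    """Same prediction, written in terms of X* = X - C."""
    return b0_centered + B1 * (father_inches - C)


print(f"Centered intercept: {b0_centered:.2f} inches")
for father in (60, 66, 72, 74):
    print(father, round(son_height(father), 2), round(son_height_centered(father), 2))
```

Both forms give identical predictions; the centered intercept (about 71.1 inches under these rounded values) is directly interpretable as the expected height of sons of 72-inch fathers.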
Population Comparisons
$\beta_0$: changes depending on centering of X, which doesn’t affect the association of interest
Real concern: is X associated with Y?
Assess by testing $\beta_1$: does $\beta_1 = 0$ in the population from which this sample was drawn?
Hypothesis testing
Confidence interval
Hypothesis testing
$H_0$: $\beta_1 = 0$
Test statistic: $t_{obs} = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$
df = n-k-1
n = number of observations
k = number of predictors (X’s)
Hypothesis testing for education example
$H_0$: $\beta_1 = 0$
Test statistic: $t_{obs} = \frac{-2.0 - 0}{0.59} = -3.36$
df = n-k-1 = 60-1-1 = 58
n = number of observations = 60
k = number of predictors (X’s) = 1
p < 2*(1-0.995)
p < 0.01
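The test above can be sketched in a few lines of Python. This uses the rounded slope (-2.0) and standard error (0.59) shown on the slides, so the statistic may differ in the second decimal from one computed with unrounded estimates. The critical value 2.704 is an assumption of this sketch: it is the t-table entry for the 0.995 quantile at 40 degrees of freedom, a conservative table row for df = 58.

```python
def t_statistic(estimate: float, std_error: float, null_value: float = 0.0) -> float:
    """Observed t: distance of the estimate from the null value, in standard errors."""
    return (estimate - null_value) / std_error


n, k = 60, 1            # observations and predictors in the education example
df = n - k - 1          # degrees of freedom
t_obs = t_statistic(-2.0, 0.59)

# Assumed table value: 0.995 quantile of t with 40 df (conservative for df = 58).
T_CRIT_995 = 2.704

print(f"df = {df}, t_obs = {t_obs:.2f}")
if abs(t_obs) > T_CRIT_995:
    print("two-sided p < 2 * (1 - 0.995) = 0.01: reject H0")
else:
    print("two-sided p >= 0.01 at this table row")
```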
Interpretation and conclusion
If there were no association between median education and percentage of disadvantaged citizens in the population, there would be less than a 1% chance of observing data as or more extreme than ours.
The null probability is very small, so:
reject the null hypothesis
conclude that median education level and percentage of disadvantaged citizens are associated in the population
Confidence Interval
No need to specify a hypothesis:
$\hat{\beta}_1 \pm t_{cr} \cdot SE(\hat{\beta}_1) = -2.0 \pm 2.021\,(0.59) = (-3.2, -0.8)$
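As a hedged illustration, the interval can be reproduced in Python from the rounded slope (-2.0) and standard error (0.59). The critical value 2.021 is assumed here to be the two-sided 95% t-table entry at 40 degrees of freedom; the helper name `confidence_interval` is made up for this sketch.

```python
def confidence_interval(estimate: float, std_error: float, t_crit: float) -> tuple:
    """Return (lower, upper) = estimate -/+ t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


lower, upper = confidence_interval(estimate=-2.0, std_error=0.59, t_crit=2.021)
print(f"95% CI for the slope: ({lower:.1f}, {upper:.1f})")

# The interval excludes 0 exactly when the two-sided test rejects at the 5% level.
print("contains 0:", lower <= 0.0 <= upper)
```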
Interpretation and conclusion
We are 95% confident that the true population decrease in percentage of disadvantaged citizens per additional year of median education is between 0.8 and 3.2 percentage points.
Since this interval does not contain 0, we believe percentage of disadvantaged citizens and median education are associated among cities in the United States.
So far…
Linear regression is used for continuous outcome variables
$\beta_0$: mean outcome when X=0
Binary X = “dummy variable” for group
$\beta_1$: mean difference in outcome between groups
Continuous X
$\beta_1$: mean difference in outcome corresponding to a 1-unit increase in X
Center X to give meaning to $\beta_0$
Test $\beta_1 = 0$ in the population
Linear Regression: Multiple covariates and confounding
Dataset
Hourly wage information from 9,918 workers, along with information regarding age, gender, years of experience, etc.
We’ll focus on predicting hourly wage with available information.
Regression: Hourly wage vs. Years of experience
[Figure: Hourly Wage (y-axis, 0 to 50) vs. Years of Experience (x-axis, 0 to 60)]
What are the parameters?
For each person, their actual hourly wage ($Y_i$) and predicted hourly wage ($\hat{Y}_i$) are known.
$\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (\beta_0 + \beta_1 X_i)$ is the residual or error
The parameters are found by minimizing the sum of the squared errors:
$\min \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$
The parameters are the “least squares” estimates
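To make the minimization concrete, here is a minimal pure-Python sketch using the standard closed-form solution for one predictor. The data are hypothetical toy values, not the wage dataset, and the function names are illustrative.

```python
from statistics import mean


def least_squares(x, y):
    """Closed-form least squares estimates (b0, b1) for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(b0, b1, x, y):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data: years of experience and hourly wage.
x = [0, 5, 10, 20, 30]
y = [8.0, 9.0, 8.5, 10.0, 9.5]

b0, b1 = least_squares(x, y)
best = sum_squared_errors(b0, b1, x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, SSE = {best:.4f}")

# Any other line has a sum of squared errors at least as large.
for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.01), (0.0, -0.01)]:
    assert sum_squared_errors(b0 + d0, b1 + d1, x, y) >= best
```

Nudging either coefficient away from the least squares values can only increase the sum of squared errors, which is what "least squares" means.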
Notes
$\hat{Y}_i = \beta_0 + \beta_1 X_i$ for any known point on the line
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ is always true
The regression line equation: $\hat{Y} = \beta_0 + \beta_1 X$
Model 1
Model 1: Predict income by years of experience
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = 8.38 + 0.04 X_i$
$\hat{\beta}_0 = 8.38$, so the average hourly wage for someone with no experience at all is about $8.40.
$\hat{\beta}_1 = 0.04$, so for every additional year of experience, the predicted hourly wage increases about 4 cents.
For 10 years of additional experience, the predicted hourly wage increases about 40 cents.
Should we center X?
0 years of experience is within the range of the data
The average hourly wage corresponding to 0 years of experience makes sense
No need to center X
What happens if we also consider gender? (Model 2)
[Figure: Hourly Wage (y-axis, 0 to 50) vs. Years of Experience (x-axis, 0 to 60); legend: Men's hourly wage, Women's hourly wage, fit2_men, fit2_women]
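Model 2: Gender effect, no experience
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Gender}_i)$
$\hat{Y}_i = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i)$
(As the worked values show, Gender = 0 for men and 1 for women.)
For a man with no experience: $\hat{Y}_i = 9.27 + 0.04(0) - 2.20(0) = \$9.27 = \hat{\beta}_0$
For a woman with no experience: $\hat{Y}_i = 9.27 + 0.04(0) - 2.20(1) = \$7.07 = \hat{\beta}_0 + \hat{\beta}_2$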
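Model 2: Gender effect, 10 years experience
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Gender}_i)$
$\hat{Y}_i = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i)$
For a man with 10 years of experience: $\hat{Y}_i = 9.27 + 0.04(10) - 2.20(0) = \$9.67 = \hat{\beta}_0 + \hat{\beta}_1(10)$
For a woman with 10 years of experience: $\hat{Y}_i = 9.27 + 0.04(10) - 2.20(1) = \$7.47 = \hat{\beta}_0 + \hat{\beta}_1(10) + \hat{\beta}_2(1)$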
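The four Model 2 predictions can be tabulated with a short Python sketch. It assumes the coefficients shown on the slides and the coding Gender = 1 for women, 0 for men (inferred from the worked values); the function name is illustrative.

```python
def predict_wage_model2(experience_years: float, is_woman: int) -> float:
    """Predicted hourly wage under Model 2 (is_woman: 0 = man, 1 = woman)."""
    return 9.27 + 0.04 * experience_years - 2.20 * is_woman


for years in (0, 10):
    man = predict_wage_model2(years, 0)
    woman = predict_wage_model2(years, 1)
    # The gap is the gender coefficient at every experience level (parallel lines).
    print(f"{years:>2} years: man ${man:.2f}, woman ${woman:.2f}, gap ${woman - man:.2f}")
```

Because the model has no interaction term, the men's and women's lines share the same experience slope and differ only by a constant vertical shift.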
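Model 2: Experience effect, males
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Gender}_i)$
$\hat{Y}_i = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i)$
For a man with no experience: $\hat{Y}_i = 9.27 + 0.04(0) - 2.20(0) = \$9.27 = \hat{\beta}_0$
For a man with 10 years of experience: $\hat{Y}_i = 9.27 + 0.04(10) - 2.20(0) = \$9.67 = \hat{\beta}_0 + \hat{\beta}_1(10)$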
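Model 2: Experience effect, females
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Gender}_i)$
$\hat{Y}_i = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i)$
For a woman with no experience: $\hat{Y}_i = 9.27 + 0.04(0) - 2.20(1) = \$7.07 = \hat{\beta}_0 + \hat{\beta}_2$
For a woman with 10 years of experience: $\hat{Y}_i = 9.27 + 0.04(10) - 2.20(1) = \$7.47 = \hat{\beta}_0 + \hat{\beta}_1(10) + \hat{\beta}_2$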
Interpretation: Model 2
$\hat{\beta}_0 = 9.27$: the average hourly wage for a man with no experience at all is about $9.30.
$\hat{\beta}_1 = 0.04$: for every additional year of experience, the predicted hourly wage increases about 4 cents for both men and women.
$\hat{\beta}_2 = -2.20$: the expected hourly wage is $2.20 lower for women than it is for men at any experience level.
Model 1 vs. Model 2
Model 1: $\hat{Y}_i = 8.38 + 0.04(\text{Experience}_i)$
Model 2: $\hat{Y}_i = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i)$
95% CI for $\beta_1$ in Model 1: (0.001, 0.07), and $\hat{\beta}_1$ from Model 2 is within this CI
Gender is not a confounder
What happens if we consider age, instead? (Model 3)
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Age}_i - 40)$
The relationship is harder to graph with two continuous predictors, since now the regression is in a 3-dimensional space.
Notice that age is centered at 40 years. Age ranged between 18 and 64 in this dataset.
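Model 3: Age effect, no experience
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Age}_i - 40)$
$\hat{Y}_i = 26.5 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40)$
For a 40-year-old with no experience: $\hat{Y}_i = 26.5 - 0.82(0) + 0.92(40 - 40) = \$26.50 = \hat{\beta}_0$
For a 41-year-old with no experience: $\hat{Y}_i = 26.5 - 0.82(0) + 0.92(41 - 40) = \$27.42 = \hat{\beta}_0 + \hat{\beta}_2$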
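Model 3: Age effect, 10 years experience
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Age}_i - 40)$
$\hat{Y}_i = 26.5 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40)$
For a 40-year-old with 10 years of experience: $\hat{Y}_i = 26.5 - 0.82(10) + 0.92(40 - 40) = \$18.30 = \hat{\beta}_0 + 10\hat{\beta}_1$
For a 41-year-old with 10 years of experience: $\hat{Y}_i = 26.5 - 0.82(10) + 0.92(41 - 40) = \$19.22 = \hat{\beta}_0 + 10\hat{\beta}_1 + 1\hat{\beta}_2$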
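A brief Python sketch of Model 3, assuming the coefficients shown on the slides (26.5, -0.82, 0.92) and age centered at 40. The function name is illustrative. It reproduces the four worked predictions and shows what "adjusting for" means: one predictor changes while the other is held fixed.

```python
AGE_CENTER = 40  # age is centered at 40 years in Model 3


def predict_wage_model3(experience_years: float, age_years: float) -> float:
    """Predicted hourly wage under Model 3."""
    return 26.5 - 0.82 * experience_years + 0.92 * (age_years - AGE_CENTER)


for experience in (0, 10):
    for age in (40, 41):
        wage = predict_wage_model3(experience, age)
        print(f"experience={experience:>2}, age={age}: ${wage:.2f}")

# One more year of age at fixed experience, and one more year of
# experience at fixed age, recover the two slope coefficients.
age_effect = predict_wage_model3(10, 41) - predict_wage_model3(10, 40)
experience_effect = predict_wage_model3(11, 40) - predict_wage_model3(10, 40)
print(f"age effect: {age_effect:.2f}, experience effect: {experience_effect:.2f}")
```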
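Model 3: Experience effect, 40 year old
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}_i) + \hat{\beta}_2(\text{Age}_i - 40)$
$\hat{Y}_i = 26.5 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40)$
For a 40-year-old with no experience: $\hat{Y}_i = 26.5 - 0.82(0) + 0.92(40 - 40) = \$26.50 = \hat{\beta}_0$
For a 40-year-old with 10 years of experience: $\hat{Y}_i = 26.5 - 0.82(10) + 0.92(40 - 40) = \$18.30 = \hat{\beta}_0 + 10\hat{\beta}_1$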
Model 3: Experience effect,41 year old
For a 41-year-old with no experience:
For a 41-year-old with 10 years of experience:
40)-Age(0.92)Experience(82.05.6240)-Age(ˆ)Experience(ˆˆY
ii
i2i10i
20
i
ˆˆ42.72$)4041(0.920)(82.05.26Y
1ˆ10ˆˆ22.19$)4041(0.920)1(82.05.26Y
210
i
Interpretation: Model 3
$\hat{\beta}_0 = 26.5$: the average hourly wage for a 40-year-old with no experience at all is about $26.50
$\hat{\beta}_1 = -0.82$: for every additional year of experience, the predicted hourly wage decreases about 82 cents for two people of the same age (or “adjusting for age”)
$\hat{\beta}_2 = 0.92$: for every additional year of age, the expected hourly wage increases about 92 cents for two people with the same amount of experience (or “adjusting for experience”)
Model 1 vs. Model 3
Model 1: $\hat{Y}_i = 8.38 + 0.04(\text{Experience}_i)$
Model 3: $\hat{Y}_i = 26.5 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40)$
95% CI for $\beta_1$ in Model 1: (0.001, 0.07), and $\hat{\beta}_1$ from Model 3 is outside this CI
Age is a confounder. When we adjust for age, the apparent effect of experience on wage changes.
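The comparison rule used in these slides (does the adjusted slope fall inside the unadjusted model's confidence interval?) can be written as a short Python sketch. It encodes only this lecture's rule of thumb, with the interval and slopes taken from the slides; the helper name is made up.

```python
def within_interval(value: float, interval: tuple) -> bool:
    """True if value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high


CI_MODEL1_SLOPE = (0.001, 0.07)   # 95% CI for the experience slope in Model 1

adjusted_slopes = {
    "gender (Model 2)": 0.04,
    "age (Model 3)": -0.82,
}

for added, slope in adjusted_slopes.items():
    verdict = "not a confounder" if within_interval(slope, CI_MODEL1_SLOPE) else "a confounder"
    print(f"Adding {added}: experience slope = {slope:+.2f}, so it is treated as {verdict}")
```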
The Coefficient of Determination
$R^2$ is the “coefficient of determination”
$R^2$ measures the ability to predict Y using X
Variability explained by X is SSM = $\sum_i (\hat{y}_i - \bar{y})^2$
Total variability is SST = $\sum_i (y_i - \bar{y})^2$
The Coefficient of Determination
$R^2$ is defined as
$R^2 = \frac{SSM}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$
Measures the proportion of total variability explained by the model
The Coefficient of Determination
In simple linear regression (one X), $R^2$ is the square of r, “Pearson’s correlation coefficient”
r is a rough way of evaluating the association between two continuous variables.
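A minimal Python sketch of these definitions, using hypothetical toy data chosen so that the least squares line is exactly 1 + 0.5x. It computes SSM, SST and their ratio, then checks that the ratio matches the squared Pearson correlation in this one-predictor case.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data; the least squares fit for these points is y_hat = 1 + 0.5 * x.
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0]
y_hat = [1.0 + 0.5 * xi for xi in x]

y_bar = mean(y)
ssm = sum((fi - y_bar) ** 2 for fi in y_hat)   # variability explained by the model
sst = sum((yi - y_bar) ** 2 for yi in y)       # total variability
r_squared = ssm / sst

# Pearson correlation between x and y, computed from its definition.
x_bar = mean(x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
r = sxy / sqrt(sxx * sst)

print(f"SSM = {ssm}, SST = {sst}, R^2 = {r_squared}, r = {r}, r^2 = {r * r}")
```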
So, what is $R^2$?
The coefficient of determination, $R^2$, evaluates the entire model.
$R^2$ shows the proportion of the total variation in Y that has been predicted by this model.
Model 1: 0.0076; 0.8% of variation explained
Model 2: 0.05; 5% of variation explained
Model 3: 0.20; 20% of variation explained
What is the adjusted $R^2$?
In both models 2 and 3, the new predictor added a great deal to the model
$R^2$ increased a lot
More importantly, both new predictors were statistically significant
$R^2$ never goes down when a predictor is added, even an unhelpful one!
The adjusted $R^2$ is adjusted for the number of X’s in the model, so it only goes up when helpful predictors are added.
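The slides do not spell out the adjustment. The sketch below assumes the usual textbook definition, adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$, and applies it to the three reported $R^2$ values with n = 9,918 workers. The predictor counts k are read off the model descriptions.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k predictors (usual textbook formula)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


N_WORKERS = 9918

# (reported R^2, number of predictors) for each model on the slides.
models = {
    "Model 1": (0.0076, 1),
    "Model 2": (0.05, 2),
    "Model 3": (0.20, 2),
}

for name, (r2, k) in models.items():
    adj = adjusted_r_squared(r2, N_WORKERS, k)
    print(f"{name}: R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

With a sample this large the penalty is tiny, so the adjusted values sit just below the raw ones; the adjustment matters more when n is small relative to the number of predictors.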
Summary
Regression by least squares
Interpreting regression coefficients
Adding a 2nd predictor to a model
Binary X added: 2 parallel lines
Continuous X added: 3-dimensional graph
for both, new interpretation reflecting new model
Is the new X a confounder?
Compare $\beta_1$ across models