Instrumental Variables
Isaac Mbiti, University of Virginia
Basics
• Goal: we want to estimate the returns to education (for example).
• We collect data on wages/earnings, years of schooling, and other individual-level data.
• Outline of lecture
o Basic OLS (ordinary least squares regression): we look at the basic relationship between wages and schooling
o Problems with OLS: what problems do we face using OLS?
o How to address these problems with IV (instrumental variables) methods
Basics
• Consider estimating the returns to education:
$Y_i = \alpha + \rho S_i + \nu_i$
• Y = wages/earnings
• S = years of schooling
• ρ = returns to schooling, the coefficient of interest
• ν is the error term (often denoted with e)
• For clarity we will focus on simple bivariate regression
o Extends to the multivariate regression case
Basics- OLS regression
[Figure: scatter plot of earnings (vertical axis) against yrs_schooling (horizontal axis), with fitted values.]
Basics- OLS regression
[Figure: scatter plot of earnings against yrs_schooling, with fitted values.]
• What command do you use in Stata to do a scatter plot?
• What can I learn from looking at this scatter plot?
• The “error” is the difference between the prediction and the actual data point
Basics
• OLS (ordinary least squares) finds the “best fit line”
o Minimizes the sum of squared errors
$Y_i = \alpha + \rho S_i + \nu_i$
• Which letters denote the slope and intercept? ρ is the _____? And α is the _______
• In Stata we use the command:
regress {dep_var} {independent}, [options]
• Other examples:
regress wages yrs_schooling
regress birth_weight mothers_smoking, robust
regress test_score textbooks, cluster(classroomid)
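To make "best fit line" concrete, here is a minimal sketch in Python of what the regression computes in the bivariate case: the slope is Cov(S,Y)/V(S) and the intercept makes the line pass through the means. The five data points are hypothetical toy values chosen for easy arithmetic, not data from the lecture, and the function name `ols_fit` is made up for this illustration.

```python
def ols_fit(s, y):
    """Return (alpha, rho) for the best-fit line y = alpha + rho * s."""
    n = len(s)
    s_bar = sum(s) / n
    y_bar = sum(y) / n
    cov_sy = sum((si - s_bar) * (yi - y_bar) for si, yi in zip(s, y))
    var_s = sum((si - s_bar) ** 2 for si in s)
    rho = cov_sy / var_s            # slope
    alpha = y_bar - rho * s_bar     # intercept
    return alpha, rho


# Hypothetical toy data (not from the lecture).
schooling = [8, 10, 12, 14, 16]
earnings = [30, 45, 50, 65, 70]

alpha, rho = ols_fit(schooling, earnings)

# Sum of squared errors at the fitted line; any other line gives a larger value.
sse = sum((y - (alpha + rho * s)) ** 2 for s, y in zip(schooling, earnings))
sse_other = sum((y - (alpha + (rho + 0.5) * s)) ** 2 for s, y in zip(schooling, earnings))
print(alpha, rho, sse, sse_other)
```

Changing the slope or intercept away from the fitted values can only raise the sum of squared errors, which is the sense in which the line is the best fit.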
Basics- OLS regression
. regress earnin yrs_schooling

       Source |       SS       df       MS          Number of obs =     100
              |                                     F(  1,    98) =  255.24
        Model |  7459.44217     1  7459.44217       Prob > F      =  0.0000
     Residual |  2864.05109    98  29.2250111       R-squared     =  0.7226
              |                                     Adj R-squared =  0.7197
        Total |  10323.4933    99   104.27771       Root MSE      =   5.406

     earnings |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
yrs_schooling |  4.331217   .2711029   15.98   0.000     3.793222   4.869212
        _cons |  2.729021   3.002237    0.91   0.366    -3.228821   8.686862

How do we read this regression output table?
Basics- Assumptions of OLS
• Assumption 1: Linearity
• Each person’s earnings (Yi) are a linear function of their education (Si), plus an individual-specific error term, νi
• νi may be called the error term, the residual, or the deviation. It is a random variable, meaning that the value that any individual gets is a random draw from a distribution. I think of νi as representing the “luck” factor.
• For a given level of schooling (S), not everyone makes the same Y because some are lucky and make more and others make less.
Basics- Assumptions of OLS
• Assumption 2: Error term has zero mean, E(νi) = 0
• This means that the positive errors and the negative errors cancel out, so that on average, the error is zero.
Basics- Assumptions of OLS
• Assumption 3: Homoskedasticity
• In math: var(νi | S) is constant
• Intuitively this means that the variance of the error term does not depend on S (our independent variable)
Basics- Assumptions of OLS
• Assumption 3: Homoskedasticity
• An example of a violation of this assumption is if our data are clustered
• Dataset 1: 100 standard 5 students, each from a different school
• Dataset 2: 100 standard 5 students, 10 students each from 10 schools
• Suppose you do the following in Stata with both datasets:
regress test_scores textbooks
• Which analysis would be problematic?
Basics- Assumptions of OLS
• Assumption 3: Homoskedasticity
• Violations of this assumption do not affect the slope coefficients
• BUT: they affect the standard errors of the coefficients
• This means our t-statistics and confidence intervals will be wrong
• SO we could give bad policy advice!
• (e.g. we say we should implement a program when we really shouldn’t)
• So we fix this by clustering or using robust standard errors
• In Stata:
regress birth_weight mothers_smoking, robust
regress test_score textbooks, cluster(classroomid)
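The point that heteroskedasticity leaves the slope alone but changes its standard error can be checked on simulated data. The sketch below is an illustration under stated assumptions, not the lecture's own data: the data-generating process (error spread growing with the distance of S from 10) is hypothetical, and the robust formula shown is the simplest White (HC0) version; statistical packages typically add a small-sample correction, so their numbers will differ slightly.

```python
import math
import random

random.seed(0)
n = 2000

# Hypothetical data: the error spread grows with the distance of s from 10,
# so var(v | S) is not constant (homoskedasticity fails).
s = [random.uniform(0, 20) for _ in range(n)]
y = [2 + 3.7 * si + random.gauss(0, 1) * abs(si - 10) for si in s]

s_bar = sum(s) / n
y_bar = sum(y) / n
sxx = sum((si - s_bar) ** 2 for si in s)
slope = sum((si - s_bar) * (yi - y_bar) for si, yi in zip(s, y)) / sxx
intercept = y_bar - slope * s_bar
resid = [yi - intercept - slope * si for si, yi in zip(s, y)]

# Classical standard error: assumes one common error variance.
classical_se = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

# Heteroskedasticity-robust (HC0) standard error: each squared residual
# is weighted by that observation's squared deviation of s from its mean.
robust_se = math.sqrt(sum(((si - s_bar) * e) ** 2 for si, e in zip(s, resid))) / sxx

print(slope, classical_se, robust_se)
```

In this setup the slope stays close to 3.7 while the classical standard error understates the uncertainty, which is exactly the case where the t-statistics and confidence intervals mislead.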
Basics- OLS regression
[Regression output from ". regress earnin yrs_schooling", repeated from the earlier slide.]
If we violate homoskedasticity, the standard errors, t-statistics, p-values and confidence intervals will be wrong!
Basics- Assumptions of OLS
• Assumption 4: Cov(S,νi) = 0
• Recall our regression:
$Y_i = \alpha + \rho S_i + \nu_i$
• What does this assumption mean?
• No relationship between schooling and the error term
• Recall we are thinking of the error term as “luck”.
• So the assumption states that “lucky people” have similar education to “unlucky people”
Basics- Assumptions of OLS
• Assumption 4: Cov(S,νi) = 0
• Use a simulated data set where I violate this assumption and I know the true relationship between variables.
• I.e. I am going to create a data set where the following relationship is true (what is the intercept? What is the slope?):
$Y_i = 2 + 3.7 S_i + \nu_i$
• I also create the data such that Cov(S,νi) ≠ 0
Basics- Assumptions of OLS
• Assumption 4: Cov(S,νi) = 0
• Use a simulated data set where I violate this assumption.
• What does this look like?
[Figure: scatter plot of v (“luck”) against yrs_schooling, with fitted values.]
Very important note: I can only create this graph because I am using simulated data. You cannot create such a graph with regular data.
So what?
• Let’s run the OLS regression on the simulated data.
$Y_i = \alpha + \rho S_i + \nu_i$
. regress earnings yrs_school, robust

Linear regression                                   Number of obs =    1000
                                                    F(  1,   998) = 2793.86
                                                    Prob > F      =  0.0000
                                                    R-squared     =  0.7447
                                                    Root MSE      =  5.7537

              |              Robust
     earnings |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
yrs_schooling |  4.291932    .081199   52.86   0.000     4.132591   4.451272
        _cons | -4.316475   .9084094   -4.75   0.000    -6.099087  -2.533864

Did my regression give me the right answer?
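A simulation of this kind is easy to reproduce in spirit. The sketch below is not the lecture's actual simulation: only the true intercept (2) and slope (3.7) come from the slides, while the way schooling is made to depend on "luck" and all the variances are assumptions chosen for illustration.

```python
import random

random.seed(1)
n = 5000
TRUE_ALPHA, TRUE_RHO = 2.0, 3.7   # the lecture's true intercept and slope

s, y = [], []
for _ in range(n):
    luck = random.gauss(0, 3)                      # v_i, the error term
    # Assumed violation: lucky people also get more schooling, so Cov(S, v) > 0.
    si = 12 + 0.5 * luck + random.gauss(0, 2)
    s.append(si)
    y.append(TRUE_ALPHA + TRUE_RHO * si + luck)

s_bar = sum(s) / n
y_bar = sum(y) / n
ols_rho = (sum((a - s_bar) * (b - y_bar) for a, b in zip(s, y))
           / sum((a - s_bar) ** 2 for a in s))
ols_alpha = y_bar - ols_rho * s_bar
print(ols_alpha, ols_rho)
```

Because schooling and luck move together, OLS attributes part of the payoff to luck to schooling, so the estimated slope lands above 3.7 and the intercept below 2, the same pattern as in the regression output above.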
OLS Assumptions: When All is OK
[Diagram: nodes S, Y and V; S and V each affect Y, with no link between S and V.]
Omitted Variables
• Often S and V are actually correlated because of omitted variables
• An omitted variable means Cov(S,νi) ≠ 0
• From our simulation, what problem will this cause in OLS?
• Recall previous lectures:
• We want to know the impact of a program
• But it’s really hard because people who enroll in programs are more motivated (for example).
• This is an example of an omitted variable problem
• ALL methods we discussed are trying to address this problem
• RCT, Diff in Diff, RD and now we look at one more… IV (instrumental variables)
IV basics
• Often S and V are actually correlated because of omitted variables
• An omitted variable means Cov(S,νi) ≠ 0
• Classic example of omitted variable problem:
o unobserved ability (which is correlated with schooling)
$Y_i = \alpha + \rho S_i + A_i + v_i'$
• In this case Cov(S, V) > 0, i.e. higher ability people get more schooling
IV basics
• Could we solve the problem by controlling for A?
• Yes BUT ONLY IF
o WE CAN MEASURE A properly (unlikely in practice)
o A is the only omitted variable (hard to argue and very unlikely)
o SO adding lots of variables to the regression is not sufficient.
• For exposition purposes let us suppose A is the only omitted variable
• If we can’t measure A then we have a problem
• Simply estimating this regression would lead to overestimates of ρ
• Why would we be overestimating?
Omitted Variables
[Diagram: nodes S, Y and V.]
• We can’t measure A, so A is part of V
• V and S are correlated and BOTH affect Y
• Is S driving Y or is it V (including A)?
• Could Y be driven solely by V (which includes A)?
Omitted Variable Bias- Just trust me on this
• By knowing (or hypothesizing) the direction of the relationships between S and V and between Y and V, we can figure out whether the omitted variable problem will lead OLS to overestimate or underestimate the true relationship between Y and S
• If S and V are positively correlated & Y and V are positively correlated, OLS will overestimate the true relationship
• If S and V are negatively correlated & Y and V are negatively correlated, OLS will overestimate the true relationship
• If S and V are positively correlated & Y and V are negatively correlated, OLS will underestimate the true relationship
• If S and V are negatively correlated & Y and V are positively correlated, OLS will underestimate the true relationship
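The four sign rules can be read off one formula. As a sketch, write the omitted factor as A with a coefficient γ on earnings (γ is notation introduced here for illustration only; the lecture's own equation sets it to 1) and assume the remaining error v' is uncorrelated with S:

$$
Y_i = \alpha + \rho S_i + \gamma A_i + v_i'
$$

$$
\hat{\rho}_{OLS} \;\to\; \frac{Cov(Y_i, S_i)}{V(S_i)} = \rho + \gamma \, \frac{Cov(S_i, A_i)}{V(S_i)}
$$

The bias term is the product of two signs: the sign of γ (how the omitted factor relates to Y) and the sign of Cov(S, A) (how it relates to S). Same signs give a positive product, so OLS overestimates; opposite signs give a negative product, so OLS underestimates.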
Let’s Start Simple..
$Y_i = \alpha + \rho S_i + A_i + v_i'$
• We can solve the problem using an instrumental variable (z) which is correlated with S but not with A or v
• The assumption that z is uncorrelated with A or v is called the exclusion restriction.
IV intuition: The Exclusion Restriction
[Diagram: nodes Z, S, Y and V.]
IV intuition: The Exclusion Restriction
[Diagram: nodes Z, S, Y and V, with one link marked “???”.]
How Do We Estimate ρ?
$\rho = \frac{Cov(Y_i, Z_i)}{Cov(S_i, Z_i)} = \frac{Cov(Y_i, Z_i)/V(Z_i)}{Cov(S_i, Z_i)/V(Z_i)}$
• Note: with a simple regression, e.g. reg y on x, the OLS estimate is Cov(Y,X)/V(X)
• So the denominator is the OLS regression of schooling on our instrument Z
o We call this the FIRST STAGE
o Does Z predict schooling?
• The first stage coefficient can’t be zero! That is, the instrument has to have some predictive power
How Do We Estimate ρ?
$\rho = \frac{Cov(Y_i, Z_i)}{Cov(S_i, Z_i)} = \frac{Cov(Y_i, Z_i)/V(Z_i)}{Cov(S_i, Z_i)/V(Z_i)}$
• Note: with a simple regression, e.g. reg y on x, the OLS estimate is Cov(Y,X)/V(X)
• The numerator is the OLS regression of earnings on our instrument (Z)
• We call this the reduced form relationship
o (SIMILAR TO INTENT TO TREAT)
• So IV works by taking the coefficient from the reduced form relationship and dividing by the coefficient from the first stage
• It’s the ratio of the effect of Z on earnings to the effect of Z on schooling
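The ratio can be tried out on simulated data where the truth is known. This is a sketch under stated assumptions: the true slope 3.7 is taken from the lecture's simulated model, but the instrument, its strength, and all variances are hypothetical choices made for illustration.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists (cov(a, a) is the variance)."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / n


random.seed(2)
n = 20000
z, s, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)        # instrument: drawn independently of luck
    luck = random.gauss(0, 3)      # error term v_i
    si = 12 + 1.0 * zi + 0.5 * luck + random.gauss(0, 1)   # S depends on Z and on luck
    z.append(zi)
    s.append(si)
    y.append(2 + 3.7 * si + luck)  # Z affects Y only through S

first_stage = cov(s, z) / cov(z, z)     # effect of Z on schooling
reduced_form = cov(y, z) / cov(z, z)    # effect of Z on earnings
iv_rho = reduced_form / first_stage
ols_rho = cov(y, s) / cov(s, s)         # biased because Cov(S, v) > 0
print(first_stage, reduced_form, iv_rho, ols_rho)
```

Here OLS overshoots the true 3.7 while the IV ratio lands close to it, because Z moves schooling without moving luck.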
Where do we get instruments from?
• Economic theory
• Natural/policy experiments
o E.g. Duflo (2001) examines the impact of rapid school construction on education and wages.
o Takeaway: sometimes the variation used in diff-in-diff can support an IV estimation strategy
Where do we get instruments from?
• Natural experiments:
o Angrist and Krueger (1991) use quarter of birth + compulsory schooling laws in the US as an instrument for years of schooling
o How/why does this work?
• The school year starts Sept 1. You have to be age 6 by that date. Now compare someone born Aug 30 to someone born Sept 3: very similar in age, but the older kid meets the criterion and the younger one has to wait until next year.
• School leaving laws say you have to be in school until age 16. Suppose both leave at 16: who will have more schooling?
• Is this a valid instrument?
Exercise
• Want to examine the effect of owning a personal computer (PC) on GPA in college. PC is a dummy variable.
$GPA_i = \beta_0 + \beta_1 PC_i + u_i$
1. Why might PC be correlated with u?
2. Explain why PC is likely to be related to parental income. Does this mean parental income is a good IV for PC? Why or why not?
3. Suppose the university randomly gave grants for PC purchasing to some students. How could you use this to construct an instrumental variable estimate of the equation of interest (above)?
(exercise from Wooldridge text)
Quarter of Birth: An Example of an IV for Schooling (Angrist and Krueger, QJE 1991)
• Why might this work?
o States only allow kids to start school after they turn 6.
o For exposition pretend school starts Sept 1.
o So, kids turning 6 from Sept 2 to Dec 31 can’t join until the following year. But kids turning 6 on Sept 1 or before are able to enroll in school.
o But compulsory schooling laws require kids to stay in school until their 16th birthday
o So… kids with Sept 2 to December birthdays end up spending less time in school than kids born on or before Sept 1 → quarter of birth is predictive of years of schooling
o Meanwhile, it is hard to imagine that quarter of birth affects earnings for any reason besides completed schooling (or does it?)
Let’s Add Covariates to the Model
• Important because maybe people born in the South are more likely to give birth later in the year than people in the North (I made this up) and region of birth is correlated with earnings. No problem: just control for region/state of birth
$S_i = X_i' \pi_{10} + \pi_{11} Z_i + \xi_{1i}$
$Y_i = X_i' \pi_{20} + \pi_{21} Z_i + \xi_{2i}$
• The coefficient on Z in the first equation is the first stage and the coefficient on Z in the second equation is the reduced form
$\rho = \frac{\pi_{21}}{\pi_{11}}$
Two-Stage Least Squares (2SLS)
First Stage: Estimate predicted schooling using Z (and the other covariates)
$S_i = \pi_0 + \pi_1 Z_i + e_i$
The fitted value from this regression is $\hat{S}_i$ (“S-hat”).
Second Stage: Plug that predicted schooling into the equation of interest (structural equation). The estimated coefficient on the predicted schooling will be the estimated ρ
$Y_i = \alpha + \rho S_i + u_i$
How To Think About This
2SLS retains only the variation in S that is generated by variation in the instrument Z. This variation is not correlated with ability, and so we can consistently estimate the effect of schooling on earnings.
If I use S in the regression of interest: many things drive S, including A and other unobservables.
If I use S_hat (predicted from the instrument): ALL the variation in S_hat is driven by the instrument, so we “break the link” between schooling and unobservables.
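The two-stage logic can be sketched on simulated data to show that regressing Y on S-hat gives the same point estimate as the ratio of reduced form to first stage. The data-generating process below is a hypothetical illustration (only the true slope 3.7 echoes the lecture's simulation), and the helper `ols` is made up for this example. The sketch is for intuition only: the standard errors a plain second-stage OLS would report are not valid, so real analyses should rely on a dedicated IV routine.

```python
import random


def ols(x, y):
    """Return (intercept, slope) from a bivariate OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


random.seed(3)
z, s, y = [], [], []
for _ in range(2000):
    zi = random.gauss(0, 1)                          # instrument
    luck = random.gauss(0, 3)                        # unobservable that also drives S
    si = 12 + zi + 0.5 * luck + random.gauss(0, 1)
    z.append(zi)
    s.append(si)
    y.append(2 + 3.7 * si + luck)

pi0, pi1 = ols(z, s)                       # first stage: S on Z
s_hat = [pi0 + pi1 * zi for zi in z]       # "S-hat": variation in S driven by Z only
_, rho_two_step = ols(s_hat, y)            # second stage: Y on S-hat
_, reduced_form = ols(z, y)                # reduced form: Y on Z
rho_ratio = reduced_form / pi1
print(rho_two_step, rho_ratio)
```

The two numbers agree up to floating-point error, since S-hat is just a rescaled copy of Z.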
The Wald Estimator
• (Named after Abraham Wald, the Hungarian-born statistician also known for his WWII work on where to add armor to aircraft)
• Simplest IV estimator: a single dummy instrument, with one endogenous variable and no covariates
• Not so useful in practice but helps to think about intuition
$\rho = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]}$
• The only reason for any relationship between Z and Y is that Z affects S, so if the numerator is nonzero, it must be because of S. The denominator is just for rescaling so we can answer the question of how much S affects Y.
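As a worked illustration of the formula, the sketch below computes the Wald estimator from group means. The six observations are hypothetical toy values picked so the arithmetic is easy to follow, not the quarter-of-birth data, and the function name `wald` is made up for this example.

```python
def wald(y, s, z):
    """Difference in mean Y by Z divided by difference in mean S by Z."""
    def mean(values):
        return sum(values) / len(values)

    y1 = mean([yi for yi, zi in zip(y, z) if zi == 1])
    y0 = mean([yi for yi, zi in zip(y, z) if zi == 0])
    s1 = mean([si for si, zi in zip(s, z) if zi == 1])
    s0 = mean([si for si, zi in zip(s, z) if zi == 0])
    return (y1 - y0) / (s1 - s0)


# Hypothetical toy data: z is the dummy instrument.
z = [1, 1, 1, 0, 0, 0]
s = [12, 11, 13, 11, 10, 12]      # mean schooling: 12 when z=1, 11 when z=0
y = [50, 46, 54, 46, 42, 50]      # mean earnings: 50 when z=1, 46 when z=0

rho_wald = wald(y, s, z)
print(rho_wald)
```

The instrument raises mean earnings by 4 and mean schooling by 1, so the implied return to a year of schooling is 4 / 1 = 4.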
Example with Quarter of Birth
[Table with columns: 1st Quarter, 4th Quarter, Difference.]
Compute the Wald estimator comparing Q1 to Q4
Exercises
• Suppose you want to test whether girls who attend a girls’ high school do better in math than girls in coed (mixed) schools. You have a sample of high school girls, and score is a standardized math test. Girlhs = attends an all-girls school
1. What other factors would you control for in the equation? (Be realistic about things that are in the data.)
2. Write a regression equation for #1.
3. Suppose parental support and motivation are unobserved factors in the error term in #2. Are they likely correlated with girlhs? Explain.
4. Discuss the assumptions needed for the number of girls’ high schools within a 20 km radius of a girl’s home to be a valid IV for girlhs
Back to the Basics
• A good instrument must
1. Be correlated with the endogenous right hand side variable
• We also want the first stage to be informative and strongly statistically significant (a good F test)
2. Be uncorrelated with the error term
• Good news is that we can test condition (1): this is the importance of the first stage
• Bad news is that we cannot test (2)
Very Silly and Embarrassing Mistakes
• Don’t do 2SLS by hand (getting predicted values of the endogenous variable, plugging them into the equation of interest, doing OLS) because the standard errors WILL BE WRONG
• Always put all of the controls in both the first and second stage equations!
o The first stage residual (S minus Shat) is uncorrelated by construction with all covariates in the first stage. But these first stage residuals, which are included in the error in the second stage, may be correlated with any X’s that were not in the first stage → INCONSISTENT estimates!
o What’s good enough for the second stage is good enough for the first stage!!
Forbidden Regressions
• Imagine the endogenous variable is a dummy. It is FORBIDDEN to get the predicted value for this dummy using a probit model and to plug this into the second stage equation
o WHY? Only OLS is guaranteed to produce first stage residuals which are uncorrelated with fitted values.
• If you want to examine the effect of schooling on earnings but believe the relationship is nonlinear, include S and S². But..
o Treat S and S² as two endogenous variables, so you need two instruments (the square of the original instrument is fine)
Another Example of an IV
• Draft lottery numbers
• During the Vietnam war there was a lottery for who would be required to join the army (“the draft”).
• Basically all men were given a random number based on their birthday. If the number was high, you were not drafted. If it was low, you were drafted (i.e. you were required to do military service)
• Angrist uses this to examine the relationship between military service and earnings
What does IV tell us?
• Notice that the IV estimator is the ratio of the change in Y due to a change in Z to the change in S due to a change in Z
• You can see that easily from the Wald estimator
$\rho = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]}$
• Let’s go back to the date of birth and compulsory schooling example
What does IV tell us?
• Suppose you have two types of people in the population:
o "ambitious" and "non-ambitious" people.
o Distributed evenly in the population and across AK birth groups.
o Ambitious people get more years of education.
• Who is going to respond to the "treatment" in this setting?
o “Aug 31 and before” + ambitious: will get educated anyway
o “Aug 31 and before” + non-ambitious: forced to get an additional year of schooling
o “Sept 2 and after” + ambitious: will get educated anyway (so doesn’t respond to treatment)
o “Sept 2 and after” + non-ambitious: will drop out asap
o So really the estimates are driven by the “Aug 31 and before” + non-ambitious people
• → IV produces a "LOCAL AVERAGE TREATMENT EFFECT", not necessarily the same as the treatment effect on the whole population.
Assumptions Needed to Interpret the IV Estimate as LATE
1. Independence: Instrument is as good as randomly assigned (e.g. random draft)
2. Exclusion: Instrument only affects the outcome via the endogenous right hand side variable (no other mechanism)
3. First stage: Instrument has an effect on the endogenous variable
4. Monotonicity: Instrument either has no effect on a person’s treatment or moves everyone it affects in the same direction.
• If we don’t have monotonicity, the instrument pushes some people into treatment while pushing other people out of treatment
We’ll get to what a local average treatment effect (LATE) is in a second, but first, a bit more on the assumptions...
Useful Notation
Yi(d,z): Potential outcome of person i were this person to have treatment d and IV z.
Causal effect of veteran status (serving in the army):
$Y_i(1, z) - Y_i(0, z)$
Causal effect of draft eligibility:
$Y_i(D_{1i}, 1) - Y_i(D_{0i}, 0)$
D1i: Whether person i joins the military given z=1 (draft eligible, low number)
D0i: Whether person i joins the military given z=0 (draft ineligible, high number)
What is Observed?
$D_i = D_{0i} + (D_{1i} - D_{0i}) z_i = \pi_0 + \pi_{1i} z_i + \xi_i$
$\pi_0 \equiv E[D_{0i}], \quad \pi_{1i} \equiv D_{1i} - D_{0i}$
For any individual, we only see one potential treatment: D1i or D0i (but not both). I.e., you don’t see whether person i would have joined the military if he had gotten a different draft number (z).
Average causal effect of zi on Di: $E[\pi_{1i}]$
LATE Theorem
Suppose independence, exclusion, first stage, and monotonicity hold. Then an instrument can be used to estimate the average causal effect on the affected group:
$\frac{E[Y_i \mid z_i = 1] - E[Y_i \mid z_i = 0]}{E[D_i \mid z_i = 1] - E[D_i \mid z_i = 0]} = E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}] = E[\rho_i \mid \pi_{1i} > 0]$
Examples:
o IV estimate of the effect of military service on earnings gives us the causal effect of military service on earnings for men who only served because they were drafted (wouldn’t have served otherwise)
o IV estimate of schooling on earnings (when the IV is birth month) gives us the causal effect of schooling on earnings for people who stayed in school a few extra months before dropping out because they started school younger
LATE Theorem
• Compliers: Get treated if z=1 and don’t get treated if z=0: D1i=1 and D0i=0
• Always-takers: Always get treated: D1i=D0i=1
• Never-takers: Never get treated: D1i=D0i=0
LATE is the effect of treatment on compliers.
Analogy: We want to know the effect of medicine on health in a randomized trial. Some people always take the medicine and some never take it. IV will only tell us the effect of the medicine on compliers.
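A small simulation can show the LATE result at work. Everything numeric below is a hypothetical assumption made for illustration: the population shares (50% compliers, 20% always-takers, 30% never-takers) and the treatment effects by type. Treatment follows the definitions above, and z is randomly assigned.

```python
import random

random.seed(4)
n = 40000
# Hypothetical treatment effects by type (illustration only, not from the lecture).
EFFECT = {"complier": 2.0, "always": 5.0, "never": 1.0}

rows = []  # one (y, d, z) tuple per simulated person
for _ in range(n):
    kind = random.choices(["complier", "always", "never"], weights=[0.5, 0.2, 0.3])[0]
    z = random.randint(0, 1)              # randomly assigned instrument
    d1 = 0 if kind == "never" else 1      # D1i: treatment status if z = 1
    d0 = 1 if kind == "always" else 0     # D0i: treatment status if z = 0
    d = d1 if z == 1 else d0              # only one of D1i, D0i is ever observed
    y = 10 + EFFECT[kind] * d + random.gauss(0, 1)
    rows.append((y, d, z))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


reduced_form = (mean(y for y, d, z in rows if z == 1)
                - mean(y for y, d, z in rows if z == 0))
first_stage = (mean(d for y, d, z in rows if z == 1)
               - mean(d for y, d, z in rows if z == 0))
late = reduced_form / first_stage
print(first_stage, late)
```

The first stage recovers the share of compliers (about 0.5) and the Wald ratio lands near the compliers' effect of 2.0, not the always-takers' 5.0 or the population average of 2.3.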
Treatment Effect on the Treated
• Average causal effect on compliers ≠ treatment effect on the treated
o The treated consist of compliers (with z=1) + always-takers, but the always-takers may have a different effect than compliers.
• Examples: people who take medicine no matter what may be those that benefit most from medicine. People who complete 12 years of schooling regardless of whether they’re forced to be in school may benefit most from school.
o Effect of treatment on the treated is a weighted average of the effects on compliers and always-takers (people who would go to the military regardless)
$E[Y_{1i} - Y_{0i} \mid D_i = 1]$ (effect on treated, i.e. people who serve)
$= E[Y_{1i} - Y_{0i} \mid D_{0i} = 1] \, P[D_{0i} = 1 \mid D_i = 1]$ (effect on always-takers)
$+ \, E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}] \, P[D_{1i} > D_{0i}, z_i = 1 \mid D_i = 1]$ (effect on compliers)
Average Treatment Effect• Unconditional average treatment effect is weighted average of
effect on compliers, always-takers, and never-takers
IV in Randomized Trials
• Imagine a randomized trial where no one in the control group has access to the intervention, but participation is voluntary among those assigned to treatment
• We can’t simply compare those who got the treatment to those who didn’t, because of self-selection (among those offered treatment) into who gets treated. Usually positive selection (those who take the medicine are probably healthier people).
• However, IV solves the compliance problem and estimates the effect of treatment (taking the drug) on the treated (those who actually take the drug)
Impact of Training Program
Earnings as dependent variable (men only)
Comparisons by Training Status (OLS): 3970
Comparisons by Assignment Status (ITT): 1117
Instrumental Variables: 1825
Treatment: JTPA training program. Only 60% of those assigned to training actually received the training; 2% of those assigned to the control group received training.
ITT = Intention to treat; measures the causal effect of being offered treatment. Because some of the people offered treatment didn’t receive treatment, it does not measure the causal effect of the treatment.
IV: ITT divided by the difference in compliance rates (first stage) measures the effect of treatment on the people who actually get treated. In general, this is LATE, but because there are practically no always-takers, LATE = treatment effect on the treated.
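The "ITT divided by first stage" arithmetic can be checked with the slide's own numbers. This sketch assumes that the three figures on the slide line up with the OLS, ITT and IV columns in that order, and it uses the rounded take-up rates quoted above; the variable names are made up for illustration.

```python
itt = 1117                         # ITT estimate as listed on the slide
iv_reported = 1825                 # IV estimate as listed on the slide
share_treated_if_assigned = 0.60   # rounded take-up among those assigned to training
share_treated_if_control = 0.02    # rounded take-up in the control group

# First stage: difference in take-up rates between the two assignment groups.
first_stage = share_treated_if_assigned - share_treated_if_control
iv_from_rounded = itt / first_stage

# First stage implied by the reported ITT and IV estimates.
implied_first_stage = itt / iv_reported
print(iv_from_rounded, implied_first_stage)
```

With the rounded rates the ratio comes out near 1,926, while the reported 1,825 corresponds to a first stage of about 0.61, so the percentages on the slide are presumably rounded.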
IV in Fuzzy RD
• In many applications of RD, we have imperfect compliance across the discontinuity.
• We can use whether you were above or below the threshold as an instrument for program take-up (in this case completing secondary school)
Compliers
• Different (valid) IVs for the same causal relationship can estimate different things
• Effect of schooling on earnings
o Quarter of birth IVs and compulsory schooling IVs affect the same people (potential high school dropouts) and so should have similar estimates
o Proximity of college would impact a different group of people.
o If the results are the same for both, we might conclude the effects of schooling are homogeneous… suggestive of external validity
• Effect of family size on children’s education
o IV for family size using sex ratio
o IV using twins
o These should generate different complier populations. Since we get similar results (no effect of family size), we might conclude that there really is no effect for anybody (at least in Israel)
Characterizing Compliers
• Of course we can’t see in the data who are the compliers vs. always-takers vs. never-takers (we don’t know what they would’ve done with a different z)
• But we can examine the characteristics of compliers
o Example: Relative likelihood that a complier is a college grad = first stage for college grads / first stage for all others
o In studies of the effect of family size on kids’ education,
• Twins compliers are more likely to be older (younger women probably would have had an additional child even without having had twins)
• Twins compliers are more educated while sex ratio compliers are less educated