Lecture 15-3 Cross Section and Panel (Truncated Regression, Heckman Sample Selection)

Date post: 06-Jan-2016
Upload: dan-gibbons
  • Research Method Lecture 15-3

    Truncated Regression and Heckman Sample Selection Corrections

  • Truncated regression: Truncated regression differs from censored regression in the following way:

    Censored regressions: the dependent variable may be censored, but you can still include the censored observations in the regression.

    Truncated regressions: a subset of observations is dropped entirely, so only the truncated data are available for the regression.

  • Reasons data truncation happens

    Example 1 (Truncation by survey design): The Gary negative income tax experiment data, used extensively in the economics literature, sample only those families whose income is less than 1.5 times the 1976 poverty line. In this case, families whose incomes are above that threshold are excluded from the regression by the survey design.

  • Example 2 (Incidental truncation): In the wage offer regression of married women, only those who are working have wage information. Thus, the regression cannot include women who are not working. In this case, it is the people's decision, not the surveyor's decision, that determines the sample selection.

  • When applying OLS to truncated data causes a bias: Before learning the techniques to deal with truncated data, it is important to know when applying OLS to truncated data would cause a bias.

  • Suppose that you consider the following regression:

    yi = β0 + β1xi + ui

    And suppose that you have a random sample of size N. We also assume that all the OLS assumptions are satisfied. (The most important assumption is E(ui|xi) = 0.)

  • Now, suppose that, instead of using all N observations, you select a subsample of the original sample, then run OLS using this subsample (truncated sample) only.

    Then, under what conditions would this OLS be unbiased, and under what conditions would it be biased?

  • A: Running OLS using only the selected subsample (truncated data) would not cause a bias if:

    (A-1) Sample selection is done randomly.

    (A-2) Sample selection is determined solely by the value of the x-variable. For example, suppose that x is age. Then if you select the sample only when age is greater than 20 years, this OLS is unbiased.
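Condition (A-2) can be checked with a small simulation. The following sketch is hypothetical (the data-generating values, seed, and sample size are invented, not from the lecture): selecting only observations with age above 20 leaves the OLS slope essentially at its true value.

```python
# Hypothetical check of (A-2): selection based solely on x does not bias OLS.
import random

random.seed(1)
n = 20000
xs, ys = [], []
for _ in range(n):
    x = random.uniform(18, 60)          # "age"
    u = random.gauss(0, 1)              # error, independent of x
    y = 2.0 + 0.5 * x + u               # true model: beta0 = 2, beta1 = 0.5
    if x > 20:                          # selection based solely on x
        xs.append(x)
        ys.append(y)

# OLS slope on the truncated sample
m = len(xs)
xbar = sum(xs) / m
ybar = sum(ys) / m
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
print(b1)   # close to the true slope 0.5, despite the truncation
```

Because the selection rule is a deterministic function of x alone, E(u|x, s) = E(u|x) = 0 still holds on the selected sample, which is exactly why the estimate stays unbiased.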

  • B: Running OLS using only the selected subsample (truncated data) would cause a bias if:

    (B-1) Sample selection is determined by the value of the y-variable. For example, suppose that y is family income, and further suppose that you select the sample only if y is greater than a certain threshold. Then this OLS is biased.

  • (B-2) Sample selection is correlated with ui. For example, suppose you are running the wage regression: wage = β0 + β1(educ) + u, where u contains unobserved ability. If the sample is selected based on this unobserved ability, the OLS is biased. In practice, this situation arises when selection is based on the survey participant's own decision. For example, in the wage regression, a person's decision whether to work determines whether the person is included in the data. Since that decision is likely to be based on unobserved factors contained in u, the selection is likely to be correlated with u.
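A quick numerical check of the (B-1) case (hypothetical numbers, loosely echoing the family-income example): keeping only observations with y below a threshold attenuates the OLS slope relative to the full sample.

```python
# Hypothetical simulation: selection on the y-variable (B-1) biases OLS.
import random

random.seed(2)
n = 20000
full, trunc = [], []
for _ in range(n):
    x = random.uniform(0, 16)            # "education of household head"
    u = random.gauss(0, 30)
    y = 100 + 20 * x + u                 # true slope = 20
    full.append((x, y))
    if y <= 400:                         # keep only low-income families
        trunc.append((x, y))

def ols_slope(pairs):
    m = len(pairs)
    xbar = sum(x for x, _ in pairs) / m
    ybar = sum(y for _, y in pairs) / m
    num = sum((x - xbar) * (y - ybar) for x, y in pairs)
    den = sum((x - xbar) ** 2 for x, _ in pairs)
    return num / den

print(ols_slope(full))    # near the true 20
print(ols_slope(trunc))   # noticeably smaller: attenuated by the truncation
```

Intuitively, high-x observations survive the y ≤ 400 cut only when their error u is unusually negative, which tilts the fitted line toward zero.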

  • Understanding why these conditions lead to unbiasedness or bias: Now we know the conditions under which OLS using truncated data would be biased or not. Let me explain why these conditions do or do not cause a bias. (There is some repetition in the explanations, but they are more elaborate and contain very important information, so please read them carefully.)

  • Consider the following regression: yi = β0 + β1xi + ui. Suppose that this regression satisfies all the OLS assumptions. Now, let si be a selection indicator: if si = 1, the person is included in the regression; if si = 0, the person is dropped from the data.

  • Then running OLS using the selected subsample means you run OLS using only the observations with si = 1. This is equivalent to running the following regression: siyi = β0si + β1(sixi) + siui. In this regression, sixi is the explanatory variable, and siui is the error term. The crucial condition under which this OLS is unbiased is the zero conditional mean assumption: E(siui|sixi) = 0. Thus we need to check under what conditions this is satisfied.

  • To check E(siui|sixi) = 0, it is sufficient to check that E(siui|xi, si) = 0. (If the latter is zero, the former is also zero.) Further, notice that E(siui|xi, si) = siE(ui|xi, si), since si is in the conditioning set. Thus, it is sufficient to check the condition that ensures E(ui|xi, si) = 0. To simplify the notation from now on, I drop the i-subscript, so I will check the condition under which E(u|x, s) = 0.

  • Conditions under which running OLS on the selected subsample (truncated data) is unbiased:

    (A-1) Sample selection is done randomly. In this case, s is independent of u and x. Then we have E(u|x, s) = E(u|x). But since the original regression satisfies the OLS conditions, we have E(u|x) = 0. Therefore, in this case, the OLS is unbiased.

  • (A-2) The sample is selected based solely on the value of the x-variable. For example, if x is age and you select the person if age is greater than 20, then s = 1 if x ≥ 20 and s = 0 if x < 20. Since s is a deterministic function of x, it can be dropped from the conditioning set: E(u|x, s) = E(u|x) = 0. Therefore, this OLS is also unbiased.
  • Conditions under which running OLS on the selected subsample (truncated data) is biased:

    (B-1) Sample selection is based on the value of the y-variable. For example, y is monthly family income, and you select families whose income is smaller than $500. Then s = 1 if y ≤ 500, and s = 0 if y > 500.

  • E(u|x, s=1) = E(u|x, y ≤ 500)
               = E(u|x, β0 + β1x + u ≤ 500)
               = E(u|x, u ≤ 500 − β0 − β1x)
               ≠ E(u|x)

    Thus, E(u|x, s=1) ≠ 0. Similarly, you can show that E(u|x, s=0) ≠ 0.

    Thus E(u|x, s) ≠ 0, and this OLS is biased.

    Since the set {u ≤ 500 − β0 − β1x} directly depends on u, you cannot drop it from the conditioning set. Thus the expression is not equal to E(u|x), which means that it is not equal to zero.

  • (B-2) Sample selection is correlated with ui. This happens when it is the people's decision, not the surveyor's decision, that determines the sample selection. This type of truncation is called incidental truncation. The bias that arises from this type of sample selection is called the sample selection bias. The leading example is the wage offer regression of married women: wage = β0 + β1educ + ui. When a woman decides not to work, her wage information is not available, so she is dropped from the data. Since it is the woman's own decision, this sample selection is likely to be based on some unobservable factors contained in ui.

  • For example, a woman decides to work if the wage offer is greater than her reservation wage. This reservation wage is likely to be determined by unobserved factors in u, such as unobserved ability, unobserved family background, etc. Thus the selection criterion is likely to be correlated with u, which in turn means that s is correlated with u. Mathematically, it can be shown as follows.

  • If s is correlated with u, then you cannot drop s from the conditioning set. Thus we have E(u|x, s) ≠ E(u|x).

    This means that E(u|x, s) ≠ 0. Thus, this OLS is biased.

    Again, this type of bias is called the sample selection bias.

  • A slightly more complicated case: Suppose x is IQ, and a survey participant responds to your survey if x > v, where v is a random error. In this case, the sample selection is based on the x-variable and a random error v. If you run OLS using only the truncated data, will it cause a bias?

    Answer:
    Case 1: If v is independent of u, it does not cause a bias.
    Case 2: If v is correlated with u, this is the same as case (B-2), so the OLS will be biased.
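Both cases can be demonstrated with a small hypothetical simulation (all numbers invented): the respondent is kept when x > v, with v either independent of u or built from u.

```python
# Hypothetical check of the two cases: select the respondent if x > v.
import random

random.seed(3)

def ols_slope_with_selection(v_correlated_with_u):
    xs, ys = [], []
    for _ in range(40000):
        x = random.gauss(100, 15)            # "IQ"
        u = random.gauss(0, 1)
        # Case 2 builds v from u; case 1 uses an independent random v.
        v = 100 + (5 * u if v_correlated_with_u else random.gauss(0, 5))
        if x > v:                            # selection on x and the error v
            xs.append(x)
            ys.append(1.0 + 0.1 * x + u)     # true slope = 0.1
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    num = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    den = sum((a - xbar) ** 2 for a in xs)
    return num / den

b_indep = ols_slope_with_selection(False)   # case 1: close to the true 0.1
b_corr = ols_slope_with_selection(True)     # case 2: biased away from 0.1
print(b_indep, b_corr)
```

In case 2, low-x respondents survive only when u is very negative, so E(u|x, selected) rises with x and the slope is biased upward.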

  • Estimation methods when data are truncated: When you have (B-1) type truncation, we use the truncated regression.

    When you have (B-2) type truncation (incidental truncation), we use the Heckman sample selection correction method. This is also called the Heckit model.

    I will explain these methods one by one.

  • The Truncated Regression: When the data truncation is of (B-1) type, you apply the truncated regression model. To repeat, (B-1) type truncation happens because the surveyor samples people based on the value of the y-variable.

  • Suppose that the following regression satisfies all the OLS assumptions:

    yi = β0 + β1xi + ui,  ui ~ N(0, σ²)

    But you sample an observation only if yi falls below the truncation point ci (for example, only families with income below $500).

  • [Figure: Example of (B-1) type data truncation. Scatterplot of family income per month against education of household head, with the truncation line at $500; observations above it are dropped from the data. The plot contrasts the true regression line with the biased line obtained by applying OLS to the truncated data.]

  • As can be seen, running OLS on the truncated data causes a bias.

    The model that produces unbiased estimates is based on maximum likelihood estimation.

  • The estimation method is as follows. For each observation, we can write ui = yi − β0 − β1xi. Thus, the likelihood contribution is the height of the density function. However, since we select the sample only if yi ≤ ci, the relevant density is the density of yi conditional on yi ≤ ci.

  • Thus, the likelihood contribution for the ith observation is obtained by plugging ui = yi − β0 − β1xi into the conditional density function. This is given by

    Li = (1/σ)φ[(yi − β0 − β1xi)/σ] / Φ[(ci − β0 − β1xi)/σ]

    where φ and Φ are the standard normal density and cdf. The likelihood function is given by

    L = Π Li  (the product over i = 1, …, N)

    The values of β0, β1, σ that maximize L are the estimators of the truncated regression.
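This maximum likelihood estimator can be sketched in a few lines. The simulation below is illustrative (the true parameter values, seed, and truncation point are invented, and it assumes numpy and scipy are available; it is not the Stata command used later in the lecture):

```python
# Sketch of the truncated-regression MLE on simulated data (hypothetical values).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 16, n)
y = 100.0 + 20.0 * x + rng.normal(0.0, 30.0, n)  # true: b0=100, b1=20, sigma=30
c = 400.0                                        # upper truncation point
x_t, y_t = x[y <= c], y[y <= c]                  # only the truncated data are observed

# Plain OLS on the truncated data: biased, but a reasonable starting point.
b1_ols, b0_ols = np.polyfit(x_t, y_t, 1)

def neg_loglik(theta):
    b0, b1, log_s = theta
    s = np.exp(log_s)                            # parameterize sigma > 0
    mu = b0 + b1 * x_t
    # log of (1/s) * phi((y - mu)/s) / Phi((c - mu)/s), summed over observations
    return -(norm.logpdf(y_t, mu, s) - norm.logcdf((c - mu) / s)).sum()

res = minimize(neg_loglik, x0=[b0_ols, b1_ols, np.log(y_t.std())],
               method="Nelder-Mead", options={"maxiter": 10000, "fatol": 1e-8})
b0_hat, b1_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(b1_ols, b1_hat)   # the MLE slope should sit much closer to 20 than OLS
```

Dividing the density by Φ[(c − μ)/σ] is what re-weights each observation for the fact that only y ≤ c ever enters the sample.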

  • The partial effects: The estimated β1 shows the effect of x on y. Thus, you can interpret the parameters as if they were OLS parameters.

  • Exercise: We do not have a suitable dataset for truncated regression, so let us truncate the data ourselves to check how the truncated regression works.

    EX1. Use JPSC_familyinc.dta to estimate the following model using all the observations:

    (family income) = β0 + β1(husband educ) + u

    Family income is in units of 10,000 yen.

  • EX2. Then run the OLS using only the observations whose familyinc is less than 800.

  • [Output: OLS using all the observations.]

  • [Output: OLS with observations with familyinc ≥ 800 dropped. The parameter on huseduc is biased towards zero.]

  • [Output: Truncated regression model with the upper truncation limit equal to 800. Observations with familyinc ≥ 800 are automatically dropped from this regression. The bias seems to be corrected, though not perfectly in this example.]

  • Heckman Sample Selection Bias Correction (Heckit Model): The most common reason for data truncation is the (B-2) type: incidental truncation. This truncation usually occurs because sample selection is determined by the people's decision, not the surveyor's decision. Consider the wage regression example: if a person has chosen to work, the person has self-selected into the sample; if the person has decided not to work, the person has self-selected out of the sample. The bias caused by this type of truncation is called the sample selection bias.

  • Bias correction for this type of data truncation is done by the Heckman sample selection correction method. It is also called the Heckit model. Consider the wage regression model. In Heckit, you have a wage equation and a sample selection equation:

    Wage eq: yi = xiβ + ui, with ui ~ N(0, σu²)
    Selection eq: si* = ziγ + ei, with ei ~ N(0, 1)

    such that the person works if si* > 0. That is, si = 1 if si* > 0, and si = 0 if si* ≤ 0.

  • In the above equations, I am using the following vector notation: β = (β0, β1, …, βk)ᵀ, xi = (1, xi1, xi2, …, xik), γ = (γ0, γ1, …, γm)ᵀ, and zi = (1, zi1, zi2, …, zim). We assume that xi and zi are exogenous in the sense that E(ui|xi, zi) = 0. Further, assume that xi is a strict subset of zi; that is, all the x-variables are also part of zi. For example, xi = (1, experi, agei) and zi = (1, experi, agei, kidslt6i). We require that zi contain at least one variable that is not contained in xi.

  • The structural error ui and the sample selection si are correlated only if ui and ei are correlated. In other words, the sample selection causes a bias only if ui and ei are correlated.

    Let us denote the correlation between ui and ei by ρ = corr(ui, ei).

  • The data requirement of the Heckit model is as follows:

    1. yi is available only for the observations who are currently working.

    2. However, xi and zi are available both for those who are working and for those who are not working.

  • Now, I will describe the Heckit model. First, the expected value of yi given that the person has participated in the labor force (i.e., si = 1) is written as

    E(yi|zi, si = 1) = xiβ + E(ui|ei > −ziγ, zi)

    Using a result for the bivariate normal distribution, the last term can be shown to be E(ui|ei > −ziγ, zi) = ρσu·φ(ziγ)/Φ(ziγ). The term φ(ziγ)/Φ(ziγ) is the inverse Mills ratio, λ(ziγ).

  • Thus, we have

    E(yi|zi, si = 1) = xiβ + ρσu·λ(ziγ)

    Heckman showed that the sample selection bias can be viewed as an omitted variable bias, where the omitted variable is λ(ziγ).
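For reference, the inverse Mills ratio λ(a) = φ(a)/Φ(a) is easy to compute directly. A minimal stdlib-only sketch (the function names are mine):

```python
# Inverse Mills ratio lambda(a) = phi(a) / Phi(a), computed from scratch.
import math

def phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def Phi(a):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def inv_mills(a):
    return phi(a) / Phi(a)

print(inv_mills(0.0))   # phi(0)/Phi(0) = 0.3989.../0.5 ≈ 0.798
```

Note that λ is positive and decreasing: a large index ziγ (a person very likely to be selected) yields a small correction term, which matches the intuition that selection matters least for those almost certain to appear in the sample.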

  • The important thing to note is that λ(ziγ) can be easily estimated. How? Note that the selection equation is simply a probit model of labor force participation. So, estimate the sample selection equation by probit to obtain γ̂. Then compute λ̂i = λ(ziγ̂). You can then correct the bias by including λ̂i in the wage regression and estimating the model by OLS. Heckman showed that this method corrects for the sample selection bias. This method is the Heckit model. The next slide summarizes the Heckit model.

  • Heckman Two-Step Sample Selection Correction Method (Heckit model)

    Wage eq: yi = xiβ + ui, with ui ~ N(0, σu²)
    Selection eq: si* = ziγ + ei, with ei ~ N(0, 1)

    such that the person works if si* > 0, and does not work if si* ≤ 0.

    Assumption 1: E(ui|xi, zi) = 0.
    Assumption 2: xi is a strict subset of zi.

    If ui and ei are correlated, OLS estimation of the wage equation (using only the observations who are working) is biased.

  • First step: Estimate the sample selection equation parameters γ by probit, then compute λ̂i = λ(ziγ̂).

    Second step: Plug λ̂i into the wage equation, then estimate the equation by OLS. That is, estimate:

    yi = xiβ + βλλ̂i + errori, using only the selected observations.

    In this model, βλ = ρσu is the coefficient on λ̂i. If βλ ≠ 0, then sample selection bias is present. If βλ = 0, that is evidence that sample selection bias is not present.
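The two steps above can be sketched end to end on simulated data. Everything below is illustrative: the variable names, true parameter values, and exclusion restriction are invented, and numpy/scipy stand in for the Stata commands used later.

```python
# Hypothetical two-step Heckit on simulated data: probit, then OLS with lambda-hat.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 20000
educ = rng.uniform(8, 18, n)
kids = rng.integers(0, 3, n).astype(float)      # excluded instrument (in z, not x)
# Correlated errors: rho = 0.5, sigma_u = 2, so u = rho*sigma_u*e + noise.
e = rng.normal(0, 1, n)
u = 0.5 * 2.0 * e + rng.normal(0, 2.0 * np.sqrt(1 - 0.25), n)
s = (-4.0 + 0.3 * educ - 0.8 * kids + e > 0)    # selection equation: works if s* > 0
y = 1.0 + 0.6 * educ + u                        # wage equation: true slope = 0.6

# First step: probit of s on z = (1, educ, kids), by maximum likelihood.
Z = np.column_stack([np.ones(n), educ, kids])
def probit_negll(g):
    idx = Z @ g
    return -(s * norm.logcdf(idx) + (~s) * norm.logcdf(-idx)).sum()
g_hat = minimize(probit_negll, np.zeros(3), method="BFGS").x
lam = norm.pdf(Z @ g_hat) / norm.cdf(Z @ g_hat)  # inverse Mills ratio, lambda-hat

# Second step: OLS of y on (1, educ, lambda-hat), selected observations only.
X2 = np.column_stack([np.ones(int(s.sum())), educ[s], lam[s]])
beta2, *_ = np.linalg.lstsq(X2, y[s], rcond=None)

# Naive OLS on the selected sample, for comparison (biased downward here).
X1 = np.column_stack([np.ones(int(s.sum())), educ[s]])
beta_naive, *_ = np.linalg.lstsq(X1, y[s], rcond=None)
print(beta_naive[1], beta2[1])   # Heckit slope should be much closer to 0.6
```

As the slides note, the second-step standard errors from this manual procedure are not correct; only the coefficients are.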

  • Note that when you follow this procedure exactly, you get the correct coefficients, but you do not get the correct standard errors. For the exact formula of the standard errors, consult Wooldridge (2002). Stata automatically computes the correct standard errors.

  • Exercise: Using Mroz.dta, estimate the wage offer equation using the Heckit model. The explanatory variables for the wage offer equation are educ, exper, and expersq. The explanatory variables for the sample selection equation are educ, exper, expersq, nwifeinc, age, kidslt6, and kidsge6.

  • [Output: Estimating Heckit manually (note: you will not get the correct standard errors). The first step: the probit selection equation.]

  • [Output: The second step. Note that the standard errors are not correct.]

  • [Output: Heckit estimated automatically. Note that H0: ρ = 0 cannot be rejected, so there is little evidence that sample selection bias is present.]

